Improving Classification Performance of Imbalanced Data Using SMOTE: empirical studies

Authors

DOI:

https://doi.org/10.38114/riemann.v8i1.199

Keywords:

Imbalanced data, SMOTE, SVM, Decision Tree, AdaBoost

Abstract

Data balancing methods in multi-class settings continue to evolve as the importance of balanced data for classification analysis grows. However, few studies have provided comprehensive empirical comparisons across both binary and multi-class imbalanced datasets. Data imbalance can distort model predictions, particularly by causing minority classes to be identified inaccurately. This study therefore evaluates the effectiveness of the Synthetic Minority Over-sampling Technique (SMOTE) in improving classification performance. Three benchmark datasets from the UCI Machine Learning Repository (Breast Cancer, Ecoli, and Glass) were selected to represent imbalanced classification problems in both binary and multi-class settings. The proposed framework addresses class imbalance during data preprocessing using SMOTE. Each dataset is first divided into training and testing subsets; SMOTE is applied only to the training data, while the test data is kept unchanged for evaluation. Each classifier is then trained on both the original (imbalanced) training data and the SMOTE-balanced training data. The classifiers used in this study are SVM, a decision tree, and AdaBoost, and the classification results are evaluated using accuracy, sensitivity, and F1-score. The results show that combining SMOTE with the decision tree and AdaBoost improves classification performance on imbalanced data. In particular, AdaBoost achieves the best overall performance in terms of prediction accuracy and class balance, demonstrating the effectiveness of combining SMOTE with ensemble methods for handling imbalanced datasets.
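
The sketch below illustrates the split-then-oversample workflow described in the abstract, implemented in Python with scikit-learn and imbalanced-learn. These libraries, the built-in Breast Cancer Wisconsin loader, the 70/30 stratified split, the default classifier hyperparameters, and the macro-averaged sensitivity and F1 are assumptions made for illustration only; the paper does not specify its implementation details, so this is a minimal sketch rather than the authors' actual code.

```python
# Minimal sketch of the evaluation pipeline described in the abstract (assumed tooling).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, recall_score, f1_score
from imblearn.over_sampling import SMOTE

# Stand-in imbalanced dataset; the study's UCI files may differ.
X, y = load_breast_cancer(return_X_y=True)

# Split first, then oversample the training portion only; the test set stays untouched.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)

classifiers = {
    "SVM": SVC(),
    "Decision tree": DecisionTreeClassifier(random_state=42),
    "AdaBoost": AdaBoostClassifier(random_state=42),
}

for name, clf in classifiers.items():
    # Fit on the original (imbalanced) vs. SMOTE-balanced training data and compare.
    for label, (Xtr, ytr) in {"original": (X_train, y_train),
                              "SMOTE": (X_train_bal, y_train_bal)}.items():
        y_pred = clf.fit(Xtr, ytr).predict(X_test)
        print(f"{name:13s} ({label:8s}) "
              f"acc={accuracy_score(y_test, y_pred):.3f} "
              f"sens={recall_score(y_test, y_pred, average='macro'):.3f} "
              f"F1={f1_score(y_test, y_pred, average='macro'):.3f}")
```

The same loop extends directly to the multi-class Ecoli and Glass datasets, since SMOTE oversamples each minority class and the macro-averaged metrics treat all classes equally.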


Author Biography

  • Dedi Rosadi, Xiamen University Malaysia

    Department of Mathematics

References

Bian, J., Wang, J., & Yece, Q. (2024). A novel study on power consumption of an HVAC system using CatBoost and AdaBoost algorithms combined with the metaheuristic algorithms. Energy, 302, 131841. https://doi.org/10.1016/j.energy.2024.131841 DOI: https://doi.org/10.1016/j.energy.2024.131841

Blagus, R., & Lusa, L. (2013). SMOTE for high-dimensional class-imbalanced data. BMC Bioinformatics, 14(1), 1–16. https://link.springer.com/article/10.1186/1471-2105-14-106 DOI: https://doi.org/10.1186/1471-2105-14-106

Boonamnuay, S., Kerdprasop, N., & Kerdprasop, K. (2018). Classification and regression tree with resampling for classifying imbalanced data. International Journal of Machine Learning and Computing, 8(4), 336–340. https://doi.org/10.18178/ijmlc.2018.8.4.708 DOI: https://doi.org/10.18178/ijmlc.2018.8.4.708

Cao, Y., Miao, Q., Liu, J., & Gao, L. (2013). Advance and Prospects of AdaBoost Algorithm. Acta Automatica Sinica, 39(6), 745–758. https://doi.org/10.1016/S1874-1029(13)60052-X DOI: https://doi.org/10.1016/S1874-1029(13)60052-X

Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357. https://doi.org/10.1613/jair.953 DOI: https://doi.org/10.1613/jair.953

Chawla, N. V., Cieslak, D. A., Hall, L. O., & Joshi, A. (2008). Automatically countering imbalance and its empirical relationship to cost. Data Mining and Knowledge Discovery, 17(2), 225–252. https://doi.org/10.1007/s10618-008-0087-0 DOI: https://doi.org/10.1007/s10618-008-0087-0

Chen, W., Yang, K., Yu, Z., & Zhang, W. (2022). Double-kernel based class-specific broad learning system for multiclass imbalance learning. Knowledge-Based Systems, 253, 109535. https://doi.org/10.1016/j.knosys.2022.109535 DOI: https://doi.org/10.1016/j.knosys.2022.109535

Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297. https://doi.org/10.1007/BF00994018 DOI: https://doi.org/10.1023/A:1022627411411

Domingo, C., & Watanabe, O. (2000). MadaBoost: A modification of AdaBoost. In Colt, 1, 1–26.

Fernandez, A., Garcia, S., Herrera, F., & Chawla, N. V. (2018). SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary. Journal of Artificial Intelligence Research, 61, 863–905. https://doi.org/10.1613/jair.1.11192 DOI: https://doi.org/10.1613/jair.1.11192

Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., & Herrera, F. (2012). A review on ensembles for the class imbalance problem: Bagging-, boosting-, and hybrid-based approaches. IEEE Transactions on Systems, Man and Cybernetics Part C: Applications and Reviews, 42(4), 463–484. https://doi.org/10.1109/TSMCC.2011.2161285 DOI: https://doi.org/10.1109/TSMCC.2011.2161285

Gamil, S., Zeng, F., Alrifaey, M., Asim, M., & Ahmad, N. (2024). An Efficient AdaBoost Algorithm for Enhancing Skin Cancer Detection and Classification. Algorithms, 17(8), 353. https://doi.org/10.3390/a17080353 DOI: https://doi.org/10.3390/a17080353

Han, J., Kamber, M., & Pei, J. (2011). Data Mining: Concepts and Techniques (3rd ed.). Morgan Kaufmann. https://doi.org/10.1016/C2009-0-61819-5 DOI: https://doi.org/10.1016/C2009-0-61819-5

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). Springer Series in Statistics. Springer. http://www.springerlink.com/index/D7X7KX6772HQ2135.pdf

Hothorn, T., Hornik, K., & Zeileis, A. (2015). ctree: Conditional Inference Trees. The Comprehensive R Archive Network, 1–34. https://mirrors.nics.utk.edu/cran/web/packages/partykit/vignettes/ctree.pdf

Jiang, X., Xu, Y., Ke, W., Zhang, Y., Zhu, Q.-X., & He, Y.-L. (2022). An Imbalanced Multifault Diagnosis Method Based on Bias Weights AdaBoost. IEEE Transactions on Instrumentation and Measurement, 71, 1–8. https://doi.org/10.1109/TIM.2022.3149097 DOI: https://doi.org/10.1109/TIM.2022.3149097

Johnson, R. A., & Wichern, D. W. (2002). Applied Multivariate Statistical Analysis. Prentice Hall.

Kumar, Y., Kaur, K., & Singh, G. (2020). Machine Learning Aspects and its Applications Towards Different Research Areas. 2020 International Conference on Computation, Automation and Knowledge Management (ICCAKM), 150–156. https://doi.org/10.1109/ICCAKM46823.2020.9051502 DOI: https://doi.org/10.1109/ICCAKM46823.2020.9051502

Last, F., Douzas, G., & Bacao, F. (2017). Oversampling for Imbalanced Learning Based on K-Means and SMOTE. Information Sciences, 465, 1–20. https://doi.org/10.1016/j.ins.2018.06.056 DOI: https://doi.org/10.1016/j.ins.2018.06.056

Maulud, D., & Abdulazeez, A. M. (2020). A Review on Linear Regression Comprehensive in Machine Learning. Journal of Applied Science and Technology Trends, 1(2), 140–147. https://doi.org/10.38094/jastt1457 DOI: https://doi.org/10.38094/jastt1457

Myles, A. J., Feudale, R. N., Liu, Y., Woody, N. A., & Brown, S. D. (2004). An introduction to decision tree modeling. Journal of Chemometrics, 18(6), 275–285. https://doi.org/10.1002/cem.873 DOI: https://doi.org/10.1002/cem.873

Park, J., & Sandberg, I. W. (1991). Universal Approximation Using Radial-Basis-Function Networks. Neural Computation, 3(2), 246–257. https://doi.org/10.1162/neco.1991.3.2.246 DOI: https://doi.org/10.1162/neco.1991.3.2.246

Sen, P. C., Hajra, M., & Ghosh, M. (2020). Supervised classification algorithms in machine learning: A survey and review. In Advances in Intelligent Systems and Computing (Vol. 937). Springer. https://doi.org/10.1007/978-981-13-7403-6_7 DOI: https://doi.org/10.1007/978-981-13-7403-6_7

Quinlan, J. R. (1996). Learning decision tree classifiers. ACM Computing Surveys (CSUR), 28(1), 71–72. https://doi.org/10.1145/234313.234346 DOI: https://doi.org/10.1145/234313.234346

Rezvani, S., & Wang, X. (2023). A broad review on class imbalance learning techniques. Applied Soft Computing, 143, 110415. https://doi.org/10.1016/j.asoc.2023.110415 DOI: https://doi.org/10.1016/j.asoc.2023.110415

Rosadi, D., Arisanty, D., Andriyani, W., Peiris, S., Agustina, D., Dowe, D., & Fang, Z. (2021). Improving Machine Learning Prediction of Peatlands Fire Occurrence for Unbalanced Data Using SMOTE Approach. 2021 International Conference on Data Science, Artificial Intelligence, and Business Analytics (DATABIA), 160–163. https://doi.org/10.1109/DATABIA53375.2021.9650084 DOI: https://doi.org/10.1109/DATABIA53375.2021.9650084

Tao, P., Sun, Z., & Sun, Z. (2018). An Improved Intrusion Detection Algorithm Based on GA and SVM. IEEE Access, 6, 13624–13631. https://doi.org/10.1109/ACCESS.2018.2810198 DOI: https://doi.org/10.1109/ACCESS.2018.2810198

Wang, R. (2012). AdaBoost for Feature Selection, Classification and Its Relation with SVM, A Review. Physics Procedia, 25, 800–807. https://doi.org/10.1016/j.phpro.2012.03.160 DOI: https://doi.org/10.1016/j.phpro.2012.03.160

Wang, S., Dai, Y., Shen, J., & Xuan, J. (2021). Research on expansion and classification of imbalanced data based on SMOTE algorithm. Scientific Reports, 11, 1–11. https://doi.org/10.1038/s41598-021-03430-5 DOI: https://doi.org/10.1038/s41598-021-03430-5

Yang, W., Lou, Z., & Ji, B. (2017). A multi-factor analysis model of quantitative investment based on GA and SVM. 2017 2nd International Conference on Image, Vision and Computing (ICIVC), 1152–1155. https://doi.org/10.1109/ICIVC.2017.7984734 DOI: https://doi.org/10.1109/ICIVC.2017.7984734

Published

04/19/2026

Issue

Section

Articles

How to Cite

Improving Classification Performance of Imbalanced Data Using SMOTE: empirical studies. (2026). Riemann: Research of Mathematics and Mathematics Education, 8(1), 288-300. https://doi.org/10.38114/riemann.v8i1.199
