IMPROVEMENT OF CLASSIFIER ACCURACY ON CLINICAL DATA SETS BY OPTIMAL SELECTION OF IMPUTATION METHODS
DOI:
https://doi.org/10.61841/y7qb4t23
Keywords:
Missing value imputation, cardiovascular data, Mean imputation, Group mean imputation, kNN imputation, Multi-Linear Regression Imputation, C5.0, Random Forest, Performance Measures
Abstract
Missing value imputation is one of the most important data pre-processing tasks in data mining. Most clinical datasets are incomplete, and simply removing the incomplete cases from the original datasets can bring more problems than solutions. A suitable missing value imputation method can help produce good-quality datasets for better analysis of clinical trials. In this paper we explore the use of machine learning techniques as missing value imputation methods for incomplete cardiovascular data. Mean imputation, group mean imputation, kNN imputation, and multi-linear regression imputation are used to fill in missing values, and the imputed datasets are then subjected to classification and prediction using C5.0 and random forest classifiers. The experiments show that final classifier performance improves when multi-linear regression imputation is used to predict missing attribute values for random forest, and that in most cases the machine learning techniques perform better than the standard mean imputation technique.
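To make the four imputation strategies named in the abstract concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation. The toy records, the column layout (age, cholesterol, resting blood pressure, class label), and all function names are assumptions invented for illustration. Each function estimates the single missing cholesterol value in the last record: from the overall mean, from the mean within the same class, from the k nearest complete records, and from a least-squares fit on the remaining numeric columns.

```python
import math
from statistics import mean

# Hypothetical toy records: [age, cholesterol, resting_bp, label].
# None marks a missing value; all numbers are invented for illustration.
ROWS = [
    [50.0, 200.0, 120.0, "yes"],
    [60.0, 240.0, 140.0, "yes"],
    [40.0, 180.0, 110.0, "no"],
    [30.0, 160.0, 100.0, "no"],
    [55.0, None, 130.0, "yes"],
]
TARGET = 1           # column that contains the missing value
PREDICTORS = [0, 2]  # complete numeric columns used by kNN and regression
LABEL = 3            # class label column used by group mean imputation


def complete(rows):
    """Return only the records whose target value is observed."""
    return [r for r in rows if r[TARGET] is not None]


def mean_impute(rows, row):
    """Mean imputation: average of all observed target values."""
    return mean(r[TARGET] for r in complete(rows))


def group_mean_impute(rows, row):
    """Group mean imputation: average within the record's own class."""
    return mean(r[TARGET] for r in complete(rows) if r[LABEL] == row[LABEL])


def knn_impute(rows, row, k=2):
    """kNN imputation: average over the k nearest complete records."""
    query = [row[j] for j in PREDICTORS]
    def dist(r):
        return math.dist([r[j] for j in PREDICTORS], query)
    nearest = sorted(complete(rows), key=dist)[:k]
    return mean(r[TARGET] for r in nearest)


def solve(a, b):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [a[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [x - f * y for x, y in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]


def regression_impute(rows, row):
    """Multi-linear regression imputation via the normal equations."""
    train = complete(rows)
    X = [[1.0] + [r[j] for j in PREDICTORS] for r in train]
    y = [r[TARGET] for r in train]
    cols = range(len(X[0]))
    xtx = [[sum(x[i] * x[j] for x in X) for j in cols] for i in cols]
    xty = [sum(x[i] * t for x, t in zip(X, y)) for i in cols]
    beta = solve(xtx, xty)
    features = [1.0] + [row[j] for j in PREDICTORS]
    return sum(b * v for b, v in zip(beta, features))


if __name__ == "__main__":
    incomplete = ROWS[4]
    for method in (mean_impute, group_mean_impute, knn_impute, regression_impute):
        print(method.__name__, round(method(ROWS, incomplete), 2))
```

In a pipeline like the one the abstract describes, each imputed copy of the dataset would then be passed to the classifiers and the resulting performance measures compared. That step is omitted here because it depends on the classifier library in use.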
License
Copyright (c) 2020 AUTHOR

This work is licensed under a Creative Commons Attribution 4.0 International License.