Peer-Reviewed

Comparison of Methods of Handling Missing Data: A Case Study of KDHS 2010 Data

Received: 8 May 2015     Accepted: 18 May 2015     Published: 29 May 2015
Abstract

Missing data pose a major threat to observational and experimental studies. Analyses that ignore missingness yield estimates that are inefficient and biased. Various studies have sought to determine the best methods of dealing with missing data. The analyses in these studies involved simulating missing data from complete data. The missing data are then imputed using the various methods, and the best method is identified by examining the bias of the imputed estimates relative to the complete-data estimates and the magnitude of the standard errors. This study aimed to establish the best method of dealing with missing data based on goodness-of-fit tests. The study used data from KDHS 2010. The overall rate of missingness was about 80%. The missing data mechanism was tested and found to be missing at random (MAR). The missing data were then imputed using the Expectation Maximization (EM) algorithm and Multiple Imputation. Logistic models were then fitted to both imputed datasets. Goodness-of-fit tests were subsequently carried out to determine which of the two methods was better for imputing the data. These measures were the Akaike Information Criterion (AIC), the Root Mean Square Error of Approximation (RMSEA) and Cox and Snell’s R-squared. The predictive ability of the two models was also examined using confusion matrices and the area under the receiver operating characteristic curve (AUROC). From these tests, multiple imputation was seen to be the better method of imputation, since the logistic regression model fitted the data imputed by multiple imputation better than the data imputed using expectation maximization. From the results of the study, the researchers recommend that the type of missingness present in data be examined. If the amount of missing data is large and the data are MAR, the data should be imputed using multiple imputation before any inferences are made. The researchers suggest further research to determine the maximum rate of missing data that should be imputed.
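The abstract names several fit and prediction measures without showing how they are computed. The following is a minimal sketch of four of them (AIC, Cox and Snell's R-squared, a confusion matrix and AUROC) for a fitted logistic model, using only the Python standard library. It is not the authors' code: the function names, the 0.5 classification threshold, the parameter count and the toy outcome and probability vectors are all assumptions made for illustration and are not taken from KDHS 2010.

```python
import math

def log_likelihood(y, p):
    """Bernoulli log-likelihood of binary outcomes y given predicted probabilities p."""
    return sum(math.log(pi) if yi == 1 else math.log(1.0 - pi)
               for yi, pi in zip(y, p))

def aic(loglik, n_params):
    """Akaike Information Criterion: 2k - 2 ln L (smaller indicates better fit)."""
    return 2 * n_params - 2 * loglik

def cox_snell_r2(loglik_model, loglik_null, n):
    """Cox and Snell's R-squared: 1 - (L_null / L_model)^(2/n)."""
    return 1.0 - math.exp(2.0 * (loglik_null - loglik_model) / n)

def confusion_matrix(y, p, threshold=0.5):
    """Return (TP, FP, FN, TN), classifying p >= threshold as positive."""
    tp = fp = fn = tn = 0
    for yi, pi in zip(y, p):
        predicted = 1 if pi >= threshold else 0
        if predicted == 1 and yi == 1:
            tp += 1
        elif predicted == 1 and yi == 0:
            fp += 1
        elif predicted == 0 and yi == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def auroc(y, p):
    """Share of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

# Toy data (hypothetical): observed outcomes and fitted probabilities.
y = [1, 0, 1, 1, 0, 0]
p = [0.8, 0.3, 0.6, 0.4, 0.2, 0.5]

ll_model = log_likelihood(y, p)
# The intercept-only (null) model predicts the sample mean for every case.
p_null = sum(y) / len(y)
ll_null = log_likelihood(y, [p_null] * len(y))

print("AIC:", aic(ll_model, n_params=2))  # n_params=2 is an assumed model size
print("Cox-Snell R2:", cox_snell_r2(ll_model, ll_null, len(y)))
print("Confusion (TP, FP, FN, TN):", confusion_matrix(y, p))
print("AUROC:", auroc(y, p))
```

In a comparison like the one the abstract describes, these quantities would be computed once for the model fitted to each imputed dataset and then set side by side; a lower AIC and a higher R-squared and AUROC would favour that imputation method.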

Published in American Journal of Theoretical and Applied Statistics (Volume 4, Issue 3)
DOI 10.11648/j.ajtas.20150403.26
Page(s) 192-200
Creative Commons

This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.

Copyright

Copyright © The Author(s), 2015. Published by Science Publishing Group

Keywords

Missingness, Missing at Random, Multiple Imputation, Expectation Maximization

Cite This Article
  • APA Style

    Shelmith Nyagathiri Kariuki, Anthony Waititu Gichuhi, Anthony Kibira Wanjoya. (2015). Comparison of Methods of Handling Missing Data: A Case Study of KDHS 2010 Data. American Journal of Theoretical and Applied Statistics, 4(3), 192-200. https://doi.org/10.11648/j.ajtas.20150403.26


    ACS Style

    Shelmith Nyagathiri Kariuki; Anthony Waititu Gichuhi; Anthony Kibira Wanjoya. Comparison of Methods of Handling Missing Data: A Case Study of KDHS 2010 Data. Am. J. Theor. Appl. Stat. 2015, 4(3), 192-200. doi: 10.11648/j.ajtas.20150403.26


    AMA Style

    Shelmith Nyagathiri Kariuki, Anthony Waititu Gichuhi, Anthony Kibira Wanjoya. Comparison of Methods of Handling Missing Data: A Case Study of KDHS 2010 Data. Am J Theor Appl Stat. 2015;4(3):192-200. doi: 10.11648/j.ajtas.20150403.26


    BibTeX

    @article{10.11648/j.ajtas.20150403.26,
      author = {Shelmith Nyagathiri Kariuki and Anthony Waititu Gichuhi and Anthony Kibira Wanjoya},
      title = {Comparison of Methods of Handling Missing Data: A Case Study of KDHS 2010 Data},
      journal = {American Journal of Theoretical and Applied Statistics},
      volume = {4},
      number = {3},
      pages = {192-200},
      doi = {10.11648/j.ajtas.20150403.26},
      url = {https://doi.org/10.11648/j.ajtas.20150403.26},
      eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.ajtas.20150403.26},
     year = {2015}
    }
    


    RIS

    TY  - JOUR
    T1  - Comparison of Methods of Handling Missing Data: A Case Study of KDHS 2010 Data
    AU  - Shelmith Nyagathiri Kariuki
    AU  - Anthony Waititu Gichuhi
    AU  - Anthony Kibira Wanjoya
    Y1  - 2015/05/29
    PY  - 2015
    N1  - https://doi.org/10.11648/j.ajtas.20150403.26
    DO  - 10.11648/j.ajtas.20150403.26
    T2  - American Journal of Theoretical and Applied Statistics
    JF  - American Journal of Theoretical and Applied Statistics
    JO  - American Journal of Theoretical and Applied Statistics
    SP  - 192
    EP  - 200
    PB  - Science Publishing Group
    SN  - 2326-9006
    UR  - https://doi.org/10.11648/j.ajtas.20150403.26
    VL  - 4
    IS  - 3
    ER  - 


Author Information
  • Shelmith Nyagathiri Kariuki, Department of Statistics and Actuarial Sciences, Jomo Kenyatta University of Agriculture and Technology, Nairobi, Kenya

  • Anthony Waititu Gichuhi, Department of Statistics and Actuarial Sciences, Jomo Kenyatta University of Agriculture and Technology, Nairobi, Kenya

  • Anthony Kibira Wanjoya, Department of Statistics and Actuarial Sciences, Jomo Kenyatta University of Agriculture and Technology, Nairobi, Kenya
