SURV726: Multiple Imputation – Why and How

Area: Data Analysis, Data Curation/Storage
Credit(s)/ECTS: 1/2
Core/Elective: Elective

Apply through UMD

Instructor: Jörg Drechsler

Missing data are a common problem in surveys and can lead to biased results if the missingness is not taken into account at the analysis stage. Multiple imputation is widely accepted as the most convenient strategy for dealing with item nonresponse properly, and most statistical software packages now offer routines for multiply imputing missing values. However, when the imputation process is treated as a black box that relies on the default settings of the software, the cure can be worse than the disease. The main aim of the course is therefore to illustrate the usefulness (and limitations) of the approach and to enable students to develop sensible imputation strategies for item nonresponse in large-scale surveys.

The course will emphasize practical implementation and tricks for handling real data problems over detailed proofs regarding the underlying methodology, although we will provide some motivation for the analysis procedures for multiply imputed datasets and briefly touch on some of the methodological pitfalls of the approach.

The course will start by illustrating why multiple imputation should generally be preferred over standard methods that impute missing values only once (single imputation). We will also present some intuition for “Rubin’s combining rules”, which are required to obtain valid inferences from the imputed data. In the next unit we will learn about the two main strategies for multiple imputation – joint modeling and sequential regression – and discuss the pros and cons of the two approaches. Based on these two approaches, we will discuss different modeling strategies for imputing continuous and (un)ordered categorical variables. We will also present some nonparametric alternatives. To make the imputation approach feasible in practice, the course will also cover strategies for dealing with real data problems such as logical constraints between variables or skip patterns, which are common in most questionnaires. The final section will provide insights into how to evaluate the quality of the imputed data.
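
As a rough illustration of the combining rules mentioned above, the following minimal sketch pools the point estimates and variance estimates obtained from m imputed datasets. It is written in Python purely for illustration (the course itself uses R); the function name `pool_rubin` and the input numbers are hypothetical and not taken from the course materials.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m completed-data estimates using Rubin's combining rules.

    estimates: point estimates, one per imputed dataset
    variances: the corresponding estimated variances (squared standard errors)
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two estimates and one variance per estimate")
    q_bar = mean(estimates)      # pooled point estimate
    u_bar = mean(variances)      # within-imputation variance
    b = variance(estimates)      # between-imputation variance (divisor m - 1)
    t = u_bar + (1 + 1 / m) * b  # total variance of the pooled estimate
    return q_bar, u_bar, b, t


# Hypothetical results from m = 5 imputed datasets (made-up numbers).
estimates = [10.0, 12.0, 11.0, 13.0, 9.0]
variances = [4.0, 4.0, 4.0, 4.0, 4.0]
q_bar, u_bar, b, t = pool_rubin(estimates, variances)
print(q_bar, u_bar, b, t)
```

With these made-up inputs the pooled estimate is 11.0, the within-imputation variance is 4.0, the between-imputation variance is 2.5, and the total variance is 7.0. The total variance exceeds the average within-imputation variance because it also reflects the uncertainty due to the missing values, which single imputation ignores.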

Course objectives: 

By the end of the course, students will…

  • understand why multiple imputation should be preferred over single imputation methods in most situations
  • know about the two main approaches for multiple imputation
  • be familiar with various imputation routines for different types of variables
  • know how to implement these routines using R
  • be able to deal with various problems that typically arise when imputing large-scale surveys
  • know about various strategies to assess the quality of the generated imputations

Grading:

Grading will be based on:

  • 2 online quizzes (20% of grade in total)
  • 2 homework assignments (40% of grade in total)
  • Participation in the weekly online meetings, engagement in discussions during the meetings, and/or submission of questions via e-mail (10% of grade)
  • A final online exam (30% of grade)

Assignment due dates are indicated in the syllabus. There is a grace period for late assignments (but not for quizzes); late assignments will be penalized according to the following rules:

  • 1 day late: 10% off
  • 2 days late: 25% off
  • 3 days late: 50% off
  • 4+ days late: no credit

Prerequisites:

Students should be familiar with generalized linear models and basic probability theory. We also expect students to know the basic concepts for dealing with nonresponse in surveys (the difference between item and unit nonresponse, formalizing the missing data mechanism, deterministic and stochastic approaches to imputation). We highly recommend that students unfamiliar with these concepts enroll in the course SURV725 Item Nonresponse and Imputation before participating in this course.

Some background knowledge in Bayesian statistics and Markov chain Monte Carlo (MCMC) methods is helpful but not mandatory. The statistical software R will be used for illustrations and for (some of) the homework assignments, so basic knowledge of R is required to complete the assignments.

Readings: 

Carpenter, J. and Kenward, M. (2012). Multiple imputation and its application. New York: John Wiley & Sons.

Schafer, J. L. (1999). Multiple imputation: a primer. Statistical methods in medical research 8, 3-15.

Raghunathan, T.E., Lepkowski, J.M., van Hoewyk, J., and Solenberger, P. (2001). A multivariate technique for multiply imputing missing values using a series of regression models. Survey Methodology 27, 85-96.

Kropko, J., Goodrich, B., Gelman, A., and Hill, J. (2014). Multiple imputation for continuous and categorical data: Comparing joint multivariate normal and conditional approaches. Political Analysis 22, 497-519.

Abayomi, K., Gelman, A., and Levy, M. (2008). Diagnostics for multivariate imputations. Journal of the Royal Statistical Society, Series C 57, 273-291.

Weekly online meetings & assignments:

  • Week 1: MI Intro & MI Analysis 
  • Week 2: MI for Continuous Variables 
  • Week 3: MI for Categorical Variables and Nonparametric Alternatives
  • Week 4: Modeling Strategies and Quality Evaluations

Course Dates

2020

Spring Semester (January – May)

2022

Fall Semester (September – December)