SURV726: Multiple Imputation – Why and How

Area: Data Analysis
Credit(s)/ECTS: 1/2
Core/Elective: Elective

Missing data are a common problem in surveys and can lead to biased results if the missingness is not taken into account at the analysis stage. Multiple imputation is widely accepted as the most convenient strategy for dealing with item nonresponse properly, and most statistical software packages now offer routines to multiply impute missing values. However, when the imputation process is treated as a black box that relies on the software's default settings, the cure can be worse than the disease. The main aim of the course is therefore to illustrate the usefulness (and limitations) of the approach and to enable students to develop sensible imputation strategies for item nonresponse in large-scale surveys.

The course will emphasize practical implementation and tricks for handling real data problems over detailed proofs regarding the underlying methodology, although we will provide some motivation for the analysis procedures for multiply imputed datasets and briefly touch on some of the methodological pitfalls of the approach.

The course will start by illustrating why multiple imputation should generally be preferred over standard methods that impute missing values only once (single imputation). We will also present some intuition for “Rubin’s combining rules”, which are required to obtain valid inferences from the imputed data. In the next unit we will learn about the two main strategies for multiple imputation (joint modeling and sequential regression) and discuss the pros and cons of each. Building on these two approaches, we will discuss the different modeling strategies for imputing continuous and (un)ordered categorical variables. We will also present some nonparametric alternatives. To make the imputation approach feasible in practice, the course will also cover strategies for dealing with real data problems such as logical constraints between variables or skip patterns, which are common in most questionnaires. The final section will provide insights into how to evaluate the quality of the imputed data.
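
As a rough illustration of the combining rules mentioned above, the following minimal sketch pools a point estimate and its variance across m imputed datasets. It is not taken from the course materials (the course itself uses R); the function name `pool_rubin` and the example numbers are hypothetical, and the sketch assumes a scalar estimate and omits the degrees-of-freedom adjustment used for confidence intervals.

```python
import math


def pool_rubin(estimates, variances):
    """Pool scalar estimates from m imputed datasets (hypothetical helper).

    estimates: point estimates, one per imputed dataset
    variances: their estimated sampling variances, one per imputed dataset
    Returns (pooled estimate, total variance, pooled standard error).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and matching lengths")

    # Pooled point estimate: average of the per-dataset estimates.
    q_bar = sum(estimates) / m
    # Within-imputation variance: average of the per-dataset variances.
    u_bar = sum(variances) / m
    # Between-imputation variance: sample variance of the estimates.
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    # Total variance: within + (1 + 1/m) * between.
    t = u_bar + (1 + 1 / m) * b
    return q_bar, t, math.sqrt(t)


# Made-up numbers, for illustration only.
pooled, total_var, se = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The between-imputation term is what single imputation ignores: it reflects the extra uncertainty that arises because the missing values are not actually known.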

Course objectives: 

By the end of the course, students will…

  • understand why multiple imputation should be preferred over single imputation methods in most situations
  • know about the two main approaches for multiple imputation
  • be familiar with various imputation routines for different types of variables
  • know how to implement these routines using R
  • be able to deal with various problems that typically arise when imputing large-scale surveys
  • know about various strategies to assess the quality of the generated imputations

Grading: 

Grading will be based on:

  • 2 online quizzes (worth 20% total)
  • 2 homework assignments (40% total)
  • Participation in the weekly online meetings, engagement in discussions during the meetings and/or submission of questions via e-mail (10% of grade)
  • A final online exam (30% of grade)

Due dates for the assignments are indicated in the syllabus. Late assignments will not be accepted without prior arrangement with the instructor.

Prerequisites: 

Students should be familiar with generalized linear models and basic probability theory. We also expect students to know the basic concepts for dealing with nonresponse in surveys (the difference between item and unit nonresponse, formalizing the missing data mechanism, deterministic and stochastic approaches to imputation). We highly recommend that students unfamiliar with these concepts take the course “Nonresponse and Imputation” before enrolling in this course.

Some background knowledge of Bayesian statistics and Markov chain Monte Carlo (MCMC) methods is helpful but not mandatory. The statistical software R will be used for illustrations and for (some of) the homework assignments.

Course syllabus: 

Course Dates

2017

Winter Term (December – February)

2019

Winter Term (December – February)