SURV702: Analysis of Complex Survey Data

Area: Data Analysis
Credit(s)/ECTS: 2/4
Core/Elective: Elective

Instructor: Stefan Zins

Standard courses on statistical analysis assume that survey data are from a simple random sample of the target population. Little attention is given to characteristics often associated with survey data, including missing data, unequal probabilities of observation, and stratified multistage sample designs. Most standard statistical programs in software packages commonly used for data analysis (e.g., R, SAS, SPSS, and Stata) do not allow the analyst to take most of these properties of survey data into account. Failure to do so can have an important impact on the estimation and inference for all types of analyses, ranging from simple descriptive statistics to the estimation of parameters of multivariate models.
This course provides an introduction to procedures and software programs that have been developed for the analysis of complex sample survey data, in particular R. The course begins by considering the sample designs of existing surveys such as the European Social Survey. Relevant design features include weights that take into account differences in probability of selection into the sample and differences in subgroup response rates, in addition to the stratification and cluster sampling employed in the multistage sampling procedures used to select households and individuals.
The course will then move on to the introduction of variance estimation techniques that have been developed to take into account the stratification and cluster sampling that are properties of the multistage sampling designs used by most major survey programs. These will initially be discussed in terms of the estimation of sampling variances for descriptive statistics: sample means, proportions and quantiles of distributions.
The course syllabus will then turn to software and procedures for commonly used analyses, including testing for between-group differences in means and proportions, linear regression analysis for continuous dependent variables, contingency table analysis for categorical data and logistic regression for categorical responses. We will also consider the consequences of non-sampling error. The R software for data management and analysis will be used to develop course examples and exercises. Data from surveys such as the European Social Survey will be used to illustrate the various analysis procedures covered during the course.
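
As a small illustration of why selection weights matter, the sketch below compares an unweighted sample mean with a weighted one. It is written in Python purely for illustration (course work uses R). The data are hypothetical: four respondents from an oversampled group, each representing 10 population units, and two respondents from an undersampled group, each representing 40. The function name weighted_mean is likewise an assumption of this example, not part of any course material.

```python
def weighted_mean(values, weights):
    """Weighted estimate of a population mean: sum(w * y) / sum(w)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * y for y, w in zip(values, weights)) / sum(weights)


# Hypothetical toy sample (not real survey data).
# The first four respondents come from an oversampled group (weight 10 each),
# the last two from an undersampled group (weight 40 each).
values = [2, 4, 4, 6, 10, 14]
weights = [10, 10, 10, 10, 40, 40]

unweighted = sum(values) / len(values)
weighted = weighted_mean(values, weights)

print(f"unweighted mean: {unweighted:.2f}")  # 6.67
print(f"weighted mean:   {weighted:.2f}")    # 9.33
```

Treating the six responses as a simple random sample pulls the estimate toward the oversampled group. The weighted estimate restores each group's population share.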

Course objectives: 

By the end of the course, students will…

  • understand the importance of accounting for the effects of complex sample designs on estimation and inference
  • be able to identify how sample design elements impact estimation and inference
  • be able to estimate sampling error using:
      • direct estimators
      • linearization techniques
      • replication methods
  • be able to account for complex sample designs in:
      • descriptive analysis for continuous variables
      • categorical data analysis
      • linear regression
      • logistic regression
  • be able to use R statistical software to account for the effects of complex sample designs
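
To give a flavor of the replication methods named in the objectives, here is a minimal sketch of a delete-one-cluster jackknife for the variance of a weighted mean. It assumes a single stratum and treats each cluster (primary sampling unit, PSU) as the unit that is dropped. The data, the PSU labels, and the function names are hypothetical, and Python is used only for illustration since course work uses R.

```python
import math


def weighted_mean(values, weights):
    """Weighted estimate of a population mean: sum(w * y) / sum(w)."""
    return sum(w * y for y, w in zip(values, weights)) / sum(weights)


def jackknife_variance(values, weights, psu_ids):
    """Delete-one-PSU jackknife variance of the weighted mean (single stratum).

    Each replicate drops one PSU and recomputes the estimate. The usual
    rescaling of the remaining weights by n / (n - 1) cancels in a mean,
    so it is omitted here. Deviations are centered on the full-sample estimate.
    """
    psus = sorted(set(psu_ids))
    n = len(psus)
    if n < 2:
        raise ValueError("at least two PSUs are needed")
    full = weighted_mean(values, weights)
    squared_dev = 0.0
    for dropped in psus:
        keep = [i for i, p in enumerate(psu_ids) if p != dropped]
        replicate = weighted_mean([values[i] for i in keep],
                                  [weights[i] for i in keep])
        squared_dev += (replicate - full) ** 2
    return (n - 1) / n * squared_dev


# Hypothetical toy sample: three clusters with two respondents each.
values = [2, 4, 4, 6, 10, 14]
weights = [10, 10, 10, 10, 40, 40]
psu_ids = ["a", "a", "b", "b", "c", "c"]

variance = jackknife_variance(values, weights, psu_ids)
print("jackknife standard error:", math.sqrt(variance))
```

With equal weights and one respondent per cluster, this estimator reduces to the familiar s²/n for a sample mean, which is a convenient sanity check.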

Grading:

Grading will be based on three criteria:

  • Class participation (20%)
  • Online Discussion Posts (20%)
  • Completion of four homework assignments (60%)

Grades will be assigned on the following scale:

A+ 100 - 97

A 96 - 93 

A- 92 - 90 

B+ 89 - 87 

B 86 - 83 

B- 82 - 80 

Etc.

The homework assignments and online discussions are described in more detail below. Due dates for assignments are indicated in the syllabus. Late assignments will not be accepted without prior arrangement with the instructor.

Prerequisites: 

The prerequisites for SURV702 include one or more graduate courses in statistics covering techniques through OLS and logistic regression, a course in applied sampling methods (e.g. SURV625), or permission of the instructor. The course is presented at a moderately advanced statistical level. Although the course will review the fundamentals of statistical analysis methods for survey data and provide detailed examples on the use of statistical software, it will be assumed that the students are familiar with statistical methods, including multiple regression and logistic regression. The initial lectures in the course syllabus will review the various complex features of sample designs and how they influence estimation and inference based on survey data. The course syllabus and level of instruction also assume that students are familiar with basic sampling procedures, including simple random sampling, stratification, cluster sampling and multi-stage sample designs. Students who do not have graduate-level training in sampling techniques should expect to devote additional time during the first weeks of the course to supplemental readings on this topic.

Readings:

Särndal, C. E., Swensson, B., & Wretman, J. (2003). Model Assisted Survey Sampling. Springer Science & Business Media. (MASS)

Lohr, S. L. (2019). Sampling: Design and Analysis. CRC Press. (DA)

A third textbook mainly covers applications using R:

Lumley, T. (2010). Complex Surveys: A Guide to Analysis Using R. New York: John Wiley & Sons. (CS)

Weekly online meetings & assignments:

  • Week 1: Sampling Theory
  • Week 2: Variance estimation techniques (Assignment 1)
  • Week 3: Survey Nonresponse (Quiz 1)
  • Week 4: Inference for Totals, Means and Quantiles I (Assignment 2)
  • Week 5: Inference for Totals, Means and Quantiles II (Quiz 2)
  • Week 6: Analyzing Categorical Survey Data (Assignment 3)
  • Week 7: Linear Regression with Survey Data (Assignment 4)
  • Week 8: Generalized Linear Regression with Survey Data
  • Final Exam

Course Dates

2020

Summer Term (June – August)

2021

Fall Semester (September – December)