SURV 701: Analysis of Complex Sample Survey Data

Data Analysis

Standard courses on statistical analysis assume that survey data are from a simple random sample of the target population. Little attention is given to characteristics often associated with survey data, including missing data, unequal probabilities of observation, and stratified multistage sample designs. Most standard statistical programs in software packages commonly used for data analysis (e.g., SAS®, SPSS®, and Stata®) do not allow the analyst to take most of these properties of survey data into account. Failure to do so can have an important impact on estimation and inference for all types of analyses, ranging from simple descriptive statistics to the estimation of parameters of multivariate models.

This course provides an introduction to procedures and software programs that have been developed for the analysis of complex sample survey data. The course begins by considering the sample designs of specific surveys: the National Comorbidity Survey Replication (NCS-R), the 2005-2006 National Health and Nutrition Examination Survey (NHANES), and the 2006 Health and Retirement Study (HRS). Relevant design features of the NCS-R, NHANES and HRS include weights that take into account differences in probability of selection into the sample and differences in subgroup response rates, in addition to the stratification and cluster sampling employed in the multistage sampling procedures used to select households and individuals.

The course then introduces variance estimation techniques that have been developed to take into account the stratification and cluster sampling that are properties of the multistage sampling designs used by most major survey programs. These will initially be discussed in terms of the estimation of sampling variances for descriptive statistics: sample means, proportions and quantiles of distributions.

The course then turns to software and procedures for commonly used analyses, including testing for between-group differences in means and proportions, linear regression analysis for continuous dependent variables, contingency table analysis for categorical data and logistic regression for categorical responses, generalized linear models for ordinal and count data, survival analysis and multilevel modeling. We will also consider the consequences of nonresponse and missing data for survey analysis, along with methods for dealing with missing data.

The SAS® and R systems for data management and analysis will be used to develop course examples and exercises. Data from the NCS-R, NHANES and HRS will be used to illustrate the various analysis procedures covered during the course.


The prerequisites for SURV 701 include one or more graduate courses in statistics, a course in applied sampling methods, or permission of the instructor. The course is presented at a moderately advanced statistical level. Students are assumed to be familiar with statistical methods, including multiple regression and logistic regression. The course syllabus and level of instruction also assume that students are familiar with basic sampling procedures, including simple random sampling, stratification, cluster sampling and multistage sample designs. Students who do not have graduate-level training in sampling techniques should expect to devote additional time during the first weeks of the course to supplemental readings on this topic.

