SURV735: Data Confidentiality and Statistical Disclosure Control

Data Output/Access


Instructor: Jörg Drechsler

Statistical agencies and other data collecting institutions constantly face the dilemma between providing broad access to their data and maintaining the confidentiality of the individuals included in the collected data. To address this trade-off, various statistical disclosure control (SDC) methods have been developed that help to ensure that no sensitive information can be disclosed based on the disseminated data. However, applying these methods usually comes at the price of information loss or potentially biased inferences based on the protected data.

This course will introduce the data protection strategies that are commonly used by statistical agencies and discuss their advantages and limitations. We will also briefly look at the computer science perspective on data privacy, discuss how it differs from the SDC perspective, and consider what the SDC community could learn from the approaches developed in computer science. The main part of the course will focus on a relatively new approach to statistical disclosure control that has recently been implemented successfully for some data products: generating synthetic data. With this approach, statistical models are fitted to the original data, and draws from these models are released instead of the original data. If the synthesis models are selected carefully, most of the relationships found in the original data are preserved.
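To make the idea of "fit a model, release draws from it" concrete, here is a minimal sketch in Python using only the standard library. It is not taken from the course materials: the toy variables `age` and `income`, the helper names `fit_linear` and `synthesize`, and the choice of a simple normal linear regression are all assumptions made for illustration. The sketch also plugs in the point estimates of the model parameters, whereas a full multiply imputed synthesis would typically draw the parameters from their posterior distribution before each synthetic dataset is generated.

```python
import random
import statistics


def fit_linear(x, y):
    """Least-squares fit of y = a + b*x; returns (a, b, residual_sd)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    sd = (sum(r * r for r in resid) / (len(x) - 2)) ** 0.5
    return a, b, sd


def synthesize(x, a, b, sd, rng):
    """Draw one synthetic y per record from the fitted model (plug-in estimates)."""
    return [rng.gauss(a + b * xi, sd) for xi in x]


rng = random.Random(735)

# Toy "original" data (hypothetical): income depends linearly on age plus noise.
age = [rng.uniform(20, 65) for _ in range(500)]
income = [1000 + 50 * ai + rng.gauss(0, 300) for ai in age]

# Step 1: fit the synthesis model to the original data.
a, b, sd = fit_linear(age, income)

# Step 2: release m synthetic copies of the sensitive variable
# instead of the original values (age is left unchanged here).
m = 5
synthetic_incomes = [synthesize(age, a, b, sd, rng) for _ in range(m)]
```

In this sketch only `income` is replaced, which corresponds to synthesizing a single sensitive variable while keeping the other variable as collected. Refitting the regression on any of the synthetic copies should recover roughly the same slope as the original data, which is the sense in which carefully chosen synthesis models preserve relationships.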

You will learn about the general idea of synthetic data and the two main approaches for generating synthetic datasets. The close relationship to multiple imputation for nonresponse will also be discussed.

The quality of the synthetic data crucially depends on the quality of the models used for generating the data. Thus, the course will present various parametric and nonparametric modeling strategies in great detail.

The quality needs to be evaluated in two dimensions: (i) How well is the analytical validity preserved, i.e. how close are analysis results based on the synthetic data to results obtained from the original data? (ii) What is the remaining risk of disclosure for the released data?

Several strategies to measure these two dimensions will be introduced. All steps of the synthesis process, from generating the data, through analyzing the data, to evaluating the analytical validity and disclosure risk, will be illustrated using simulated and real data examples in R.
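As an illustration of dimension (i), one measure that appears in the utility literature is the confidence interval overlap: it compares the interval for an estimate computed from the original data with the interval computed from the synthetic data. The syllabus does not state which measures the course covers, so the following Python sketch is only an assumed example, and the function name `ci_overlap` is hypothetical.

```python
def ci_overlap(ci_orig, ci_syn):
    """Average relative overlap of two confidence intervals.

    Each interval is a (lower, upper) tuple. The result is 1.0 for
    identical intervals and becomes negative when they do not intersect.
    """
    lo_o, up_o = ci_orig
    lo_s, up_s = ci_syn
    # Length of the intersection (negative if the intervals are disjoint).
    inter = min(up_o, up_s) - max(lo_o, lo_s)
    return 0.5 * (inter / (up_o - lo_o) + inter / (up_s - lo_s))


print(ci_overlap((0, 10), (5, 15)))   # 0.5
print(ci_overlap((0, 10), (0, 10)))   # 1.0
```

Values close to 1 suggest that an analyst would draw similar conclusions from the synthetic and the original data for that particular estimate; the measure says nothing about dimension (ii), the disclosure risk.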

Course objectives: 

By the end of the course, participants will

  • know which measures are typically taken by statistical agencies to guarantee confidentiality for the survey respondents if data are disseminated to the public.
  • be aware of potential limitations of these measures.
  • have a practical understanding of the concept of synthetic data.
  • be able to judge in which situations the approach could be useful.
  • know how to generate synthetic data from their own data.
  • have a number of tools available to evaluate the analytical validity of the synthetic datasets.
  • know how to assess the disclosure risk of the generated data.

Grading will be based on:

  • Two quizzes (15% of grade)
  • Participation in the weekly online meetings, engagement in discussions during the meetings and/or submission of questions via e-mail (10% of grade)
  • Three homework assignments (45% of grade)
  • A final online exam (30% of grade)

Assignment due dates are indicated in the syllabus. There will be a grace period for late assignments (not for quizzes), but late assignments will be penalized according to the following rules:

  • 1 day late: 10% off
  • 2 days late: 25% off
  • 3 days late: 50% off
  • 4+ days late: no credit

The statistical software R will be used for illustrations and for (some of) the homework assignments, so knowledge of R is required to complete the assignments. Some background in general linear modelling is expected. Familiarity with the concepts of Bayesian statistics is helpful but not required.


Reiter, J. P. (2012). Statistical approaches to protecting confidentiality for microdata and their effects on the quality of statistical inferences. Public Opinion Quarterly 76, 163–181.

Kinney, S. K., Reiter, J. P., Reznek, A. P., Miranda, J., Jarmin, R. S., and Abowd, J. M. (2011). Towards unrestricted public use business microdata: The synthetic Longitudinal Business Database. International Statistical Review 79, 363–384.

Rubin, D. B. (1993). Discussion: Statistical disclosure limitation. Journal of Official Statistics 9, 462–468.

Raghunathan, T. E., Reiter, J. P., and Rubin, D. B. (2003). Multiple imputation for statistical disclosure limitation. Journal of Official Statistics 19, 1–16.

Reiter, J. P. (2003). Inference for partially synthetic, public use microdata sets. Survey Methodology 29, 181–189.

Raghunathan, T. E., Lepkowski, J. M., van Hoewyk, J., and Solenberger, P. (2001). A multivariate technique for multiply imputing missing values using a series of regression models. Survey Methodology 27, 85–96.

Reiter, J. P. (2005). Using CART to generate partially synthetic, public use microdata. Journal of Official Statistics 21, 441–462.

Schenker, N., Raghunathan, T. E., Chiu, P. L., Makuc, D. M., Zhang, G., and Cohen, A. J. (2006). Multiple imputation of missing income data in the National Health Interview Survey. Journal of the American Statistical Association 101, 924–933.

Karr, A. F., Kohnen, C. N., Oganian, A., Reiter, J. P., and Sanil, A. P. (2006). A framework for evaluating the utility of data altered to protect confidentiality. The American Statistician 60, 224–232.

Abayomi, K., Gelman, A., and Levy, M. (2008). Diagnostics for multivariate imputations. Journal of the Royal Statistical Society, Series C 57, 273–291.

Drechsler, J. and Reiter, J. P. (2008). Accounting for intruder uncertainty due to sampling when estimating identification disclosure risks in partially synthetic data. In J. Domingo-Ferrer and Y. Saygin, eds., Privacy in Statistical Databases, 227–238. New York: Springer.

Reiter, J. P. and Mitra, R. (2009). Estimating risks of identification disclosure in partially synthetic data. Journal of Privacy and Confidentiality, 1(1), Article 6.

Weekly online meetings & assignments:

  • Week 1: A Brief History of Data Confidentiality & Traditional Approaches for Data Protection
  • Week 2: The Computer Science Perspective on Data Privacy & Introduction to Multiply Imputed Synthetic Datasets (Assignment 1)
  • Week 3: Analyzing Synthetic Datasets & Relationship to Multiple Imputation for Nonresponse (Quiz 1)
  • Week 4: Synthesis Models Part I: Univariate and Linear Regression Models (Assignment 2)
  • Week 5: Synthesis Models Part II (Models for Categorical Variables and Nonparametric Models) & Modeling Strategies (Quiz 2) 
  • Week 6: Analytical Validity & Disclosure Risk Part I (Theory) (Assignment 3)
  • Week 7: Disclosure Risk Part II (Examples in R) & Discussion of the Chances and Obstacles of the Synthetic Data Approach
  • Week 8: Discussion of the Third Homework Assignment 
  • Final Exam

Course Dates


Spring Semester (January – May)