SURV667: Introduction to Record Linkage with Big Data Applications

Data Curation/Storage, Data Generating Process


Instructors: Manfred Antoni, Stefan Bender, Christian Borgs, Joseph W. Sakshaug

The demand for using different data sets in a “combined way” to analyse research questions is increasing. This is where record linkage comes into play as the common technique for integrating separate data sets.

The course will provide an introduction to record linkage: it will address methods to combine data on given entities (people, households, firms etc.) that are stored in different data sources. By showing the strengths of these methods and by providing numerous practical examples ranging from linked survey and administrative data to Big Data applications, the course will demonstrate the various benefits of record linkage. Participants will also learn about potential challenges record linkage projects may face.

The schedule of the course will follow a prototypical record linkage process:

  • the need for common identifiers (e.g., names, addresses, birth dates) and the importance of assuring high data quality even during the planning phase of each project,
  • preparation of these identifiers before the actual linkage,
  • increasing the efficiency of the matching step (different blocking techniques),
  • alternative ways of conducting the comparison step, namely rule-based, distance-based and probabilistic record linkage,
  • methods of privacy-preserving record linkage, since data protection requirements are an important issue in many applications,
  • evaluation and visualization of different quality aspects of the linkage result.
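The first steps of this process can be sketched in a few lines. The example below is only a minimal illustration under stated assumptions, not course material: it is written in Python for brevity (the course itself uses R), the toy records, the helper names `standardize`, `block_key`, `similarity` and `link`, and the threshold of 0.85 are all hypothetical, and the similarity measure from Python's standard `difflib` module stands in for whichever string comparator a real project would choose.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical toy records (id, name, birth date); not real course data.
survey = [
    ("s1", "Anna  Meier", "1980-04-12"),
    ("s2", "Jon Smith", "1975-11-02"),
]
admin = [
    ("a1", "MEIER, ANNA", "1980-04-12"),
    ("a2", "John Smith", "1975-11-02"),
    ("a3", "Joan Smits", "1991-01-30"),
]


def standardize(name):
    """Pre-processing: lower-case, drop punctuation, sort the name tokens."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in name.lower())
    return " ".join(sorted(cleaned.split()))


def block_key(record):
    """Blocking: only records that share a birth year are compared."""
    return record[2][:4]


def similarity(a, b):
    """Distance-based comparison: similarity ratio between 0 and 1."""
    return SequenceMatcher(None, standardize(a), standardize(b)).ratio()


def link(left, right, threshold=0.85):
    """Link pairs with equal birth date and name similarity >= threshold."""
    # Index the right-hand file by block so that each left-hand record
    # is compared only with candidates from its own block.
    blocks = defaultdict(list)
    for r in right:
        blocks[block_key(r)].append(r)

    links = []
    for l in left:
        for r in blocks.get(block_key(l), []):
            if l[2] == r[2] and similarity(l[1], r[1]) >= threshold:
                links.append((l[0], r[0]))
    return links


links = link(survey, admin)
print(links)
```

In this sketch, sorting the name tokens lets "MEIER, ANNA" and "Anna  Meier" standardize to the same string, and the blocking index means the record born in 1991 is never compared with anyone. A probabilistic approach would replace the fixed threshold rule with weights estimated from the data.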

Numerous practical examples will give participants an opportunity to create and discuss their own ideas for promising record linkage projects. By the end of the course participants will be able to assess the feasibility of, plan and manage record linkage projects as well as to perform each step along the linkage process using the R software.

Course objectives: 

By the end of the course, students will…

  • be familiar with a host of record linkage applications from different countries or jurisdictions that link a variety of data sources and use different types of linkage
  • know how to improve the quality of linkage identifiers by applying pre-processing routines
  • be familiar with different methods of increasing the efficiency of record linkage
  • be able to understand, select and apply appropriate record linkage methods (e.g., deterministic and probabilistic linkage)
  • be able to evaluate the success of data linkage
  • be able to perform each step in the record linkage process using the R software

Grading will be based on:

  • 3 online quizzes (worth 10% each, 30% in total)
  • Participation in the weekly online meetings (20% of grade): engagement in discussions during the meetings and submission of questions in the forum on the course website (deadline: Tuesday, 1:00 PM EST/7:00 PM CET, i.e. 24 hours before each online meeting)
  • 3 homework assignments (worth 50% in total)

Assignment due dates are indicated in the syllabus. Extensions will be granted sparingly and only by prior arrangement with the instructors.


Students should have knowledge of basic statistical concepts and an intermediate knowledge of R. Familiarity with regular expressions and the R packages ggplot2 and tidyverse is useful but not required.


Christen, Peter (2012). Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Berlin: Springer.

Bender, S., Jarmin, R., Kreuter, F. and Lane, J. (2017): Privacy and Confidentiality. In: Foster, I., Ghani, R., Jarmin, R. S., Kreuter, F. and Lane, J. (eds.): Big Data and Social Science – A Practical Guide to Methods and Tools, Chapter 11, pp. 299-312, Chapman & Hall.

Ghani, R. and Schierholz, M. (2017): Machine Learning. In: Foster, I., Ghani, R., Jarmin, R. S., Kreuter, F. and Lane, J. (eds.): Big Data and Social Science – A Practical Guide to Methods and Tools, Chapter 6, pp. 147-186, Chapman & Hall.

Schnell, R. and Rukasz, D. (2019): PPRL: Privacy Preserving Record Linkage.

Schnell, R., Bachteler, T. and Reiher, J. (2009): Privacy-preserving Record Linkage Using Bloom Filters. BMC Medical Informatics and Decision Making 9 (41).

Vatsalan, D., Christen, P. and Verykios, V. S. (2013): A taxonomy of privacy-preserving record linkage techniques. Journal of Information Systems.

Weekly online meetings & assignments:

  • Week 1: Introducing record linkage in the age of Big Data
  • Week 2: Collecting and pre-processing linkage identifiers & blocking techniques (Quiz 1)
  • Week 3: Data pre-processing and core concepts of data quality for linking (Homework Assignment 1)
  • Week 4: Comparison and classification of record pairs (Quiz 2)
  • Week 5: Probabilistic record linkage and blocking (application) (Homework Assignment 2)
  • Week 6: Advanced topics, software options and literature review (Quiz 3)
  • Week 7: Privacy-preserving record linkage using R 
  • Week 8: Evaluation and visualization of linkage quality (Homework Assignment 3)

Course Dates


Summer Term (June – August)

