SURV667: Introduction to Record Linkage with Big Data Applications

Data Curation/Storage, Data Generating Process

The course will provide an introduction to record linkage: it will address methods to combine data on given entities (people, households, firms etc.) that are stored in different data sources. By showing the strengths of these methods and by providing numerous practical examples ranging from linked survey and administrative data to Big Data applications, the course will demonstrate the various benefits of record linkage. The participants will also learn about potential pitfalls record linkage projects may face.

The course schedule follows a prototypical record linkage process:

  • the need for common identifiers (e.g., names, addresses, birth dates) and the importance of assuring high data quality as early as the planning phase of each project.
  • preparation of these identifiers before the actual linkage.
  • ways to increase the efficiency of the matching step (different blocking techniques).
  • alternative ways of conducting the matching step, namely rule-based, distance-based and probabilistic record linkage.
  • methods of privacy-preserving record linkage, since data protection requirements are an important issue in many applications.
  • the multitude of suitable software products and their specific capabilities in dealing with record linkage problems.
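
As a rough illustration of how the preparation, blocking, and matching steps listed above fit together, here is a minimal Python sketch. It is not course material: the toy records, the function names (`standardize`, `block_key`, `link`), the choice of birth year as the blocking key, and the 0.85 similarity threshold are all assumptions made for this example, and the standard library's `difflib.SequenceMatcher` stands in for whichever string distance a real project would choose.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical toy records: (record id, name, birth date).
survey = [
    ("s1", "  Anna  Schmidt ", "1980-03-12"),
    ("s2", "John O'Neil", "1975-11-02"),
]
admin = [
    ("a1", "anna schmitt", "1980-03-12"),
    ("a2", "JOHN ONEIL", "1975-11-02"),
    ("a3", "Maria Lopez", "1980-07-30"),
]


def standardize(name):
    """Pre-processing: lowercase, drop non-letters, collapse whitespace."""
    kept = "".join(ch for ch in name.lower() if ch.isalpha() or ch.isspace())
    return " ".join(kept.split())


def block_key(record):
    """Blocking: only records sharing a birth year are compared (assumed key)."""
    return record[2][:4]


def similarity(a, b):
    """Distance-based comparison: similarity ratio between 0 and 1."""
    return SequenceMatcher(None, a, b).ratio()


def link(left, right, threshold=0.85):
    """Return (left id, right id, score) for pairs at or above the threshold."""
    blocks = defaultdict(list)
    for rec in right:
        blocks[block_key(rec)].append(rec)

    links = []
    for l_rec in left:
        # Compare only within the same block instead of against all records.
        for r_rec in blocks.get(block_key(l_rec), []):
            score = similarity(standardize(l_rec[1]), standardize(r_rec[1]))
            if score >= threshold:
                links.append((l_rec[0], r_rec[0], round(score, 2)))
    return links


links = link(survey, admin)
for left_id, right_id, score in links:
    print(left_id, right_id, score)
```

With these toy records, s1 is linked to a1 despite the spelling difference and s2 to a2 after the apostrophe and case are normalized, while a3 shares a block with s1 but falls below the threshold. A probabilistic approach would replace the single threshold with agreement weights estimated per identifier.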

Numerous practical examples will give participants an opportunity to develop and discuss their own ideas for promising record linkage projects. By the end of the course, participants will be able to assess the feasibility of, plan, and manage record linkage projects, as well as to perform each step of an actual linkage process.

Course objectives: 

By the end of the course, students will…

  • be familiar with a host of record linkage applications from different countries or jurisdictions that link a variety of data sources and use different types of linkage
  • know how to improve the quality of linkage identifiers by applying pre-processing routines
  • be familiar with different methods of increasing the efficiency of record linkage
  • be able to understand, select and apply appropriate record linkage methods (e.g., deterministic and probabilistic linkage)
  • be familiar with packages for record linkage in R
  • be able to evaluate the success of data linkage

Grading will be based on:

  • 3 online quizzes (worth 10% each, 30% in total)
  • Participation in the weekly online meetings (20% of grade): engagement in discussions during the meetings and submission of questions in the forum on the course website (deadline: Monday, 1:00 PM EST/7:00 PM CET, i.e. 24 hours before the online meeting)
  • 2 homework assignments (worth 25% each, 50% in total)

Assignment due dates are indicated in the syllabus. Extensions will be granted sparingly and only by prior arrangement with the instructors.


Students should have knowledge of basic statistical concepts. They should have an advanced knowledge of R or Stata. A basic understanding of regular expressions is useful but not strictly required.

Course syllabus: 

Course Dates


Winter Term (December – February)

