Instructors: Manfred Antoni, Stefan Bender, Christian Borgs, Joseph W. Sakshaug
The demand for using different data sets in a “combined way” to analyse research questions is increasing. This is where record linkage comes into play as the common technique for integrating separate data sets.
The course will provide an introduction to record linkage: it will address methods to combine data on given entities (people, households, firms etc.) that are stored in different data sources. By showing the strengths of these methods and by providing numerous practical examples ranging from linked survey and administrative data to Big Data applications, the course will demonstrate the various benefits of record linkage. Participants will also learn about potential challenges record linkage projects may face.
The schedule of the course will follow a prototypical record linkage process:
Numerous practical examples will give participants an opportunity to create and discuss their own ideas for promising record linkage projects. By the end of the course, participants will be able to assess the feasibility of, plan, and manage record linkage projects, and to perform each step of the linkage process using the R software.
By the end of the course, students will…
Grading will be based on:
Due dates for assignments are indicated in the syllabus. Extensions will be granted sparingly and only by prior arrangement with the instructors.
Students should have knowledge of basic statistical concepts and an intermediate knowledge of R. Familiarity with regular expressions and with the R packages ggplot2 and tidyverse is useful but not required.
Readings:
Christen, Peter (2012). Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Berlin: Springer.
Bender, S., Jarmin, R., Kreuter, F. and Lane, J. (2017): Privacy and Confidentiality, In: Foster, I., Gahin, R., Jarmin, R. S., Kreuter, F. and Lane, J. (eds.): Big Data and Social Science – A Practical Guide to Methods and Tools, Chapter 11, p. 299-312, Chapman & Hall.
Ghani, R. and Schierholz, M. (2017): Machine Learning, In: Foster, I., Gahin, R., Jarmin, R. S., Kreuter, F. and Lane, J. (eds.): Big Data and Social Science – A Practical Guide to Methods and Tools, Chapter 6, p. 147-186, Chapman & Hall.
Schnell, R. and Rukasz, D. (2019): PPRL: Privacy Preserving Record Linkage.
Schnell, R., Bachteler, T. and Reiher, J. (2009): Privacy-preserving Record Linkage Using Bloom Filters. BMC Medical Informatics and Decision Making 9 (41).
Vatsalan, D., Christen, P. and Verykios, V. S. (2013): A taxonomy of privacy-preserving record linkage techniques. Journal of Information Systems.
Weekly online meetings & assignments: