SURV751: Introduction to Big Data and Machine Learning (ML I)

Area: Data Analysis
Credit(s)/ECTS: 1/2
Core/Elective: Elective

Apply through UMD

Instructors: Frauke Kreuter, Trent D. Buskirk

The amount of data generated as a by-product in society is growing fast, including data from satellites, sensors, transactions, social media and smartphones, to name just a few. Such data are often referred to as "big data" and can be used to create value in different areas such as health and crime prevention, commerce and fraud detection. Big Data are often used for prediction and classification tasks, both of which can be tackled with machine learning techniques. In this course we explore how Big Data concepts, processes and methods can be used within the context of Survey Research. Throughout this course we will illustrate key concepts using specific survey research examples, including tailored survey designs and nonresponse adjustments and evaluation.

We will start with a discussion of key Big Data terminology and concepts. We place emphasis on understanding data generating processes and the errors that can occur during these processes. Parallels between the errors in survey data collection and in Big Data gathering will be discussed, with special emphasis on coverage error and measurement error. The key goal of any analytics task is information extraction and signal detection. Such tasks can look quite different in the context of Big Data. We will compare common statistical methods to those used in the Big Data context and explain the difference in focus on prediction vs. causation. Most of the course time will be spent on general machine learning concepts, potential pitfalls, and the actual analytic processing steps when applying techniques such as classification trees, random forests, and conditional forests to process Big Data.

We use R and provide example code for the homework problems.

Course objectives: 

This course covers

  • an overview of key Big Data terminology and concepts
  • an introduction to common data generating processes
  • a discussion of some primary issues with linking Big Data with Survey Data
  • issues of coverage and measurement errors within the Big Data context
  • inference versus prediction
  • general concepts from machine learning including signal detection and information extraction
  • potential pitfalls for inference from Big Data
  • key analytic techniques (e.g., classification trees, random forests, conditional forests) for processing Big Data in R, with example code provided

Grading: 

Grading will be based on:

  • 4 online quizzes (worth 5% each)
  • Participation in discussion during the weekly online meetings and submission of questions via the discussion forum (deadline: Sunday, 1:00 PM EST/7:00 PM CET before class) demonstrating understanding of the required readings and video lectures (20% of grade). In the first week, one question will be enough, since the course will have just started.
  • 3 homework assignments (worth 20% each)

Assignment due dates are indicated in the syllabus. Late assignments will not be accepted without prior arrangement with the instructors.

Prerequisites: 

No prerequisites.

We recommend a good understanding of the material typically taught in undergraduate statistics courses and some familiarity with regression techniques, as well as knowledge of survey data collection at the level provided in the IPSDS course Fundamentals of Survey and Data Science.
While not a prerequisite, familiarity with the R software package (base R or R using RStudio) is strongly encouraged.

Readings:

AAPOR (2015). AAPOR Report on Big Data.

Buskirk, T.D., Kirchner, A., Eck, A. and Signorino, C. (2018). An Introduction to Machine Learning Methods for Survey Researchers. Survey Practice, Vol. 11(1).

Kreuter, F., Peng, R. (2014). Extracting Information from Big Data: Issues of Measurement, Inference and Linkage. In Lane J. et al. (eds.) Privacy, Big Data, and the Public Good: Frameworks for Engagement. Cambridge University Press.

Shmueli, G. (2010). To Explain or to Predict? Statistical Science 25 (3): 289–310.

Molinaro, A.M., Simon, R., Pfeiffer, R.M. (2005). Prediction error estimation: a comparison of resampling methods. Bioinformatics 21 (15): 3301–3307.

Ghani, R., Schierholz, M. (2017). Machine learning. In: I. Foster et al. (eds.). Big data and social science. A practical guide to methods and tools, Boca Raton: CRC Press, pp. 147-186.

Buskirk, T.D. (2018). Surveying the Forests and Sampling the Trees: An overview of Classification and Regression Trees and Random Forests with applications in Survey Research. Survey Practice, Vol. 11(1).

Earp, M., Mitchell, M., McCarthy, J. and Kreuter, F. (2014). Modeling Nonresponse in Establishment Surveys: Using an Ensemble Tree Model to Create Nonresponse Propensity Scores and Detect Potential Bias in an Agricultural Survey. Journal of Official Statistics, Vol. 30(4), 701–719.

Buskirk, T.D. and Kolenikov, S. (2015). Finding Respondents in the Forest: A Comparison of Logistic Regression and Random Forest Models for Response Propensity Weighting and Stratification. Survey Insights: Methods from the Field, Weighting: Practical Issues and ‘How to’ Approach.

Weekly online meetings & assignments:

  • Week 1: Overview of Big Data; Working with Big Data; Classical Statistical Approaches versus Statistical Machine Learning (Quiz 1)
  • Week 2: Model Evaluation/Validation; K-Means Clustering (Quiz 2, Assignment 1)
  • Week 3: K-Nearest Neighbors; Classification and Regression Trees (CART) (Quiz 3, Assignment 2)
  • Week 4: Random Forests (Quiz 4, Assignment 3)

Recommendations: 

If you want to dive even deeper into these topics, we recommend signing up for the follow-up course SURV753 Machine Learning II.

Course Dates

2022

Spring Semester (January – May)