SURV753: Machine Learning II (ML II)

Area: Data Analysis
Credit(s)/ECTS: 2/4
Core/Elective: Elective

Instructors: Christoph Kern, Trent D. Buskirk

Social scientists and survey researchers are confronted with an increasing number of new data sources, such as apps and sensors, that often result in (para)data structures that are difficult to handle with traditional modeling methods. At the same time, advances in the field of machine learning (ML) have created an array of flexible methods and tools that can be used to tackle a variety of modeling problems. Against this background, this course discusses advanced ML concepts such as cross-validation, class imbalance, boosting and stacking, as well as key approaches for facilitating model tuning and performing feature selection. The course also introduces additional machine learning methods, including Support Vector Machines, Extra-Trees and LASSO, among others. It aims to illustrate these concepts, methods and approaches from a social science perspective. Furthermore, the course covers techniques for extracting patterns from unstructured data as well as for interpreting and presenting results from machine learning algorithms. Code examples will be provided using the statistical programming language R.

The course is structured such that each session focuses on specific prediction tasks and presents tools that can be used to tackle modeling problems in this setting. Topics include accounting for informative data structures in the context of model training and tuning, dealing with class imbalance in categorical outcomes, building effective prediction models by applying cutting-edge ML methods, and performing feature selection in high-dimensional data settings. The presented methods will be motivated from a social and survey science perspective and critically discussed with respect to their advantages and limitations.

Course objectives: 

By the end of the course, students will…

  • have a profound understanding of advanced (ensemble) prediction methods
  • have built up a comprehensive ML toolkit to tackle various learning problems
  • know how to (critically) evaluate and interpret results from "black-box" models

Grading:

Grading will be based on

  • 4 homework assignments (10% each)
  • 8 online quizzes (5% each)
  • Participation in discussion during the weekly online meetings (20% of grade)

Assignment due dates are indicated on Canvas. Late assignments will not be accepted without prior arrangement with the instructors.

Prerequisites: 

Topics covered in SURV751: Introduction to Machine Learning and Big Data (ML I), i.e.:

  • Conceptual basics of machine learning (training vs. test data, model evaluation basics)
  • Decision trees with CART
  • Random forests

Familiarity with the statistical programming language R is strongly recommended.

Participants are encouraged to work through one or more R tutorials prior to the first class meeting. Some resources can be found here:

Readings:

Ghani, R. and Schierholz, M. (2017). Machine learning. In: Foster, I., Ghani, R., Jarmin, R. S., Kreuter, F., and Lane, J. (Eds.). Big Data and Social Science: A Practical Guide to Methods and Tools. Boca Raton, FL: CRC Press Taylor & Francis Group. 

Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York, NY: Springer.

James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An Introduction to Statistical Learning. New York, NY: Springer.

Boehmke, B., and Greenwell, B. M. (2019). Hands-On Machine Learning with R. Boca Raton, FL: CRC Press.

Buskirk, T. D., Kirchner, A., Eck, A. and Signorino, C. (2018). An Introduction to Machine Learning Methods for Survey Researchers. Survey Practice 11(1).

Kern, C., Klausch, T., and Kreuter, F. (2019). Tree-based Machine Learning Methods for Survey Research. Survey Research Methods 13(1), 73–93.

Kuhn, M. and Johnson, K. (2019). Measuring Performance. In: Feature Engineering and Selection: A Practical Approach for Predictive Models.

Efron, B. and Hastie, T. (2016). Sparse Modeling and the Lasso. In: Computer Age Statistical Inference. Algorithms, Evidence, and Data Science. New York, NY: Cambridge University Press.

Kassambara, A. (2017). Practical Guide to Cluster Analysis in R: Unsupervised Machine Learning, Chapters 7–9.

Molnar, C. (2019). Interpretable Machine Learning. A Guide for Making Black Box Models Explainable, Chapters 5.1–5.7.

Weekly online meetings & assignments:

  • Week 1: Intro: Bias-variance trade-off, cross-validation (stratified splits, temporal CV) and model tuning (grid and random search) (Quiz 1)
  • Week 2: Classification: Performance metrics (ROC, PR curves, precision at K) and class imbalance (over- and undersampling, SMOTE) (Quiz 2, Homework 1)
  • Week 3: Ensemble methods I: Bagging and Extra-Trees (Quiz 3)
  • Week 4: Ensemble methods II: Boosting (AdaBoost, GBM, XGBoost) and Stacking (Quiz 4, Homework 2)
  • Week 5: Variable selection: Lasso, elastic net and fuzzy/recursive random forests (Quiz 5)
  • Week 6: Support Vector Machines (Quiz 6, Homework 3)
  • Week 7: Advanced unsupervised learning: Hierarchical clustering and LDA (Quiz 7)
  • Week 8: Interpreting (Variable Importance, PDP, ...) and reporting ML results (Quiz 8, Homework 4)

Course Dates

2020: Summer Term (June – August)

2022: Fall Semester (September – December)