SURV736: Web Scraping and APIs

Area: Data Generating Process
Credit(s)/ECTS: 1/2
Core/Elective: Elective

Apply through UMD

Instructor: Sascha Goebel

The rapid growth of the World Wide Web over the past two decades has tremendously changed the way we share, collect, and publish data. What was once a fundamental problem for the social sciences - the scarcity and inaccessibility of observations - is quickly turning into an abundance of data. In addition to classical forms of data collection (e.g., surveys, lab or field experiments), a variety of new possibilities for collecting original data has emerged. The internet offers a wealth of opportunities to learn about public opinion and social behavior. Data from social networks, search engines, and web services open avenues for measuring human behavior and preferences at previously unknown velocity and variety. Fortunately, the open-source programming language R provides advanced functionality for gathering data from virtually any data source on the Web - via classical screen scraping approaches, automated browsing, or by tapping APIs. This allows researchers to stay in one programming environment throughout data collection, tidying, analysis, and publication.

This short course provides an overview of the web technologies that are fundamental to gathering data from internet resources, such as HTML, CSS, XML, and JSON. Students will learn how to scrape content from static and dynamic web pages using state-of-the-art R packages, and how to use R to connect to the APIs of popular web services to retrieve ready-made data. Finally, practical elements of the web scraping workflow as well as ethical issues of web data collection are discussed. The course has a strong practical component: sessions feature live R coding, and students are expected to practice every step of the process in R using various examples.

Course objectives: 

By the end of the course, students will…

  • have an overview of state-of-the-art research that draws on web-based data collection,
  • have a basic knowledge of web technologies,
  • be able to assess the feasibility of conducting scraping projects in diverse settings,
  • be able to scrape information from static and dynamic websites as well as web APIs using R, and
  • be able to tackle current research questions with original data in their own work.

Grading: 

Grading will be based on:

  • participation in discussion during the weekly online meetings and submission of questions via the forum (deadline: Tuesday, 8:00 AM EDT/2:00 PM CEST before class) that demonstrate understanding of the required readings and video lectures (10% of the grade)
  • weekly quizzes that check factual knowledge about the course topics (30% of the grade)
  • weekly assignments that require students to implement and practice scraping techniques in R (60% of the grade)

Prerequisites: 

Students are expected to be familiar with the statistical software R.

Besides base R, knowledge of the “tidyverse” packages, in particular dplyr, plyr, magrittr, and stringr, is helpful. If you are familiar with R but have no experience working with these packages, the best place to learn them is the primary reading “R for Data Science”.

Readings:

Simon Munzert, Christian Rubba, Peter Meißner, and Dominic Nyhuis, 2015: Automated Data Collection with R. A Practical Guide to Web Scraping and Text Mining. Chichester: John Wiley & Sons.

Weekly online meetings & assignments:

  • Week 1: Introduction – Web Technologies (Quiz 1, Assignment 1)
  • Week 2: Scraping Static Webpages (Quiz 2, Assignment 2)
  • Week 3: Scraping Dynamic Webpages and Good Practice (Quiz 3, Assignment 3)
  • Week 4: Tapping APIs (Quiz 4, Assignment 4)

Course Dates

2022

Summer Term (June – August)