SURV699: Modern Workflows in Data Science

Data Curation/Storage

Working with large datasets, presenting insights and collaborating with others are essential skills for data and survey scientists. In this course you will learn some of the key skills needed in this research environment.

We will start the course by discussing different types of data workflows. This will cover typical ways in which organizations produce, manipulate and report on data. Getting an overview of these practices and understanding how other organizations work can bring important insights that can make your own work better. In this unit we will also discuss how tools such as GitHub can help collaboration and improve reproducibility.

The second topic covered in the course will be reproducible documents. These are essential tools that can be used to create reports, research papers, books and websites. They are vital for reproducible research and collaboration as they can combine text and code while enabling version control. In this way, typical errors due to copying and pasting and imprecise language can be avoided.

The third topic will be accessing data online. Many organizations store their data on servers because of its volume and the speed at which it is produced. Often you will need to interact with servers directly in order to access, clean and analyze data. We will discuss the main technologies for storing data (such as SQL and JSON) and how you can use R to access them.

The final topic of the course will be dashboards. These are important tools used to present large datasets in a reliable and easy-to-read fashion. They are especially useful when data is collected at high speed and decisions need to be made based on such data. They are also very useful for presenting results to clients and lay audiences. Here we will discuss how R Shiny can be used to create such dashboards.

Each topic will be covered over two weeks. The first week will cover the online course and the reading materials. In the second week students will prepare a project based on what they learned in the first week.

Course objectives: 

By the end of the course, students will…

  • Understand the main types of workflows in data and survey sciences
  • Understand the principles of reproducible workflows
  • Know how to use GitHub to support reproducible workflows
  • Understand the basics of reproducible documents
  • Learn how to use R Markdown and Jupyter Notebooks
  • Learn about the main types of storage for online data (e.g., SQL, JSON)
  • Learn how to access online data and interact with them using R
  • Learn the principles of building a dashboard
  • Learn how to build a dashboard using R Shiny

Grading will be based on:

  • Four homework assignments (worth 60% total)
  • Participation in discussion during the weekly online meetings and submission of questions (deadline: Monday, 3:00 AM EDT/9:00 AM CEST before class) demonstrating understanding of the required readings and video lectures (10% of grade)
  • A final project (30% of grade)

Prerequisite: SURV665 Real World Data Management with R, or a good knowledge of base R and the tidyverse.

Course Dates


Summer Term (June – August)