SURV699Y: Modern Workflow in Data Science

Area: Data Curation/Storage
Credit(s)/ECTS: 2/4
Core/Elective: Elective

Apply through UMD

Instructor: Alexandru Cernat

Working with large datasets, presenting insights and collaborating with others are essential skills for data and survey scientists. In this course you will learn some of the key skills needed in this research environment.

We will start the course by discussing different types of data workflows. This will cover typical ways in which organizations produce, manipulate and report on data. Getting an overview of these practices and understanding how other organizations work can bring important insights that can make your own work better. We will then discuss emerging practices from reproducible research. Finally, we will discuss how tools such as Docker and GitHub can help collaboration and improve reproducibility.

The second topic covered in the course will be reproducible documents. These are essential tools that can be used to create reports, research papers, books and websites. They are vital for reproducible research and collaboration as they combine text and code while enabling version control. In this way, typical errors due to copying and pasting and to imprecise language can be avoided. We will discuss how to use them efficiently to write reports, presentations and books and to automate reporting. We will mainly cover Rmarkdown but will also briefly discuss Jupyter notebooks.
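As a minimal illustration of the idea (not course material), the Python sketch below builds a report sentence from computed values rather than typed-in numbers, so the text cannot drift out of sync with the data. The course itself works mainly in Rmarkdown; the function name `render_report` and the example scores are hypothetical.

```python
from statistics import mean


def render_report(responses):
    """Build a one-sentence report whose numbers come from the data."""
    n = len(responses)
    avg = mean(responses)
    return f"We collected {n} responses with a mean score of {avg:.2f}."


# Hypothetical survey scores; change the data and the text updates with it.
scores = [3, 4, 5, 4]
report = render_report(scores)
print(report)
```

Reproducible document tools apply the same principle to a whole document: code chunks are re-run whenever the document is rendered, and their results are inserted into the text.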

The third topic will be working with distributed data. Many organizations store data on servers because of its size and the speed at which it is produced. Often you will need to interact with servers directly in order to access, clean and analyze data. We will discuss the main technologies for storing data (such as SQL and JSON) and how you can use Spark and R to work with distributed data.
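To make the two storage formats concrete, here is a small Python sketch (an illustration under stated assumptions, not the course's own material, which uses R and Spark). It parses hypothetical JSON records, loads them into an in-memory SQLite table and lets the database compute a summary. The table name `responses` and all values are invented for the example.

```python
import json
import sqlite3

# Hypothetical records, as they might arrive from a web service in JSON form.
raw = """[
  {"id": 1, "country": "DE", "score": 4},
  {"id": 2, "country": "US", "score": 5},
  {"id": 3, "country": "DE", "score": 2}
]"""
records = json.loads(raw)

# Load the records into an in-memory SQL table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE responses (id INTEGER, country TEXT, score INTEGER)")
con.executemany(
    "INSERT INTO responses VALUES (:id, :country, :score)", records
)

# Aggregate on the database side and fetch only the summary rows.
rows = con.execute(
    "SELECT country, AVG(score) FROM responses GROUP BY country ORDER BY country"
).fetchall()
con.close()
print(rows)
```

Only the aggregated rows leave the database. Sending the computation to where the data lives, instead of downloading everything first, is the general pattern when working with data stored on servers.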

The final topic of the course will be interactive dashboards. These are important tools for presenting data in an interactive and easy-to-read fashion. They are especially useful when data is collected at high speed and decisions need to be made based on it. They are also very useful for presenting results to clients and lay audiences. Here we will discuss how R Shiny can be used to create such dashboards.

Each topic will be covered over two weeks. The first week will cover the online video and the reading materials. In the second week students will prepare a project based on what they learned in the first week.

Course objectives: 

By the end of the course, students will…

  • Understand the main types of workflows in data and survey sciences
  • Understand the principles of reproducible workflows
  • Know how to use GitHub to support reproducible workflows
  • Understand the basics of reproducible documents
  • Learn how to use Rmarkdown and Jupyter Notebooks
  • Learn about the main types of storage for online data (e.g., SQL, JSON)
  • Learn how to access distributed clusters using Spark
  • Learn how to manage computing clusters
  • Learn the principles of building a dashboard
  • Learn how to build a dashboard using R Shiny

Grading: 

Grading will be based on:

  • Four homework assignments (worth 60% total)
  • Participation in discussion during the weekly online meetings and submission of questions via e-mail (deadline: Monday, 3:00 PM EDT/9:00 PM CEST before class) demonstrating understanding of the required readings and video lectures (10% of grade)
  • A final project (30% of grade)

Prerequisites: 

SURV665 Real World Data Management with R, or good knowledge of base R and the tidyverse.

Readings:

Bryan, J. (early release). Happy Git and GitHub for the useR.

Xie, Y., Allaire, J. J., & Grolemund, G. (2018). R Markdown: The definitive guide. Taylor & Francis, CRC Press. 

Luraschi, J. (2020). Mastering Spark with R: The complete guide to large-scale analysis and modeling. O’Reilly Media. 

Wickham, H. (early release). Mastering Shiny. CRC Press.

Weekly online meetings & assignments:

  • Week 1: Data workflow with GitHub
  • Week 2: Practical 1 (Assignment 1)
  • Week 3: Reproducible documents with Rmarkdown and Jupyter Notebooks 
  • Week 4: Practical 2 (Assignment 2)
  • Week 5: Accessing data online 
  • Week 6: Practical 3 (Assignment 3) 
  • Week 7: Interactive dashboards with Shiny 
  • Week 8: Practical 4 (Assignment 4)
  • Final exam 

Course Dates

2020

Summer Term (June – August)

Fall Semester (September – December)

2022

Spring Semester (January – May)