Instructor: Alexandru Cernat
Working with large datasets, presenting insights and collaborating with others are essential skills for data and survey scientists. In this course you will learn some of the key skills needed in this research environment.
We will start the course by discussing different types of data workflows. This will cover typical ways in which organizations produce, manipulate and report on data. Getting an overview of these practices and understanding how other organizations work can bring important insights that can make your own work better. We will then discuss emerging practices from reproducible research. Finally, we will discuss how tools such as Docker and GitHub can help collaboration and improve reproducibility.
The second topic covered in the course will be reproducible documents. These are essential tools that can be used to create reports, research papers, books and websites. They are vital for reproducible research and collaboration as they can combine text and code while enabling version control. In this way, typical errors due to copying and pasting and imprecise language can be avoided. We will discuss how to use them efficiently to write reports, presentations and books and to automate reporting. We will mainly cover R Markdown but will also briefly discuss Jupyter notebooks.
The third topic will be working with distributed data. Many organizations store data on servers because of its size and the speed at which it is produced. Often you will need to interact with servers directly in order to access, clean and analyze data. We will discuss the main technologies for storing data (such as SQL and JSON) and how you can use Spark and R to work with distributed data.
The final topic of the course will be interactive dashboards. These are important tools used to present data in an interactive and easy-to-read fashion. They are especially useful when data is collected at high speed and decisions need to be made based on such data. They are also very useful for presenting results to clients and lay audiences. Here we will discuss how R Shiny can be used to create such dashboards.
Each topic will be covered in two weeks. The first week will cover the online video and the reading materials. In the second week students will have to prepare a project based on what they learned in the first week.
By the end of the course, students will…
Grading will be based on:
SURV665 Real World Data Management with R, or a good knowledge of base R and the tidyverse.
Readings:
Bryan, J. (early release). Happy Git and GitHub for the useR.
Xie, Y., Allaire, J. J., & Grolemund, G. (2018). R Markdown: The definitive guide. Taylor & Francis, CRC Press.
Luraschi, J. (2020). Mastering Spark with R: The complete guide to large-scale analysis and modeling. O’Reilly Media.
Wickham, H. (early release). Mastering Shiny. CRC Press.
Weekly online meetings & assignments: