Working with large datasets, presenting insights and collaborating with others are essential skills for data and survey scientists. In this course you will learn some of the key skills needed in this research environment.
We will start the course by discussing different types of data workflows. This will cover typical ways in which organizations produce, manipulate and report on data. Getting an overview of these practices and understanding how other organizations work can bring important insights that can make your own work better. We will then discuss emerging practices from reproducible research. Finally, we will discuss how tools such as Docker and GitHub can help collaboration and improve reproducibility.
The second topic covered in the course will be reproducible documents. These are essential tools that can be used to create reports, research papers, books and websites. They are vital for reproducible research and collaboration because they combine text and code while enabling version control. In this way, typical errors caused by copying and pasting or by imprecise language can be avoided. We will discuss how to use them efficiently to write reports, presentations and books and to automate reporting. We will mainly cover R Markdown but will also briefly discuss Jupyter notebooks.
The third topic will be working with distributed data. Many organizations store data on servers because of its size and the speed at which it is produced. Often you will need to interact with servers directly in order to access, clean and analyze data. We will discuss the main technologies for storing data (such as SQL and JSON) and how you can use Spark and R to work with distributed data.
The final topic of the course will be dashboards. These are important tools used to present data in an interactive and easy-to-read fashion. They are especially useful when data is collected at high speed and decisions need to be made based on that data. They are also very useful for presenting results to clients and lay audiences. Here we will discuss how RShiny can be used to create such dashboards.
Each topic will be covered over two weeks. The first week will cover the online course and the reading materials. In the second week, students will prepare a project based on what they learned in the first week.
By the end of the course, students will…
Grading will be based on:
Grades will be assigned on the following scale:
A+ 100 - 97
A 96 - 93
A- 92 - 90
B+ 89 - 87
B 86 - 83
B- 82 - 80
The grading scale is a base scale recommended by the IPSDS. Variations for grading on a scale are at the discretion of the instructor.
Assignment due dates are indicated in the syllabus. Late assignments will not be accepted without prior arrangement with the instructor.
SURV665 Real World Data Management with R, or a good knowledge of base R and the tidyverse.