SURV703: Computer-Based Content Analysis I (Theory)

Data Analysis, Data Generating Process

In this course, participants learn how to use standard methods of Natural Language Processing (NLP) to support social science research through automatic content analysis. The course starts with an introduction to typical use cases for NLP, such as information extraction, text classification, and topic detection. Participants will acquire a basic understanding of the nature and possible applications of these methods so that they can judge which kinds of problems the methods suit. They will also acquire practical knowledge of how to implement these methods using the Python library Natural Language Toolkit (NLTK) and the text mining features of the WEKA Machine Learning workbench. We will show how to generate the specific feature format WEKA needs as input from textual resources and guide participants through the use of WEKA for systematic text classification experiments.

Beyond this basic form of text analysis, we will look at two advanced techniques that go beyond classifying a text. The first is topic models, which identify topics in a set of documents through a probabilistic assignment of words to the different topics. The second is identifying named entities in a text and disambiguating them by linking them to unique representations of entities in a knowledge graph.
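To make the idea of a WEKA input format concrete, the following is a minimal sketch, assuming the format meant above is WEKA's ARFF text format. It writes a few labeled sentences as one string attribute plus one nominal class attribute. The helper names `arff_escape` and `to_arff`, the relation name, and the toy documents are hypothetical and chosen only for illustration; they are not course material.

```python
def arff_escape(text: str) -> str:
    """Quote a string value for ARFF, escaping backslashes and single quotes."""
    return "'" + text.replace("\\", "\\\\").replace("'", "\\'") + "'"


def to_arff(relation: str, documents: list[tuple[str, str]]) -> str:
    """Render (text, label) pairs as ARFF with a string and a nominal attribute."""
    # The nominal class attribute must list every label that occurs in the data.
    labels = sorted({label for _, label in documents})
    lines = [
        f"@relation {relation}",
        "",
        "@attribute text string",
        "@attribute class {" + ",".join(labels) + "}",
        "",
        "@data",
    ]
    for text, label in documents:
        # Collapse whitespace so that each instance stays on a single line.
        one_line = " ".join(text.split())
        lines.append(f"{arff_escape(one_line)},{label}")
    return "\n".join(lines) + "\n"


# Hypothetical toy corpus: (document text, class label)
docs = [
    ("The parliament passed the new budget.", "politics"),
    ("The striker scored twice in the final.", "sports"),
    ("Voters weren't convinced by the campaign.", "politics"),
]

arff_text = to_arff("toy_corpus", docs)
print(arff_text)
```

A file written this way still holds raw text. In WEKA, a filter such as StringToWordVector would typically be applied to turn the string attribute into word features before a classifier is trained.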

The course also offers a practical project as an optional extension of the theoretical part (Part II, Practical Project). In this project, participants apply some of the techniques covered to answer a research question of their choice. The project consists of four steps, each guided by the course instructors. In the first step, participants define the research problem and sketch a methodology for solving it that contains some text analysis elements. The following two steps consist of preprocessing and analyzing relevant textual resources. In the final step, the results of the text analysis are used to answer the research question.

Course objectives: 

By the end of the course, students will be able to:

  • Understand the possibilities and limitations of automatic text analysis
  • Judge the potential benefits of applying automatic text analysis to a given research question
  • Preprocess a corpus using the Natural Language Toolkit (NLTK)
  • Perform text classification using the WEKA Machine Learning workbench
  • Understand the principles of advanced text analysis methods

Grading will be based on:

  • Participation in online meetings (10%)
  • Answering questions about the content of the videos – 4 quizzes (15%)
  • Practical application of NLP and Machine Learning technologies – 4 assignments (75%)

Participants need to have attended the following IPSDS courses or have corresponding knowledge:

  • SURV673 Introduction to Python and SQL, or equivalent knowledge of Python programming: data types and structures, functions and loops, file I/O
  • SURV736 Web Scraping (recommended)

Course Dates


Fall Semester (September – December)


Summer Term (June – August)