Humboldt-Universität zu Berlin - DYNAMICS

Humboldt-Universität zu Berlin | Department of Social Sciences | DYNAMICS | PhD Programme | Summer Term 2020

Core Methods: Data Science: Programming Methods for Data Retrieval and Management

by Dr. Christopher Gandrud


The rapid growth of the World Wide Web over the past two decades has tremendously changed the way in which we share, collect and publish data. The web is full of data that are of great interest to scientists and businesses alike. Firms, public institutions and private users provide every imaginable type of information, and new channels of communication generate vast amounts of data on human behaviour. But how can we efficiently collect data from the internet, retrieve information from social networks, search engines and dynamic web pages, tap web services and, finally, process and manage the large volume of collected data with statistical software? What was once a fundamental problem for the social sciences - the scarcity and inaccessibility of observations - is quickly turning into an abundance of data.

The internet offers non-reactive measurements of the behaviour and preferences of political and other actors (for example, citizens, representatives, courts, and media). The aim of the course is to provide the technical basis for web data collection methods and subsequent data management. Furthermore, we will study state-of-the-art applications from the social sciences that exploit the potential of web-based data to tackle both classical and new questions of social science. This course will provide an introduction to the basics of web data collection practice with R. The sessions are hands-on; participants will practice every step of the process with R using various examples. The doctoral candidates will learn how to scrape content from static and dynamic web pages and connect to APIs from working scraper programs. For the practical part, the course participants are expected to independently design and collect data for their own empirical applications.