
DemocraSci
A research platform for Data-Driven Democracy Studies in Switzerland

Abstract
Both the social sciences and the humanities are currently shifting from classical research methodologies (such as surveys or close reading) to data science techniques. However, the emerging research areas of social data science and digital humanities are still impeded by a lack of easily accessible, structured data. At the same time, large amounts of valuable records are held in archives and libraries, often in formats that are not suitable for data-driven research. Efforts to digitize and structure these records are often undertaken in an improvised and isolated way; in other words, the wheel is reinvented for every such project. The Swiss parliament archives are a case in point. They compile all parliamentary proceedings from 1890 to the present and therefore constitute an extremely valuable corpus for political scientists. However, although the documents have been digitized, it is still quite difficult for researchers to extract comprehensive and exhaustive information from them.
Presentation
People
Scientists


Luis is originally from Spain, where he completed his bachelor's studies in electrical engineering and his M.Sc. in signal theory and communications, both at the University of Seville. During his Ph.D. he started focusing on machine learning methods, more specifically message passing techniques for channel coding and Bayesian methods for channel equalisation. He carried out his Ph.D. between the University of Seville and the University Carlos III in Madrid, also spending some time at EPFL, Switzerland, and Bell Labs, USA, where he worked on advanced techniques for optical channel coding. After completing his Ph.D. in 2013, he moved to the Luxembourg Center on Systems Biomedicine, where he switched his interest to neuroscience, neuroimaging and the life sciences, and to the application of machine learning techniques in these fields. During his four and a half years there as a postdoc, he worked on many different problems as a data scientist, covering topics such as microscopy image analysis, neuroimaging and single-cell gene expression analysis. He joined the SDSC in April 2018.


Natalie joined the SDSC in April 2019 as a data scientist in the industry cell. She completed her Bachelor’s degree in Operations Research at Princeton University with a focus on statistics, probability and optimization, and became interested in using these tools for biological applications. She continued her Master’s studies in Computational Biology and Bioinformatics at the ETH in Zurich, focusing on machine learning and statistical modeling in biomedical settings. After her thesis, she did an internship with the machine learning and data analytics group at Disney Research in Zurich, working on solving problems in various domains using deep learning.


Lili obtained her MSc in Statistics from ETH in 2018. She wrote her Master's thesis at the Swiss Data Science Center, applying topic modelling to political data. She rejoined the center in May 2020 after a year as a statistical consultant at the Seminar for Statistics at ETH. Holding an MSc in Chemical Engineering as well, she previously worked as a process engineer in the glass industry for several years. She is interested in interdisciplinary projects where data science can help uncover new insights.
Chair of Systems Design:
- Prof. Frank Schweitzer
- Dr. Laurence Brandenberger
- Sophia Schlosser
- Jordi Campos
- Marta Balode
- Julian Minder
- Vincent Jung
Description
Problem:
- Develop a scalable and re-usable data processing chain to extract structured information from archival records.
- Apply it to a large corpus of scanned proceedings of the Swiss parliament spanning 125 years of Swiss history, which is made available by the Swiss Federal Archive.
- Develop user-friendly, interactive data analysis and visualization tools to promote the use of the resulting data set by political scientists and the public.
Solution:
A three-step workflow that tackles the following problems:
- Data preprocessing: from layout analysis to entity extraction. The output will be a structured database of the parliamentary proceedings.
- Natural language analysis: for topic modeling, named entity disambiguation, etc.
- Knowledge graph construction: all the previous results will enable the construction of a knowledge graph which, in the context of the parliamentary proceedings addressed in this project, links entities such as members of parliament, political parties and parliamentary groups, committees, Swiss cantons and cities, policy topics, and legislative processes. This will allow political researchers to parse the information more easily, analyze network dynamics, make predictions on the graph, etc.
Impact:
The resulting research platform will be of great value to political scientists, historians, social scientists, and computer scientists. It will create new avenues for data-driven research on topics like political polarization, party cohesion, government formation, strategic behavior, political representation, and party formation. It will allow historians to reconstruct a quantitative account of Swiss political history over the last 125 years. It will enable sociologists to link changing fault lines in the Federal Assembly to shifts in socioeconomic factors. It will provide resources for data-driven journalists. And it will give computer scientists a multilingual ground truth dataset, with possible applications in opinion mining and machine translation. In addition, the data processing chain developed to extract structured information from unstructured, scanned records is of interest beyond political science. We see great potential, for example, in the processing and analysis of medical records in health applications and in the mining of historical documents in digital humanities. Such methods are of growing importance for researchers in the ETH domain, and the project will thus make the SDSC more attractive to those researchers.
Detailed overview of the project
In this project, we first aim to process the records of the Swiss parliament into a format amenable to further research. To tackle this task, we have developed a pipeline comprising the following steps, as illustrated in Figure 1:
- A preprocessing step to clean the pages, correctly order the text lines, identify separating lines for further layout analysis, etc.
- A general methodology for element classification that assists the annotator during the labelling of the training data. By labelling only a small subset of the pages in the corpus (more than 200,000 in total), this methodology yields a prediction for every element (text box) of the dataset, as required for the complete extraction of the structured information. The implemented classification dashboard extracts features from the XML files associated with the PDF documents and from the text of each text box. The classification method uses these features first to suggest labels during annotation, which accelerates the process, and then, once training is complete, to predict the labels of all elements.
- Using the predicted labels for all elements (i.e. text boxes) on all pages of the corpus, a post-processing step groups paragraphs into three types of blocks: speeches, laws and votes. These are the three main entities that make up our structured dataset.
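The post-processing step described above can be pictured with a minimal Python sketch. It is not the project's implementation: the `TextBox` fields, the label names and the rule of merging consecutive boxes with the same label are assumptions made for illustration, and the sample boxes are invented toy data.

```python
from dataclasses import dataclass
from itertools import groupby

# Hypothetical label names; the text only states that the three block
# types are speeches, laws and votes.
BLOCK_LABELS = {"speech", "law", "vote"}


@dataclass
class TextBox:
    page: int    # page number in the scanned document
    order: int   # reading order on the page, as fixed by preprocessing
    label: str   # label predicted by the element classifier
    text: str


def group_into_blocks(boxes):
    """Merge runs of consecutive text boxes that share a block label."""
    ordered = sorted(boxes, key=lambda b: (b.page, b.order))
    blocks = []
    for label, run in groupby(ordered, key=lambda b: b.label):
        if label not in BLOCK_LABELS:
            continue  # other elements (e.g. page headers) are skipped
        run = list(run)
        blocks.append({
            "type": label,
            "pages": (run[0].page, run[-1].page),
            "text": " ".join(b.text for b in run),
        })
    return blocks


# Invented toy input, for illustration only.
boxes = [
    TextBox(1, 0, "header", "Page header"),
    TextBox(1, 1, "speech", "Mr President,"),
    TextBox(1, 2, "speech", "I move that ..."),
    TextBox(2, 0, "vote", "In favour: 98 votes"),
]
blocks = group_into_blocks(boxes)
```

In this simplified version a skipped element that falls between two boxes of the same speech would split that speech into two blocks; a real pipeline would need a rule for such interruptions.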
The information extracted from the documents, grouped into the three aforementioned entities, is fed into a graph database together with extra metadata such as the bill being discussed, demographic information on members of parliament (party, canton, age, gender, …), legislative year, chamber, etc. The knowledge graph (KG) associated with the graph database enables a more flexible representation of the data, as the information can easily be updated when new entities are extracted from the associated text through different types of analyses. In addition, by using the query language Cypher, different stakeholders such as political and social scientists, journalists, etc., can parse the information more easily, perform network dynamics analyses and/or predictions on the graph, check relevant historical information, etc.
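As an illustration of the kind of Cypher query a stakeholder might run, the Python sketch below builds a parameterised query that counts speeches per MP of one party within a range of years. The node labels (`Party`, `MP`, `Speech`), relationship types (`MEMBER_OF`, `GAVE`) and property names are assumptions, since the actual schema of the knowledge graph is not described here, and the party name is a placeholder.

```python
def speeches_by_party_query(party, start_year, end_year):
    """Build a parameterised Cypher query and its parameter dictionary.

    All schema names used in the query are hypothetical.
    """
    query = (
        "MATCH (p:Party {name: $party})<-[:MEMBER_OF]-(m:MP)-[:GAVE]->(s:Speech) "
        "WHERE s.year >= $start_year AND s.year <= $end_year "
        "RETURN m.name AS mp, count(s) AS n_speeches "
        "ORDER BY n_speeches DESC"
    )
    params = {"party": party, "start_year": start_year, "end_year": end_year}
    return query, params


# Placeholder arguments, for illustration only.
query, params = speeches_by_party_query("Example Party", 1950, 1960)
```

The query text and the parameters are kept separate so that they can be handed to whichever Cypher-capable database driver is in use, rather than interpolating values into the query string.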

Extracting the information from the PDF documents is one of the main aims of the current project, as it will by itself yield a database that is unique in its depth and time span and in the possibilities it offers for historical and time series analyses. Nevertheless, once the information has been structured and fed into the KG, several additional analyses will allow us to answer a variety of research questions:
- By performing topic modelling with methods such as dynamic latent Dirichlet allocation, we obtain a meaningful set of detailed topics and a description of each document as a mixture of specific topics. We also capture the temporal evolution of the topics discussed, their composition, and how different terms gain or lose importance over time. By integrating this information into the knowledge graph and linking it to specific documents and/or bills, we intend to answer research questions such as: How does the relation of specific political parties to concrete topics (e.g. army, ecology, international relations) change over time? What is the profile of the members of parliament (MPs) supporting specific bills, depending on the bills' topics? Which historical events have had the largest influence on the course of the Swiss parliament?
- Techniques for named entity recognition will further complement the KG with additional entities related to locations and organizations. It will then be possible to pose interesting research questions such as: What is the relation of specific political parties or MPs to different companies and/or sectors? Which Swiss cantons are cited most often in the chambers, and by which parties?
- By using different graph data science methods on the KG, we can also explore a varied set of questions. For example, using community detection methods, we can understand the main features that characterize different parties or MPs, and the main traits that do or do not cluster them together. Through an analysis of the relations in the graph, networks of political support could be extracted, helping to analyze how MPs from different political parties in the Swiss parliament support each other on different bills, depending on their interests.
- The analysis of the speeches associated with each MP and bill will enable the investigation of a varied set of research questions. So far, we have explored the extraction of a populism index from the speeches, using semantic role labelling. This has allowed us to identify excerpts from politicians' speeches that can clearly be tagged as populist. These results are quite interesting, as populism is mostly considered a phenomenon of recent years. This is just one example of how fine-grained information can be extracted from the speeches and then used to further enrich the graph and answer specific research questions.
These are just some examples of the analyses that could be carried out thanks to the depth, extent and richness of the generated structured dataset. Other natural language tools could be applied to study many other phenomena, making the dataset a valuable source for researchers working on historical, political and social questions. Moreover, by structuring the data and releasing it publicly, we continue the effort to provide easy access to what the Swiss parliament is doing and how it relates to historic decisions, which helps citizens be better informed and thereby strengthens democracy.
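To make the idea of a cross-party support network more concrete, here is a small Python sketch. It assumes that, for each bill, the list of supporting MPs has already been extracted, and it simply counts how often two MPs from different parties supported the same bill. The input format, the function name and the MP and party identifiers are hypothetical, and the data are invented.

```python
from collections import Counter
from itertools import combinations


def cross_party_support(bill_supporters, party_of):
    """Count, per pair of MPs from different parties, the bills both supported."""
    edges = Counter()
    for supporters in bill_supporters.values():
        # sorted(set(...)) removes duplicates and gives each pair a fixed order
        for a, b in combinations(sorted(set(supporters)), 2):
            if party_of[a] != party_of[b]:
                edges[(a, b)] += 1
    return edges


# Invented toy input, for illustration only.
party_of = {"mp_a": "P1", "mp_b": "P2", "mp_c": "P1"}
bill_supporters = {
    "bill_1": ["mp_a", "mp_b"],
    "bill_2": ["mp_a", "mp_b", "mp_c"],
    "bill_3": ["mp_a", "mp_c"],
}
edges = cross_party_support(bill_supporters, party_of)
```

The resulting weighted edge list could then serve as input to the community detection or network dynamics analyses mentioned above.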