  • This event has passed.

Digital Heritage Seminar: Reliable Semantic Indexing of Historical Newspapers at Scale

Online event

15 June 2023
14:00 - 15:30




Digital Heritage Seminar: Historical Newspapers in the Digital Age

Newspapers constitute a vast reservoir of knowledge about the past. Given the rich textual and visual documentation of events, people, places, organizations, etc. that they contain, they have long been a favourite source for humanities scholars. The massive digitisation of historical newspapers in the past two decades has dramatically changed the ways in which researchers can make use of these sources. Traditionally, researchers were faced with the challenge of manually perusing physical or microfilmed newspaper copies, a time- and labour-intensive affair.

Today, the challenge is rather an excess of data. The increasing digitisation and provision of, for example, OCRed full text and segmented images of historical newspapers provides researchers with new tools and opportunities for studying the past.

In this series we showcase three research projects that have used digital tools to investigate different phenomena in corpora of digitised historical newspapers.



15 June 2023 at 2pm CEST (GMT +2)

Maud Ehrmann – Reliable Semantic Indexing of Historical Newspapers at Scale: Are We There Yet?



Following the decisive efforts led by libraries to digitise newspaper collections, research initiatives to apply computational methods to historical newspapers at scale have recently multiplied. In this context, the interdisciplinary project ‘impresso – Media Monitoring of the Past’ brought together a team of computational linguists, designers and historians to collaborate on the datafication of a multilingual corpus of historical newspapers. The main objectives of the project were to improve text mining tools for historical text, to enrich historical newspapers with automatically generated data, and to integrate such data into historical research workflows by means of a newly developed user interface. Beyond the challenges specific to the different research areas underpinning each of these goals, the question of how best to adapt text mining tools and their use by humanities scholars was at the heart of the impresso enterprise.

In this talk, I will present the challenges of processing and mining large-scale collections of digitised newspapers, discuss our efforts to overcome them, introduce the co-designed impresso interface and, finally, reflect on the lessons learned and outline key priorities for future developments around accurate, useful and sustainable semantic indexing of historical newspaper collections.


Practical information

Registration is free but mandatory. On the morning of the event, you will be sent the link to the meeting and the etiquette to follow.

Duration: 1.5 hours

Should you have any further questions please email oerpug.qrfrher@xoe.or.




About the speaker

Maud Ehrmann is a research scientist and lecturer at the Digital Humanities Laboratory of the École Polytechnique Fédérale de Lausanne. She holds a PhD in Computational Linguistics from Paris Diderot University and has been engaged in a large number of scientific projects centred on information extraction and text analysis, both for present-time and historical documents. Her main research interests span Natural Language Processing and Digital Humanities and include, among others, historical text annotation, historical data processing and representation, named entity recognition, and the creation of multilingual linguistic resources. Together with Marten Düring and Simon Clematide, Maud Ehrmann is responsible for impresso – Media Monitoring of the Past, a large research project which she initiated and which aims at enabling critical analysis of historical newspapers.



This series is co-organised by KBR’s Digital Heritage Working Group, which involves the following BELSPO-funded projects:

In cooperation with the ULB Information and Communication Science Department.