Facilitating data-level access to KBR’s digitised and born-digital collections for digital humanities research
About the project
The DATA-KBR-BE project is a 24-month project (2020-2022) financed by the Belgian Science Policy Office (Belspo) as part of the Belgian Research Action through Interdisciplinary Networks (BRAIN-be 2.0) programme. It is an interdisciplinary collaboration between cultural heritage experts, digital humanities researchers and data scientists. The aim of the DATA-KBR-BE project is to facilitate data-level access to KBR’s collections for Open Science.
DATA-KBR-BE will optimise KBR’s existing ICT infrastructure to stimulate sustainable data-level access to KBR’s digitised collections for digital humanities research. For this project, research teams at the universities of Ghent (GhentCDH and IDLab) and Antwerp (ACDC) will work closely with the digitisation, collections and ICT experts at KBR to co-design three interdisciplinary research scenarios. These scenarios will extract relevant thematic datasets from KBR’s digitised historical newspaper collection, BelgicaPress, for reuse and analysis in the field of digital humanities.
The key outputs of this project will include:
- the design of a sustainable data extraction workflow
- the design and implementation of an Open Data Platform (data.kbr.be)
- an inventory of KBR’s digital collections
- publication of the datasets
- a hackathon using the datasets
Collections as Data
Providing data-level access to digital collections is a primary challenge for undertaking digital humanities research. In the United States, the flagship initiatives ‘Always Already Computational: Collections as Data’ and ‘Collections as Data: Part to Whole’ define ‘Collections as Data’ as a “conceptual orientation to collections that renders them as ordered information, stored digitally, so that they are inherently amenable to computation”. These initiatives were established to document and exchange experience and to share knowledge, in order to encourage cultural heritage institutions to implement ‘Collections as Data’ in their own institutions. DATA-KBR-BE will kick-start the implementation of ‘Collections as Data’ in Belgium.
Data-level access to collections
Data-level access means that the project will provide access to the underlying files of digitised cultural heritage resources. This fine-grained level of access will facilitate data analysis by means of tools and methods developed in the field of digital humanities. It could include offering access to the METS (Metadata Encoding and Transmission Standard) and ALTO (Analysed Layout and Text Object) files (e.g. in XML or JSON); PDFs of the scanned images (e.g. by newspaper issue or by page); and image files, both as lower-resolution JPEG images and as high-resolution images in TIFF (Tagged Image File Format).
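To illustrate what working with such files might look like, the following minimal Python sketch reads the text blocks and their coordinates from an ALTO XML document. The sample fragment, its identifiers and the helper names (`local_name`, `extract_text_blocks`) are invented for illustration; the element and attribute names follow the ALTO schema, but the ALTO version and file layout that KBR will actually provide are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical ALTO fragment, invented for illustration (not taken from BelgicaPress).
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout>
    <Page ID="P1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace>
        <TextBlock ID="TB1" HPOS="100" VPOS="120" WIDTH="800" HEIGHT="60">
          <TextLine>
            <String CONTENT="GREVE" HPOS="100" VPOS="120" WIDTH="200" HEIGHT="60"/>
            <SP WIDTH="20" HPOS="300" VPOS="120"/>
            <String CONTENT="GENERALE" HPOS="320" VPOS="120" WIDTH="300" HEIGHT="60"/>
          </TextLine>
        </TextBlock>
        <TextBlock ID="TB2" HPOS="100" VPOS="200" WIDTH="800" HEIGHT="30">
          <TextLine>
            <String CONTENT="Les" HPOS="100" VPOS="200" WIDTH="60" HEIGHT="30"/>
            <SP WIDTH="10" HPOS="160" VPOS="200"/>
            <String CONTENT="ouvriers" HPOS="170" VPOS="200" WIDTH="150" HEIGHT="30"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Drop the '{namespace}' prefix so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def extract_text_blocks(alto_xml):
    """Return one dict per TextBlock with its ID, coordinates and joined text."""
    root = ET.fromstring(alto_xml)
    blocks = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextBlock":
            continue
        # Each word is stored in the CONTENT attribute of a String element.
        words = [
            s.get("CONTENT", "")
            for s in elem.iter()
            if local_name(s.tag) == "String"
        ]
        blocks.append({
            "id": elem.get("ID"),
            "hpos": float(elem.get("HPOS", 0)),
            "vpos": float(elem.get("VPOS", 0)),
            "width": float(elem.get("WIDTH", 0)),
            "height": float(elem.get("HEIGHT", 0)),
            "text": " ".join(words),
        })
    return blocks


if __name__ == "__main__":
    for block in extract_text_blocks(SAMPLE_ALTO):
        print(block["id"], block["hpos"], block["vpos"], block["text"])
```

Matching on the local element name rather than a fixed namespace is a deliberate simplification here, since different ALTO versions use different namespace URIs.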
Digital Humanities Research
The DATA-KBR-BE project team will co-design three interdisciplinary research scenarios that will extract relevant thematic datasets from BelgicaPress (KBR’s digitised historical newspaper collection) for reuse and analysis in the field of digital humanities. The digital humanities research undertaken in this project will be in close collaboration with KBR’s Digital Research Lab. These research scenarios are conceived as initial case studies to demonstrate the scientific potential of providing data-level access to KBR’s collections.
The interdisciplinary research scenarios that have been selected for the project are:
- Collective Action Belgium, led by GhentCDH, focuses on social history in the Interbellum and World War Two period and aims to trace the dynamics of contention, strikes, demonstrations and other forms of collective action in Belgium as reported in Belgian newspapers;
- The feuilleton in Belgium, led by ACDC, focuses on literary studies in the period 1830–1930 and aims to map the publication of literature in Belgian newspapers across the first century of the Belgian nation state;
- History of Belgian Journalism, led by ULB and KBR, focuses on media history from 1886 until now and aims to trace the history of Belgian journalism through the lens of critical discourses about journalism as found in Belgian newspapers.
Harnessing the expertise of Data Science
The role of the data scientists from Ghent University’s IDLab in the DATA-KBR-BE team will be to undertake document layout analysis of the BelgicaPress corpus in support of the interdisciplinary research scenarios.
This includes, for example, the automatic detection of images, image captions, text blocks and titles in the digitised newspaper corpus using:
- layout features (lines, blank spaces, decorations);
- textual features (fonts, capitals);
- content similarity (text/text, text/image, image/image similarity);
- coordinates of the text blocks.
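As a rough illustration of how two of these feature types (capitals and block coordinates) could be combined, the sketch below flags likely title blocks with a simple rule-based heuristic. This is not the method used by IDLab: the `Block` class, the function names and the thresholds are all hypothetical and chosen only to make the idea concrete.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Block:
    """A text block with its position and size (hypothetical structure)."""
    text: str
    hpos: float
    vpos: float
    width: float
    height: float
    n_lines: int = 1


def capital_ratio(text):
    """Share of alphabetic characters that are upper case (textual feature)."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c.isupper() for c in letters) / len(letters)


def line_height(block):
    """Approximate height of one line of text in the block (coordinate feature)."""
    return block.height / max(block.n_lines, 1)


def title_candidates(blocks, cap_threshold=0.8, height_factor=1.5):
    """Flag blocks that are mostly capitals or much taller than a typical line.

    Both thresholds are illustrative assumptions, not tuned values.
    """
    if not blocks:
        return []
    typical = median(line_height(b) for b in blocks)
    return [
        b for b in blocks
        if capital_ratio(b.text) >= cap_threshold
        or line_height(b) >= height_factor * typical
    ]


if __name__ == "__main__":
    page = [
        Block("GREVE GENERALE", 100, 120, 800, 60, n_lines=1),
        Block("Les ouvriers ont cessé le travail ce matin.", 100, 200, 800, 90, n_lines=3),
        Block("Le conseil se réunit demain.", 100, 300, 800, 60, n_lines=2),
    ]
    for block in title_candidates(page):
        print("title candidate:", block.text)
```

A hand-written rule like this is only a baseline; layout features such as separator lines or content similarity between blocks would need additional input data that is not modelled here.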
In a second stage, more advanced analysis will be undertaken to automatically classify types of articles (e.g. feuilletons) and to recognise text in images, e.g. the detection of slogans in newspaper advertisements, posters, publicity, etc.
Laying the foundations
In this first phase, the DATA-KBR-BE project will offer data-level access to KBR’s digitised collections, focussing on KBR’s digitised newspaper collection, BelgicaPress. However, this 24-month project (2020-2022) is intended to provide a solid foundation for providing access to further collections, including a wider range of digitised collections and born-digital collections such as archived websites and social media.
In addition to the Collections as Data initiative, DATA-KBR-BE is inspired by data platforms at other cultural heritage institutions, such as the national libraries of Luxembourg and the Netherlands, and the British Library.
Members of the DATA-KBR-BE project team are active participants in the International GLAM Labs Community (Galleries, Libraries, Archives and Museums), and were part of the writing team for the Open Access book Open a GLAM Lab.
The team is closely connected to international initiatives such as DARIAH (the Digital Research Infrastructure for the Arts and Humanities), CLARIN (the European Research Infrastructure for Language Resources and Technology) and the European Open Science Cloud (EOSC), including the Social Sciences and Humanities Open Cloud (SSHOC) initiative. DATA-KBR-BE is also exploring collaborations with the Heritage Data Reuse Charter, the Europeana Research Community and other digitised newspaper initiatives such as NewsEye and Impresso.
The DATA-KBR-BE project (2020-2022) is financed by the Belspo BRAIN-be 2.0 programme and led by KBR’s Digitisation Department in close collaboration with the Digital Research Lab. It is the result of an interdisciplinary collaboration between KBR, the digital humanities researchers at Ghent University’s Ghent Centre for Digital Humanities and at the Antwerp Centre for Digital Humanities and Literary Criticism at the University of Antwerp, and the data scientists at Ghent University’s Internet Technology and Data Science Lab (IDLab).