The correct detection of article layout in historical newspaper pages remains challenging but is important for Natural Language Processing ( NLP) and machine learning applications in the field of digital history. Digital newspaper portals typically provide Optical Character Recognition ( OCR) text, albeit of varying quality. Unfortunately, layout information is often missing, limiting this rich source’s scope. Our dataset is designed to address this issue for historic German-language newspapers. The Chronicling Germany dataset contains 581 annotated historical newspaper pages from the time period between 1852 and 1924. Historic domain experts have spent more than 1,500 hours annotating the dataset. The paper presents a processing pipeline and establishes baseline results on in- and out-of-domain test data using this pipeline. Both our dataset and the corresponding baseline code are freely available online. This work creates a starting point for future research in the field of digital history and historic German language newspaper processing. Furthermore, it provides the opportunity to study a low-resource task in computer vision.

 

Citation:

C. Schultze, N. Kerkfeld, K. Kuebart, P. Weber, M. Wolter, and F. Selgert, “Chronicling Germany: an annotated historical newspaper dataset,” arXiv.org, Jan. 30, 2024. https://arxiv.org/abs/2401.16845

 

More Information:

Open source: https://arxiv.org/abs/2401.16845