We are very pleased to announce that eTRAP has been awarded a €20,000 grant from the University of Göttingen for a six-month pilot project. The project, TrAiN (Tracing Authorship in Noise), seeks to investigate the complex relationship between noisy OCR’d data and automatic text analyses. In particular, we will investigate and attempt to define the maximum noise threshold at which we can still adequately conduct authorship and text reuse analyses on a number of texts selected for this study. Our research questions are: At which point does OCR/HTR noise interfere with the automatic identification of stable linguistic and stylistic markers? What is the minimum amount of noise we need to correct?
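To make the idea of a noise threshold concrete, here is a minimal sketch of how noise could be measured and simulated. It is not the project's method: the function names (`levenshtein`, `character_error_rate`, `add_noise`) are hypothetical, and the noise model (random letter substitutions at a fixed rate) is a deliberate simplification of real OCR/HTR errors. The sketch measures noise as character error rate (CER), the edit distance between a clean reference and its noisy version divided by the length of the reference.

```python
import random


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalised by the length of the clean reference."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


def add_noise(text: str, rate: float, seed: int = 0) -> str:
    """Replace each letter with a random lowercase letter with probability `rate`.

    Assumption: a toy noise model. Real OCR errors are not uniformly random.
    """
    rng = random.Random(seed)
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    noisy = []
    for char in text:
        if char.isalpha() and rng.random() < rate:
            noisy.append(rng.choice(alphabet))
        else:
            noisy.append(char)
    return "".join(noisy)


if __name__ == "__main__":
    clean = "the quick brown fox jumps over the lazy dog"
    for rate in (0.0, 0.05, 0.1, 0.2):
        noisy = add_noise(clean, rate)
        print(f"rate={rate:.2f}  CER={character_error_rate(clean, noisy):.3f}")
```

In an experiment of this kind, one would run the authorship or text reuse analysis on each noisy version and note the CER at which the results begin to diverge from those obtained on the clean text.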
The project includes a joint research workshop with stylometry experts to optimise existing algorithms and to exchange ideas and knowledge.
Project Co-PIs: Marco Büchler, Greta Franzini, Emily Franzini, Gabriela Rotari, Maria Moritz.
Congratulations! I think this task is really interesting and useful. I am particularly interested in named entity recognition in OCR’d text, as we worked on a large historical newspaper archive (Archivio storico La Stampa) containing a “significant” percentage of OCR errors.
Thank you, Andrea! Another task here will be to align HTR’d German manuscript letters with OCR’d output of the print edition of those same texts to see how they compare.