Open Bachelor Thesis Project – Ambiguity in Word Substitutions at Token Level

To improve the automatic detection of text reuse in historical texts, we need to better understand its characteristics and rationale. One way to do so is to investigate the ambiguity of words that are replaced when text is reused.

Previously, we investigated how frequently words are replaced in a reuse, how this frequency correlates with the number of a word’s senses [1], and whether this behavior is caused by particular characteristics of the vocabulary of the source and the target text [2]. In this Bachelor thesis, we want to take a closer look at the replaced tokens (word instances) and their ambiguity. By studying and measuring concrete replacements, we want to better understand whether a word gains or loses ambiguity in the target context compared to the source context, whether only certain senses of an ambiguous word tend to be replaced, and how these behaviors might be influenced by the vocabulary of the corresponding contexts. We are also interested in whether and how this effect differs when the two texts are written centuries apart.
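To make the notion of a word "gaining or losing ambiguity" more concrete, the following is a minimal sketch of one possible operationalization, not the method used in [1] or [2]. It assumes that both texts are available as lists of (word, sense) pairs and measures a word's ambiguity as the Shannon entropy of the senses observed for it in that text. The function names, the sense labels, and the toy data are hypothetical and serve illustration only.

```python
import math
from collections import Counter, defaultdict


def sense_entropy(annotated_tokens):
    """Map each word to the Shannon entropy (in bits) of its observed senses.

    annotated_tokens: iterable of (word, sense_label) pairs from ONE text.
    A word that always occurs with the same sense gets entropy 0.0.
    """
    senses_by_word = defaultdict(Counter)
    for word, sense in annotated_tokens:
        senses_by_word[word][sense] += 1

    entropy = {}
    for word, sense_counts in senses_by_word.items():
        total = sum(sense_counts.values())
        probs = [count / total for count in sense_counts.values()]
        entropy[word] = sum(p * math.log2(1 / p) for p in probs)
    return entropy


def ambiguity_shift(replacements, source_entropy, target_entropy):
    """For each (source_word, target_word) replacement, return the entropy change.

    A positive value means the replacing word is more ambiguous in the target
    text than the replaced word was in the source text (under this measure).
    """
    return [
        (src, tgt, target_entropy[tgt] - source_entropy[src])
        for src, tgt in replacements
    ]


# Toy data with invented sense labels (assumption, not taken from any Bible).
source_tokens = [("charity", "s1"), ("charity", "s1"),
                 ("spirit", "s1"), ("spirit", "s2")]
target_tokens = [("love", "s1"), ("love", "s2"),
                 ("love", "s1"), ("love", "s2"), ("spirit", "s1")]

shifts = ambiguity_shift(
    [("charity", "love")],
    sense_entropy(source_tokens),
    sense_entropy(target_tokens),
)
print(shifts)
```

Entropy over observed senses is only one option. Counting the senses listed in a sense inventory, as in the type-level analysis, would be an alternative that does not depend on corpus size.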

Furthermore, we want to know: i) when does the distribution of ambiguous words that are replaced between two texts follow a power law?, ii) are there deviations from the power law?, and iii) what causes this behavior? This work strives to understand why unambiguous (or ambiguous) words are used as word replacements in text reuse and, consequently, whether and how they could serve as features for automatic text reuse detection.
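As a first, rough diagnostic for question i), one could check how close the rank-frequency curve of replaced words is to a straight line in log-log space. The sketch below does this with an ordinary least-squares fit. It is an assumption about how such a check might look, not a rigorous power-law test (a maximum-likelihood fit with a goodness-of-fit test would be more reliable), and the function name and input format are hypothetical.

```python
import math


def rank_frequency_fit(counts):
    """Fit log(frequency) = intercept + slope * log(rank) by least squares.

    counts: replacement counts per word (any order, positive numbers).
    Returns (slope, intercept, r_squared). A slope near a negative constant
    with r_squared close to 1 is consistent with, but does not prove,
    a power-law rank-frequency relation.
    """
    freqs = sorted((c for c in counts if c > 0), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two positive counts")

    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Coefficient of determination of the log-log fit.
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else 1.0
    return slope, intercept, r_squared


# Synthetic counts that follow frequency = 1000 / rank exactly (illustration only).
synthetic_counts = [1000 / rank for rank in range(1, 51)]
slope, intercept, r_squared = rank_frequency_fit(synthetic_counts)
print(round(slope, 3), round(r_squared, 3))
```

Large residuals at particular ranks in such a fit would be one way to locate the deviations asked about in question ii).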

The experiments are conducted on a corpus of historical English Bibles, one of which (the King James Version, KJV) has been annotated with word senses [3]. We have preliminary material (source code for training and identifying word senses) available to supply other Bibles with word senses too. This material can be used as is or improved. The choice of Bibles is flexible, so the experiments can, for example, be performed by comparing the KJV with later Bibles, such as the Webster Bible (WBT) or the 21st Century King James Version (KJ21), or by comparing the KJV with a literal translation of the Greek Bible or with a simplified English version.

[Another (or alternative) goal of the thesis is to investigate alternative ways for word sense annotation.]

Interested students are asked to contact Maria Moritz (mmoritz[at]etrap[dot]eu) to discuss further details. Programming skills and good knowledge of NLP basics are required.

[1] https://web.stanford.edu/~jurafsky/slp3/slides/Chapter18.wsd.pdf
[2] Moritz, M. and Büchler, M. (2017) ‘Ambiguity in Semantically Related Word Substitutions: an investigation in historical Bible translations’, In: (Proceedings) The Workshop on Processing Historical Language at NODALIDA 2017. Gothenburg, May 22-24 2017. Linköping University Electronic Press. http://www.ep.liu.se/ecp/133/005/ecp17133005.pdf
[3] Alessandro Raganato, Jose Camacho-Collados, Antonio Raganato, and Yunseo Joung. 2016. Semantic indexing of multilingual corpora and its application on the history domain. In Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH), pages 140–147, Osaka, Japan. The COLING 2016.


