Thesis students

The following students are writing their theses with eTRAP.

PhD

Péter Király

Thesis title: Measuring Metadata Quality (working title)
Supervisors: Gerhard Lauer, Ramin Yahyapour, Marco Büchler
Affiliation(s): Neuere Deutsche Literatur, Seminar für Deutsche Philologie, University of Göttingen; Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG)
Program: not yet applicable
Duration: 2016 to 2018
Research context/research project: Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG); Data Quality Committee, Europeana Network
Keywords: Metadata Quality Assessment, Cultural Analytics, Big Data, Data Science
Abstract: The project revisits and extends existing methods, and develops new ones, for deciding algorithmically whether a metadata record in a cultural heritage or research data collection is “good” or “bad”. Such an evaluation helps collections improve their records so that they meet the functional requirements of their systems. A further aim of the project is to create a general framework that lets repositories and digital libraries (such as Europeana, TextGrid or the Digital Public Library of America) run a range of measurements on their collections and get suggestions on where to improve the quality of their metadata. See the Metadata Quality Assurance Framework: http://144.76.218.178/europeana-qa/

Maria Moritz

Thesis title: Reuse Diversity: Effects of changes in the context of the reuse style (working title)
Supervisors: Ramin Yahyapour, Dieter Hogrefe, Marco Büchler
Affiliation(s): Department of Computer Science, Georg-August University School of Science (GAUSS)
Program: Program in Computer Science (PCS)
Duration: 10.2015 to 10.2018
Research context/research project: electronic Text Reuse Acquisition Project (eTRAP)
Keywords: Historical Languages, Text Reuse, NLP, Reuse Style
Abstract: The automated detection of historical text reuse is challenging due to the absence of a critical mass of clean corpora, common markup standards and the loss of evidence. To reinforce research in the field of automated historical text reuse detection, I study the style of (non-literal) text reuse, i.e., the way text is modified when it is reused. My goal is to conceive a formal model that can describe reuse in terms of operations that show how text was modified during the reuse process. To this end, I conduct case studies on historical text reuse, analyse external language resources containing morphological, lexical, and conceptual knowledge, and develop an approach that can infer a probabilistic transformation model for given text reuse. My work is inspired by Shannon’s noisy channel theorem, Levenshtein’s minimum edit distance, as well as so-called edit scripts, which model software source-code changes beyond textual differences. My research aims at deepening the analysis of text reuse to ultimately improve non-literal text reuse detection.
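For illustration only, the idea of describing reuse as a sequence of operations can be sketched with a word-level Levenshtein alignment that is traced back into an edit script. This is not the probabilistic transformation model developed in the thesis; the function name edit_script and the example sentences are hypothetical, and real reuse operations (for example lemmatisation or synonym replacement) would need the external language resources mentioned above.

```python
def edit_script(source, target):
    """Return one minimal word-level edit script turning source into target.

    Each operation is a tuple (name, source_word, target_word), where name
    is "keep", "replace", "delete" or "insert".
    """
    n, m = len(source), len(target)
    # dist[i][j] = minimum number of edits turning source[:i] into target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # delete a source word
                dist[i][j - 1] + 1,         # insert a target word
                dist[i - 1][j - 1] + cost,  # keep or replace
            )

    # Walk back from the bottom-right cell to recover the operations.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if source[i - 1] == target[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                name = "keep" if cost == 0 else "replace"
                ops.append((name, source[i - 1], target[j - 1]))
                i -= 1
                j -= 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("delete", source[i - 1], None))
            i -= 1
        else:
            ops.append(("insert", None, target[j - 1]))
            j -= 1
    ops.reverse()
    return ops


if __name__ == "__main__":
    # Made-up example sentences, not taken from any corpus.
    original = "the king spoke to his people".split()
    reused = "the old king spoke unto his people".split()
    for op in edit_script(original, reused):
        print(op)
```

Counting how often each operation type occurs across many aligned pairs would be one simple way to estimate operation frequencies, which is a much cruder stand-in for the probabilistic model described in the abstract.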

Gabriela Rotari

Thesis title: Empathy in Fairy Tales (working title)
Supervisors: Gerhard Lauer, Marco Büchler, Greta Franzini
Affiliation(s): German Philology at the Graduate School of Humanities, Göttingen
Program: GSGG
Duration: 07.2016 to 09.2018
Research context/research project: electronic Text Reuse Acquisition Project (eTRAP)
Keywords: Emotions, Empathy, Fairy Tales
Abstract: Based on the Brothers Grimm’s collection of fairy tales, this research aims to detect the clues in a fairy tale that prompt strong empathic responses in young children. With the help of different sentiment analysis tools, emotion-based units are identified and extracted from the fairy tales. The power of the identified clues to enhance empathy is tested in several experiments with young children. By detecting how empathy can be triggered more effectively when children encounter a fairy tale, I hope to contribute to the emotional and social growth and development of young people.

Bachelor’s

Kirill Bulert

Thesis title: Limits of parallelization of text reuse detection in Big Data (working title)
Supervisors: Marco Büchler, Ramin Yahyapour
Affiliation(s): Department of Computer Science, University of Göttingen
Program: Applied Computer Science
Duration: 11.2016 to 02.2017
Research context/research project: electronic Text Reuse Acquisition Project (eTRAP)
Keywords: TRACER, Parallelization, Big Data, Text Reuse
Abstract: The search for text reuse in big datasets consumes a lot of computational resources. At present, our text reuse detection engine, TRACER, can only utilize the resources of a single machine. The first goal of this research is to reduce the impact of ever-growing data by modifying TRACER so that it can run in parallel on multiple computers. Big data processing requires not only computation but also data transfer, resulting in multiple bottlenecks. The second goal is to classify these bottlenecks and their impact on the overall processing time.

David Steding

Thesis title: A Decision Tree Ensemble Learning Method for Part of Speech Tagging (working title)
Supervisors: Marco Büchler, Florentin Andreas Wörgötter
Affiliation(s): Department of Computer Science, University of Göttingen
Program: Applied Computer Science with specialisation in Neuroinformatics
Duration: 11.2016 to 01.2017
Research context/research project: electronic Text Reuse Acquisition Project (eTRAP)
Keywords: Ensemble Learning, Part of Speech Tagging, Stacking, NLP, Classification Problems, Decision Trees
Abstract: Different POS tagging approaches are trained independently. Their predictions are combined and used as input for a special decision tree and other algorithms in order to always select the best classification algorithm for each example. The special decision tree is compared with algorithms such as AdaBoost and Random Forests.

Master’s

Mahdi Solhdoust

Thesis title: Text Reuse Detection at Web Scale: Online vs. Offline Text Reuse Detection
Supervisors: Ramin Yahyapour, Marco Büchler
Affiliation(s): Department of Computer Science, University of Göttingen
Program: Internet Technologies and Information Systems
Duration: 12.2015 to 06.2016
Research context/research project: electronic Text Reuse Acquisition Project (eTRAP)
Keywords: Information Retrieval, Text Reuse, Meme, Big Data
Abstract: Advances in computer microminiaturization shaped the period of human history known as the “information age”, an era in which a large quantity of information exists in digital form. Despite the many benefits of this digitization, one notable problem grows bigger every day: “information overload”, which makes it difficult to accurately reveal the relationships between pieces of information and is often regarded as noise that obscures the useful information. Text reuse, a form of text repetition or borrowing, is one of these relationships and the main focus of this research. The objective is to understand how different information retrieval approaches can help to identify text reuse. Specifically, two information retrieval methods, online and offline, are proposed and evaluated. The online method uses the search power of Google through its Custom Search API and leaves most of the information retrieval task to Google. The offline method tests whether similar or better results can be achieved with a local search engine, supported by the Apache Lucene libraries, that creates and searches indexes of big text collections. Lastly, the more successful method is used to track and trace memes through the evolution of digital documents.

Oswald M. Yinyeh

Thesis title: Analysing and Mining the Usage Patterns of Linguistic Web Services
Supervisors: Ramin Yahyapour, Marco Büchler
Affiliation(s): Department of Computer Science, University of Göttingen
Program: Internet Technologies and Information Systems
Duration: 04.2016 to 10.2016
Research context/research project: electronic Text Reuse Acquisition Project (eTRAP)
Keywords: Leipzig Linguistic Web Service (LLWS), Web Usage Mining (WUM), Service Chains, Smart and Pragmatic User
Abstract: This study analyses over 1.9 billion log entries of users’ interactions with a linguistic web service located in Leipzig, Germany. The Leipzig Linguistic Web Service (LLWS) was established in 2004 to provide access to digital text and language resources. Since then, the service provider has attempted to monitor the usage patterns of the service by logging users’ interactions with it. A few analyses have been carried out on small datasets of about 70 million records, but none on the entire dataset of over 1.9 billion entries. This thesis analyses the complete dataset with the aim of revealing patterns that help the service provider make informed decisions.

Peter Sprenger

Thesis title: Topic Modeling in Arabic: A Study on the Usefulness and Accuracy of Arabic Topic Models (working title)
Supervisors: Johan Bos, Barbara Plank, Maria Moritz, Marco Büchler
Affiliation(s): University of Groningen, University of Göttingen
Program: Master Program in Communication and Information Science with a Focus on Digital Humanities, University of Groningen
Duration: 02.2017 to 07.2017
Research context/research project: N/A
Keywords: Latent Dirichlet Allocation, Topic Models, Arabic, Evaluation
Abstract: Topic modeling is a widely used technique in the Digital Humanities. More often than not, the results are somewhat ambiguous and leave room for interpretation. This thesis aims to test the accuracy and usefulness of Latent Dirichlet Allocation (LDA) when applied to texts in the Arabic language. It will investigate the algorithm’s performance in terms of cluster-building both in the original Arabic text and in translations of that text into another language. In parallel, topic cluster distributions are evaluated with regard to the Arabic stem system. As meaning in the Arabic language is very closely connected to the root or the stem, it seems necessary to further research the influence this might have on topic modeling results. To illustrate this, take the most common example: the root “k-t-b” (كتب), which stands for the concept of “writing”. Derived stems can mean, for example, “to inscribe”, “to register” or “to decree”, or, as nouns, “book”, “letter”, “the scribe” or “the author”. The goal of this thesis is to ascertain whether LDA will group together all of the words that share the same root. From a linguistic perspective, this would be an ideal outcome.