Let's Talk about Palm Leaves - From Minimal Data to Text Understanding

Basic data for this talk

Type of talk: scientific talk
Name of speakers: Bender, Magnus; Gehrke, Marcel; Braun, Tanya
Date of talk: 26/09/2023
Talk language: English
URL of slides: https://www.uni-muenster.de/Informatik.AGBraun/en/research/tutorials/ki-23.html

Information about the event

Name of the event: 46th German Conference on Artificial Intelligence, 26-29 September 2023, Berlin, Germany
Event period: 26/09/2023 - 29/09/2023
Event location: Berlin
Event website: https://ki2023.gi.de/

Abstract

In recent years, large language models have greatly improved the state of the art for text understanding. However, large language models are often computationally expensive and work best in areas with huge amounts of training data. Unfortunately, there are areas where we do not have a lot of data available. For example, in digital humanities, we have researchers investigating poems that are written on palm leaves in Old Tamil. They only have a few hundred or maybe a thousand poems (documents). In such a setting, using a general pre-trained large language model (there are none for Old Tamil) and further training the model by subsampling from the corpus reaches its limits, given the limited data available. Nonetheless, support in text understanding or information retrieval also has great value for these researchers. Therefore, in this tutorial, we give an overview of how different tasks can be performed with only minimal data available. We will use examples from the field of digital humanities to illustrate particular challenges. Among these examples, we will look at the above-mentioned poems on palm leaves, which include in-line annotations that are not easy to distinguish from the actual poem if one does not know the poem. Another example is critical editions, where scholars combine many poems, transcriptions, translations, their annotations or comments, and a dictionary. When these editions are merged, the challenges that arise lie in identifying parts of editions that are extensions to or revisions of other critical editions. During our journey, we touch upon long-standing concepts such as topic modelling and hidden Markov models and how they still help in text understanding with minimal data. Further, we show how these approaches perform compared to large language models in areas with minimal data.
Keywords: text understanding; semantic annotations; minimal data

Speakers from the University of Münster

Braun, Tanya
Junior professorship for practical computer science - modern aspects of data processing / data science (Prof. Braun)