Let's Talk about Palm Leaves - From Minimal Data to Text Understanding
Basic data for this talk
Type of talk: scientific talk
Name of the speakers: Bender, Magnus; Gehrke, Marcel; Braun, Tanya
Date of talk: 26/09/2023
Talk language: English
Information about the event
Name of the event: 46th German Conference on Artificial Intelligence, 26-29 September 2023, Berlin, Germany
Event period: 26/09/2023 - 29/09/2023
Event location: Berlin
Abstract
In recent years, large language models have greatly improved the state of the art for text understanding. However, large language models are often computationally expensive and work best in areas with huge amounts of training data. Unfortunately, there are areas where not much data is available. For example, in digital humanities, researchers investigate poems written on palm leaves in old Tamil. They have only a few hundred or maybe a thousand poems (documents). In such a setting, taking a general pre-trained large language model (none exist for old Tamil) and further training it by subsampling from the corpus reaches its limits, given the limited data available. Nonetheless, support in text understanding or information retrieval is of great value to these researchers. Therefore, in this tutorial, we give an overview of how different tasks can be performed with only minimal data available. We will use examples from the field of digital humanities to illustrate particular challenges. Among these examples, we will look at the above-mentioned poems on palm leaves, which include in-line annotations that are not easy to distinguish from the actual poem if one does not know the poem. Another example is critical editions, in which scholars combine many poems, transcriptions, translations, their annotations or comments, and a dictionary. When these editions are merged, the challenge lies in identifying parts of editions that are extensions to or revisions of other critical editions. During our journey, we touch upon long-standing concepts such as topic modelling and hidden Markov models and how they still help in text understanding with minimal data. Further, we show how these approaches perform compared with large language models in areas with minimal data.
Keywords: text understanding; semantic annotations; minimal data
Speakers from the University of Münster
Braun, Tanya | Junior professorship for practical computer science - modern aspects of data processing / data science (Prof. Braun)