Let's Talk about Palm Leaves - From Minimal Data to Text Understanding

Basic data for this talk

Type of talk: scientific talk

Name of the speakers: Bender, Magnus; Gehrke, Marcel; Braun, Tanya

Date of talk: 26/09/2023

Talk language: English

URL of slides: https://www.uni-muenster.de/Informatik.AGBraun/en/research/tutorials/ki-23.html

Information about the event

Name of the event: 46th German Conference on Artificial Intelligence, 26-29 September 2023, Berlin, Germany

Event period: 26/09/2023 - 29/09/2023

Event location: Berlin

Event website: https://ki2023.gi.de/

Abstract

In recent years, large language models have greatly improved the state of the art in text understanding. However, large language models are often computationally expensive and work best in areas with huge amounts of training data. Unfortunately, there are areas where little data is available. For example, in the digital humanities, researchers investigate poems written on palm leaves in Old Tamil. They have only a few hundred, or perhaps a thousand, poems (documents). In such a setting, taking a general pre-trained large language model (none exists for Old Tamil) and further training it by subsampling from the corpus reaches its limits, given the limited data available. Nonetheless, support for text understanding or information retrieval is of great value to these researchers. Therefore, in this tutorial, we give an overview of how different tasks can be performed with only minimal data available. We use examples from the field of digital humanities to illustrate particular challenges. Among these examples, we look at the above-mentioned poems on palm leaves, which include in-line annotations that are hard to distinguish from the actual poem if one does not know the poem. Another example is critical editions, in which scholars combine many poems, transcriptions, translations, their annotations or comments, and a dictionary. When these editions are merged, the challenge lies in identifying parts of editions that are extensions to or revisions of other critical editions. During our journey, we touch upon long-standing concepts such as topic modelling and hidden Markov models and show how they still help in text understanding with minimal data. Further, we show how these approaches perform compared with large language models in areas with minimal data.

Keywords: text understanding; semantic annotations; minimal data

Speakers from the University of Münster

Braun, Tanya

Junior professorship of practical computer science - modern aspects of data processing / data science (Prof. Braun)

Publications referred to in the talk

LESS is More - LEan Computing for Selective Summaries

Bender, Magnus; Braun, Tanya; Möller, Ralf; Gehrke, Marcel (2023)

In: Seipel, Dietmar; Steen, Alexander (eds.), Proceedings of the 46th German Conference on Artificial Intelligence, 1-14. Berlin: Springer. doi:10.1007/978-3-031-42608-7_1

Research article in edited proceedings (conference) | Peer reviewed | Published