RP-Mod & RP-Crowd: Moderator- and Crowd-Annotated German News Comment Datasets

Assenmacher, Dennis; Niemann, Marco; Müller, Kilian; Seiler, Moritz V.; Riehle, Dennis M.; Trautmann, Heike

Research article in edited proceedings (conference) | Peer reviewed

Abstract

Abuse and hate are penetrating social media and many comment sections of news media companies. These platform providers invest considerable efforts to moderate user-generated contributions to prevent losing readers who get appalled by inappropriate texts. This is further enforced by legislative actions, which make non-clearance of these comments a punishable action. While (semi-)automated solutions using Natural Language Processing and advanced Machine Learning techniques are getting increasingly sophisticated, the domain of abusive language detection still struggles as large non-English and well-curated datasets are scarce or not publicly available. With this work, we publish and analyse the largest annotated German abusive language comment datasets to date. In contrast to existing datasets, we achieve a high labelling standard by conducting a thorough crowd-based annotation study that complements professional moderators’ decisions, which are also included in the dataset. We compare and cross-evaluate the performance of baseline algorithms and state-of-the-art transformer-based language models, which are fine-tuned on our datasets and an existing alternative, showing the usefulness for the community.

Details about the publication

Editors: Vanschoren, J.; Yeung, S.

Book title: Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1 (NeurIPS Datasets and Benchmarks 2021)

Page range: 1-14

Publisher: Self-published

Place of publication: online

Status: Published

Release year: 2021

Language in which the publication is written: English

Conference: Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1 (NeurIPS Datasets and Benchmarks 2021), Virtual Event, Online

Link to the full text: https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/file/c9e1074f5b3f9fc8ea15d152add07294-Paper-round2.pdf

Keywords: Abusive Language Detection, Newspaper, Comment Moderation, Crowd Study, NLP

Authors from the University of Münster

Assenmacher, Dennis	Data Science: Statistics and Optimization (Statistik)
Müller, Kilian	Chair of Information Systems and Information Management (IS)
Niemann, Marco	Chair of Information Systems and Information Management (IS)
Riehle, Dennis	Chair of Information Systems and Information Management (IS)
Seiler, Moritz Vinzent	Data Science: Statistics and Optimization (Statistik)
Trautmann, Heike	Data Science: Statistics and Optimization (Statistik)
