Yield Prediction of Organic Reactions in Biased Data Sets via Positive-Unlabeled Learning
Open Access

Boser F; Spies JC; Glorius F

Research article (journal) | Peer reviewed

Abstract

The vast reaction data within scientific literature represents a rich resource for training predictive machine learning models. However, this resource is fundamentally compromised by a pervasive selection and reporting bias, resulting in imbalanced data sets. In this work, we introduce "Positivity is All You Need" (PAYN), a machine learning framework that addresses this data-scarcity problem by learning directly from biased, positive-only data. PAYN leverages a spy-based positive-unlabeled (PU) learning strategy, treating reported high-yielding reactions as the "positive" class and the vast, unexplored chemical space as the "unlabeled" class. To validate our approach, we simulated literature bias on fully labeled high-throughput experimentation (HTE) data sets, including Ni-catalyzed borylations, Buchwald-Hartwig and Suzuki-Miyaura couplings. We demonstrated that PAYN significantly improves the performance of models trained on biased data by balancing the data with augmented negative data points. This work establishes a robust strategy for leveraging biased data, paving a path toward more scalable and accessible data-driven strategies for accelerating synthesis design, optimization, and chemical discovery.
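The spy-based PU strategy named in the abstract can be illustrated with a minimal sketch. This is not the authors' PAYN implementation: the logistic-regression classifier, the function names (`fit_logistic`, `predict_proba`, `spy_reliable_negatives`), the spy fraction, the quantile threshold, and the two-dimensional toy data are all assumptions made for illustration. The idea shown is the generic spy technique: hide a few known positives among the unlabeled points, train positive-versus-unlabeled, and treat unlabeled points that score below almost all spies as likely negatives that can be used to balance the training set.

```python
import numpy as np


def fit_logistic(X, y, lr=0.1, steps=500):
    """Gradient-descent logistic regression; a stand-in for any probabilistic classifier."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(Xb @ w)))
        w -= lr * (Xb.T @ (p - y)) / len(y)
    return w


def predict_proba(w, X):
    """Predicted probability of the positive class for each row of X."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return 1.0 / (1.0 + np.exp(-(Xb @ w)))


def spy_reliable_negatives(X_pos, X_unl, spy_frac=0.15, spy_quantile=0.05, seed=0):
    """Return a boolean mask over X_unl marking likely negatives (assumed settings)."""
    rng = np.random.default_rng(seed)
    n_spies = max(1, int(round(spy_frac * len(X_pos))))
    spy_idx = rng.choice(len(X_pos), size=n_spies, replace=False)
    is_spy = np.zeros(len(X_pos), dtype=bool)
    is_spy[spy_idx] = True
    spies = X_pos[is_spy]

    # Step 1: spies are relabeled as "unlabeled" (0) alongside the real unlabeled set.
    X_train = np.vstack([X_pos[~is_spy], X_unl, spies])
    y_train = np.concatenate([np.ones((~is_spy).sum()), np.zeros(len(X_unl) + n_spies)])
    w = fit_logistic(X_train, y_train)

    # Step 2: the spies' scores show how true positives look when treated as unlabeled.
    threshold = np.quantile(predict_proba(w, spies), spy_quantile)

    # Step 3: unlabeled points scoring below nearly all spies are taken as negatives.
    return predict_proba(w, X_unl) < threshold


# Toy data (hypothetical): "reported" positives plus an unlabeled pool that
# secretly mixes further positives with true negatives.
rng = np.random.default_rng(42)
X_pos = rng.normal(2.0, 0.5, size=(100, 2))
hidden_pos = rng.normal(2.0, 0.5, size=(50, 2))
hidden_neg = rng.normal(-2.0, 0.5, size=(150, 2))
X_unl = np.vstack([hidden_pos, hidden_neg])
is_true_negative = np.arange(len(X_unl)) >= len(hidden_pos)

mask = spy_reliable_negatives(X_pos, X_unl)

# Balance the positive-only data with the inferred negatives and retrain.
X_balanced = np.vstack([X_pos, X_unl[mask]])
y_balanced = np.concatenate([np.ones(len(X_pos)), np.zeros(mask.sum())])
final_model = fit_logistic(X_balanced, y_balanced)
print("reliable negatives selected:", int(mask.sum()))
```

In a reaction-yield setting, the rows of `X_pos` would be featurized high-yielding reactions and `X_unl` would be featurized unexplored reagent combinations. How PAYN featurizes reactions and generates its augmented negatives is not specified in this record.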

Details about the publication

Journal: Journal of the American Chemical Society (J. Am. Chem. Soc.)
Volume: 148
Issue: 14
Page range: 15066-15075
Status: Published
Release year: 2026 (15/04/2026)
Language in which the publication is written: English
Keywords: Chemical reactions; Machine learning; Mathematical Methods; Organic polymers; Organic reactions

Authors from the University of Münster

Glorius, Frank
Spies, Jan Christopher