Yield Prediction of Organic Reactions in Biased Data Sets via Positive-Unlabeled Learning (Open Access)

Boser F; Spies JC; Glorius F

Research article (journal) | Peer reviewed

Abstract

The vast reaction data within scientific literature represents a rich resource for training predictive machine learning models. However, this resource is fundamentally compromised by a pervasive selection and reporting bias, resulting in imbalanced data sets. In this work, we introduce "Positivity is All You Need" (PAYN), a machine learning framework that addresses this data-scarcity problem by learning directly from biased, positive-only data. PAYN leverages a spy-based positive-unlabeled (PU) learning strategy, treating reported high-yielding reactions as the "positive" class and the vast, unexplored chemical space as the "unlabeled" class. To validate our approach, we simulated literature bias on fully labeled high-throughput experimentation (HTE) data sets, including Ni-catalyzed borylations, Buchwald-Hartwig and Suzuki-Miyaura couplings. We demonstrated that PAYN significantly improves the performance of models trained on biased data by balancing the data with augmented negative data points. This work establishes a robust strategy for leveraging biased data, paving a path toward more scalable and accessible data-driven strategies for accelerating synthesis design, optimization, and chemical discovery.
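The spy-based PU strategy named in the abstract can be illustrated with a minimal sketch. This is not the authors' PAYN implementation: the synthetic two-dimensional "reaction features", the hand-written logistic regression, the spy fraction, the 5% spy quantile, and all function names are assumptions chosen only to show the general spy technique. A few reported positives are hidden among the unlabeled points as spies, a classifier is trained on positives versus unlabeled, and unlabeled points that score below nearly all spies are taken as inferred negatives for balancing the training data.

```python
import math
import random


def predict(w, b, x):
    """Logistic score in (0, 1); z is clamped to avoid overflow in exp."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, epochs=300, lr=0.5):
    """Fit a plain logistic regression by batch gradient descent."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = predict(w, b, xi) - yi
            grad_w = [g + err * xj for g, xj in zip(grad_w, xi)]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def spy_reliable_negatives(positives, unlabeled, spy_frac=0.2, quantile=0.05, seed=0):
    """Return indices of unlabeled points that score below the spy threshold."""
    rng = random.Random(seed)
    shuffled = positives[:]
    rng.shuffle(shuffled)
    n_spies = max(1, int(spy_frac * len(shuffled)))
    spies, kept = shuffled[:n_spies], shuffled[n_spies:]
    # Remaining positives get label 1; unlabeled points plus spies get label 0.
    X = kept + unlabeled + spies
    y = [1] * len(kept) + [0] * (len(unlabeled) + len(spies))
    w, b = train_logistic(X, y)
    # Threshold: a low quantile of the scores that the known-positive spies receive.
    spy_scores = sorted(predict(w, b, s) for s in spies)
    threshold = spy_scores[int(quantile * len(spy_scores))]
    return [i for i, u in enumerate(unlabeled) if predict(w, b, u) < threshold]


# Synthetic stand-in for a fully labeled data set (assumption: two Gaussian clusters).
rng = random.Random(42)
high = [[rng.gauss(2, 1), rng.gauss(2, 1)] for _ in range(200)]    # high-yielding
low = [[rng.gauss(-2, 1), rng.gauss(-2, 1)] for _ in range(200)]   # low-yielding

# Simulated literature bias: only some high-yielding reactions are "reported".
reported = high[:60]
unlabeled = high[60:] + low
n_hidden_high = len(high) - len(reported)

neg_idx = spy_reliable_negatives(reported, unlabeled)
purity = sum(i >= n_hidden_high for i in neg_idx) / max(1, len(neg_idx))
print(f"{len(neg_idx)} inferred negatives, {purity:.0%} of them truly low-yielding")

# Final model: reported positives versus the inferred negatives.
negatives = [unlabeled[i] for i in neg_idx]
w, b = train_logistic(reported + negatives, [1] * len(reported) + [0] * len(negatives))
```

Because the true labels of the synthetic data are known, the purity of the inferred negatives can be checked directly, which mirrors the validation idea of simulating bias on fully labeled HTE data. Raising the hypothetical `quantile` parameter makes the threshold less sensitive to a single outlying spy, at the cost of labeling more hidden positives as negatives.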

Publication details

Journal: Journal of the American Chemical Society (J. Am. Chem. Soc.)
Volume: 148
Issue: 14
Page range: 15066-15075
Status: Published
Publication year: 2026 (15.04.2026)
Language of publication: English
Keywords: Chemical reactions; Machine learning; Mathematical Methods; Organic polymers; Organic reactions

Authors from the University of Münster

Glorius, Frank
Spies, Jan Christopher