Boser F; Spies JC; Glorius F
Research article (journal) | Peer reviewed

The vast reaction data within scientific literature represents a rich resource for training predictive machine learning models. However, this resource is fundamentally compromised by a pervasive selection and reporting bias, resulting in imbalanced data sets. In this work, we introduce "Positivity is All You Need" (PAYN), a machine learning framework that addresses this data-scarcity problem by learning directly from biased, positive-only data. PAYN leverages a spy-based positive-unlabeled (PU) learning strategy, treating reported high-yielding reactions as the "positive" class and the vast, unexplored chemical space as the "unlabeled" class. To validate our approach, we simulated literature bias on fully labeled high-throughput experimentation (HTE) data sets, including Ni-catalyzed borylations, Buchwald-Hartwig and Suzuki-Miyaura couplings. We demonstrated that PAYN significantly improves the performance of models trained on biased data by balancing the data with augmented negative data points. This work establishes a robust strategy for leveraging biased data, paving a path toward more scalable and accessible data-driven strategies for accelerating synthesis design, optimization, and chemical discovery.
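The spy-based PU strategy named in the abstract can be illustrated with a minimal sketch of the generic spy technique. This is not the PAYN implementation: the function names, the toy logistic-regression scorer, and the parameters `spy_frac` and `quantile` are all assumptions made for illustration. A fraction of the known positives ("spies") is hidden among the unlabeled points, a classifier is trained to separate the remaining positives from that mixture, and unlabeled points that score below almost all spies are treated as reliable negatives, which could then be used to balance a training set.

```python
import math
import random


def predict_proba(w, b, x):
    """Logistic score for one feature vector x (list of floats)."""
    z = b + sum(wj * xj for wj, xj in zip(w, x))
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, epochs=300, lr=0.1):
    """Fit a plain logistic regression by batch gradient descent.

    Stand-in scorer for this sketch only; any probabilistic classifier
    could take its place.
    """
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            for j, xij in enumerate(xi):
                grad_w[j] += err * xij
            grad_b += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def spy_reliable_negatives(positives, unlabeled, spy_frac=0.2,
                           quantile=0.05, seed=0):
    """Return unlabeled points that score below nearly all spies.

    spy_frac and quantile are hypothetical defaults, not values from
    the paper.
    """
    rng = random.Random(seed)
    shuffled = positives[:]
    rng.shuffle(shuffled)
    n_spies = max(1, int(spy_frac * len(shuffled)))
    spies, kept = shuffled[:n_spies], shuffled[n_spies:]

    # Spies are labeled 0 together with the unlabeled set.
    X = kept + unlabeled + spies
    y = [1] * len(kept) + [0] * (len(unlabeled) + len(spies))
    w, b = train_logistic(X, y)

    # Threshold: a low quantile of the scores the hidden positives get.
    spy_scores = sorted(predict_proba(w, b, s) for s in spies)
    threshold = spy_scores[int(quantile * (len(spy_scores) - 1))]
    return [u for u in unlabeled if predict_proba(w, b, u) < threshold]
```

In a reaction-yield setting, each feature vector would be some numerical encoding of a reaction, `positives` the reported high-yielding reactions, and `unlabeled` the candidate reactions without reported outcomes; the choice of encoding and classifier is left open here.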
| Glorius, Frank |
| Spies, Jan Christopher |