28 Sep 2025

Chemistry Seminar

Unexpected failure and success in data-driven materials science

 

Abstract

High-throughput computation and experiments, combined with data-driven methods, promise to revolutionize materials science. Central to this paradigm is machine learning (ML) for autonomous discovery in place of traditional approaches relying on trial and error or intuition. However, biases in the use of ML have attracted little attention. These biases can make ML less effective or even problematic, thereby decelerating materials discovery. This talk features four examples [1-4] from our recent studies on unexpected failure modes in robustness and redundancy, as well as unexpected success in prediction tasks considered challenging.

First, we show that model performance on community benchmarks does not reflect true generalization in materials discovery. Using the Materials Project database as a case study, we reveal that ML models can achieve excellent performance when benchmarked within an earlier database version, yet these pretrained models degrade severely on new materials from the latest version. In the second example, on data redundancy across large materials datasets, we find that up to 95% of the data can be removed without impacting model performance, highlighting the inefficiency of existing data acquisition practices. Next, we reveal biases in interpreting the generalization capability of ML models. With our recently curated dataset for high-entropy materials, we demonstrate that ML models trained on simpler structures can generalize well to more complex disordered, higher-order alloys, thereby unlocking new strategies to explore the high-entropy materials space. Through a comprehensive investigation across large materials datasets, we further reveal that existing ML models can generalize well beyond the chemical or structural groups of the training set; the application domains of ML models may therefore be broader than intuition suggests. Finally, we show that scaling up dataset size has marginal or even adverse effects on out-of-domain generalization, contrary to conventional scaling wisdom. These results call for a rethinking of the usual criteria for materials classification and of strategies for neural scaling.

Biography

Dr. Kangming Li is an Assistant Professor in the Physical Science and Engineering Division at KAUST, specializing in computational materials science. He integrates multiscale atomistic simulations with machine learning to discover novel inorganic solid-state compounds for advanced energy and functional applications. He received the Dalla Torre Medal from the French Society for Metallurgy and Materials in 2022 for his multiscale modeling of magnetic alloys. After earning his Ph.D. in physics from Paris-Saclay University in 2021, he completed a postdoctoral fellowship with Professor Jason Hattrick-Simpers and later served as a staff scientist at the University of Toronto’s Acceleration Consortium. He now develops machine learning tools to automate and guide computational modeling and experiments, advancing autonomous materials discovery and physical-science research. He has published first-author work in leading journals including Nature Communications, Matter, npj Computational Materials, and Acta Materialia.

Event Quick Information

Date: 28 Sep 2025
Time: 11:45 AM - 12:45 PM
Venue: KAUST, Bldg. 9, Level 2, Lecture Hall 1