Machine learning (ML) systems can be powerful yet brittle: they can fail unexpectedly when deployed on data that is noisy or drawn from a different distribution than the one they were trained on.
To build ML systems that remain robust under the noisy and shifting data distributions of real-world applications, Pang Wei Koh applies rigorous statistical principles in deeply collaborative settings with practitioners and domain experts. Pang Wei is a Ph.D. student at Stanford University and an incoming assistant professor at the University of Washington.
Real-world data is often unreliable, and cleaning it up can lead to substantial improvements in downstream tasks. Pang Wei developed novel methods to estimate high-resolution human mobility networks from noisy raw data, which in turn enabled epidemiological models of COVID-19 that shed light on infection disparities among disadvantaged racial and socioeconomic groups.
The work was the first to demonstrate the effect of social distancing on the spread of SARS-CoV-2, as well as the first to compute risk scores for different types of locations (restaurants, grocery stores, etc.). It helped policymakers around the world estimate the costs and benefits of different lockdown policies.
To make ML systems more reliable, it is essential to understand where and how they fail. Because training data is the key ingredient in any ML model, Pang Wei introduced methods to quantify the influence of each training example on a model's predictions, making it possible to trace errors back to the training data. This work received the ICML 2017 Best Paper Award.
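As a brief sketch of the idea behind that paper's influence functions (with $\hat\theta$ the trained parameters, $L$ the loss, $z$ a training example, $z_{\text{test}}$ a test example, and $H_{\hat\theta}$ the empirical Hessian of the training loss), upweighting a training point changes the test loss approximately as

$$
\mathcal{I}_{\text{up,loss}}(z, z_{\text{test}}) \;=\; -\nabla_\theta L(z_{\text{test}}, \hat\theta)^\top \, H_{\hat\theta}^{-1} \, \nabla_\theta L(z, \hat\theta),
\qquad
H_{\hat\theta} \;=\; \frac{1}{n}\sum_{i=1}^{n} \nabla^2_\theta L(z_i, \hat\theta).
$$

Training points with large-magnitude influence on an erroneous prediction are natural candidates for inspection, for example as mislabeled or outlying examples.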
Pang Wei’s work is inherently interdisciplinary. By combining statistical rigor with close collaboration across domains, he seeks to uncover and formalize sources of leverage, both general and application-specific, for building machine learning systems that are reliable in the wild.