Thao Nguyen

Hi! I'm a final-year PhD student in Machine Learning at the University of Washington, co-advised by Professors Ludwig Schmidt and Sewoong Oh. My PhD research seeks to improve the quality of machine learning datasets used for training frontier models across different domains (text-only/multimodal), training stages (pre-training/fine-tuning) and scales. To do so, I study ways to augment existing datasets (e.g. via translation or image captioning) to increase the utility of more data samples, as well as ways to combine signals from both human- and model-generated data effectively.

I was an AI Resident at Google Brain from Oct 2019 to Sept 2021. Prior to that, I completed my undergrad at Stanford University, majoring in Computer Science, and had the chance to spend a wonderful summer at Two Sigma.

From June to December 2022, I was a student researcher at Google DeepMind, working with Simon Kornblith.

From September 2023 to September 2025, I was a visiting researcher at Meta AI Research, working with Luke Zettlemoyer and Xian Li.

I'm currently preparing for the industry job market! Please reach out if you are looking for (data-centric) researchers/members of technical staff.

News

[Sep 2025] I will be at COLM 2025 to present Recycling the Web, my latest work, which addresses the data wall in scaling pre-training data for LLMs.
[Apr 2025] We are organizing the "DataWorld: Unifying Data Curation Frameworks Across Domains" workshop at ICML 2025. See the call for papers.
[Oct 2024] I will be attending EMNLP and NeurIPS to present my recent work on generating better instruction-tuning data, curating pre-training data for LLMs and enhancing the diversity of vision-language datasets.
[Aug 2024] Honored to be selected as a 2024 Rising Star in EECS by MIT.
[May 2024] We are organizing the "Data-centric Machine Learning Research" workshop at ICML 2024. See the call for papers.
[Oct 2023] 3 papers accepted at NeurIPS 2023: Improving Multimodal Datasets with Image Captioning (Poster), On the Connection between Pre-training Data Diversity and Fine-tuning Robustness (Spotlight), DataComp: In search of the next generation of multimodal datasets (Oral).
I will be attending the conference and am happy to chat about data-centric research.
[Jul 2023] We are organizing the "Towards the Next Generation of Computer Vision Datasets" workshop at ICCV 2023.
See the call for papers.

Selected Research Papers

* = equal contribution. ** = authors are listed in alphabetical order.

Concept-Aware Batch Sampling Improves Language-Image Pretraining

Adhiraj Ghosh, Vishaal Udandarao*, Thao Nguyen*, Matteo Farina*, Mehdi Cherti, Jenia Jitsev, Sewoong Oh, Elisa Ricci, Ludwig Schmidt, Matthias Bethge

Under Review

(paper)

Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models

Thao Nguyen, Yang Li, Olga Golovneva, Luke Zettlemoyer, Sewoong Oh, Ludwig Schmidt, Xian Li

COLM 2025

ICML 2025 DataWorld Workshop

(paper) (dataset)

NaturalThoughts: Selecting and Distilling Reasoning Traces for General Reasoning Tasks

Yang Li*, Youssef Emad*, Karthik Padthe*, Jack Lanchantin*, Weizhe Yuan, Thao Nguyen, Jason Weston, Shang-Wen Li, Dong Wang, Ilia Kulikov, Xian Li

Under Review

(paper)

Multilingual Diversity Improves Vision-Language Representations

Thao Nguyen, Matthew Wallingford, Sebastin Santy, Wei-Chiu Ma, Sewoong Oh, Ludwig Schmidt, Pang Wei Koh, Ranjay Krishna

Spotlight paper at NeurIPS 2024

ICML 2024 Data-centric Machine Learning Research Workshop

(paper) (dataset)

DataComp-LM: In search of the next generation of training sets for language models

DataComp-LM team (58 authors)

NeurIPS Datasets & Benchmarks 2024

(paper) (website)

Better Alignment with Instruction Back-and-Forth Translation

Thao Nguyen, Jeffrey Li, Sewoong Oh, Ludwig Schmidt, Jason Weston, Luke Zettlemoyer, Xian Li

EMNLP 2024 Findings

(paper)

Improving Multimodal Datasets with Image Captioning

Thao Nguyen, Samir Yitzhak Gadre, Gabriel Ilharco, Sewoong Oh, Ludwig Schmidt

NeurIPS 2023

ICML 2023 DataPerf - Data-centric Machine Learning Research Workshop

(paper) (dataset)

DataComp: In search of the next generation of multimodal datasets

Samir Yitzhak Gadre*, Gabriel Ilharco*, Alex Fang*, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, Eyal Orgad, Rahim Entezari, Giannis Daras, Sarah Pratt, Vivek Ramanujan, Yonatan Bitton, Kalyani Marathe, Stephen Mussmann, Richard Vencu, Mehdi Cherti, Ranjay Krishna, Pang Wei Koh, Olga Saukh, Alexander Ratner, Shuran Song, Hannaneh Hajishirzi, Ali Farhadi, Romain Beaumont, Sewoong Oh, Alex Dimakis, Jenia Jitsev, Yair Carmon, Vaishaal Shankar, and Ludwig Schmidt

Oral paper at NeurIPS Datasets & Benchmarks 2023

(paper) (website)

Probing Clustering in Neural Network Representations

Thao Nguyen, Simon Kornblith

ArXiv 2023

(paper)

Guiding Image Captioning Models Toward More Specific Captions

Simon Kornblith, Lala Li, Zirui Wang, Thao Nguyen

ICCV 2023

ICLR 2023 Multimodal Representation Learning Workshop

(paper)

On the Connection between Pre-training Data Diversity and Fine-tuning Robustness

Vivek Ramanujan*, Thao Nguyen*, Sewoong Oh, Ludwig Schmidt, Ali Farhadi

Spotlight paper at NeurIPS 2023

ICML 2022 Workshop on Pre-training: Perspectives, Pitfalls, and Paths Forward

(paper)

Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP

Thao Nguyen, Gabriel Ilharco, Mitchell Wortsman, Sewoong Oh, Ludwig Schmidt

Oral paper at NeurIPS 2022

Contributed talk at ICML 2022 DataPerf - Benchmarking Data for Data-Centric AI Workshop

(paper) (code)

Avoiding Spurious Correlations: Bridging Theory and Practice

Thao Nguyen, Vaishnavh Nagarajan, Hanie Sedghi, Behnam Neyshabur

NeurIPS 2021 Workshop on Distribution Shifts: Connecting Methods and Applications

(paper)

Dominant Datapoints in Neural Network Representations

Thao Nguyen, Maithra Raghu, Simon Kornblith

Transactions on Machine Learning Research

ICML 2021 Overparameterization Pitfalls & Opportunities Workshop

(paper)

Do Wide and Deep Networks Learn the Same Things? Uncovering How Neural Network Representations Vary with Width and Depth

Thao Nguyen, Maithra Raghu, Simon Kornblith

ICLR 2021

Spotlight talk at NeurIPS 2020 Interpretable Inductive Biases and Physically Structured Learning Workshop

NeurIPS 2020 Women in Machine Learning Workshop

(paper) (Google Research blog post)

Robust and Private Learning of Halfspaces

Badih Ghazi, Ravi Kumar, Pasin Manurangsi, Thao Nguyen**

Oral paper at AISTATS 2021

NeurIPS 2020 Privacy-Preserving ML Workshop

(paper)

Concept Bottleneck Models

Pang Wei Koh*, Thao Nguyen*, Yew Siang Tang*, Steve Mussmann, Emma Pierson, Been Kim, and Percy Liang

ICML 2020

Spotlight talk at the ICML 2020 Workshop on Human Interpretability in Machine Learning

(paper) (code) (codalab) (slides)