Thao Nguyen

thaottn@cs.washington.edu

PhD Candidate, University of Washington
Visiting Researcher, Meta AI Research

Hi! I'm currently a 4th-year PhD student in Machine Learning at the University of Washington, co-advised by Professors Ludwig Schmidt and Sewoong Oh. My PhD research seeks to improve the quality of machine learning datasets used across different domains (text-only/multimodal), training stages (pretraining/finetuning), and scales. To do so, I study ways to augment existing datasets (e.g., via translation or image captioning) so that more data samples become useful, as well as ways to combine signals from both human- and model-generated data in an effective manner.

I was an AI Resident at Google Brain from Oct 2019 to Sept 2021. Prior to that, I completed my undergrad at Stanford University, majoring in Computer Science, and had the chance to spend a wonderful summer at Two Sigma.

From June to December 2022, I was a student researcher at Google DeepMind, working with Simon Kornblith.

Since September 2023, I have been a visiting researcher at Meta AI Research, working with Luke Zettlemoyer and Xian Li.

News

  • [Oct 2024] I will be attending EMNLP and NeurIPS to present my recent work on generating better instruction-tuning data, curating pre-training data for LLMs and enhancing the diversity of vision-language datasets.
  • [Aug 2024] Honored to be selected as a 2024 Rising Star in EECS by MIT.
  • [May 2024] We are organizing the Data-centric Machine Learning Research workshop at ICML 2024. See the call for papers.
  • [Oct 2023] 3 papers accepted at NeurIPS 2023: Improving Multimodal Datasets with Image Captioning (Poster), On the Connection between Pre-training Data Diversity and Fine-tuning Robustness (Spotlight), DataComp: In search of the next generation of multimodal datasets (Oral).
    I will be attending the conference and am happy to chat about data-centric research.
  • [Jul 2023] We are organizing "Towards the Next Generation of Computer Vision Datasets" workshop at ICCV 2023.
    See the call for papers.

Selected Research Papers

* = equal contribution. ** = authors are listed in alphabetical order.

Multilingual Diversity Improves Vision-Language Representations
Thao Nguyen, Matthew Wallingford, Sebastin Santy, Wei-Chiu Ma, Sewoong Oh, Ludwig Schmidt, Pang Wei Koh, Ranjay Krishna
Spotlight paper at NeurIPS 2024
ICML 2024 Data-centric Machine Learning Research Workshop

DataComp-LM: In search of the next generation of training sets for language models
DataComp-LM team (58 authors)
NeurIPS Datasets & Benchmarks 2024

Better Alignment with Instruction Back-and-Forth Translation
Thao Nguyen, Jeffrey Li, Sewoong Oh, Ludwig Schmidt, Jason Weston, Luke Zettlemoyer, Xian Li
EMNLP 2024 Findings

Improving Multimodal Datasets with Image Captioning
Thao Nguyen, Samir Yitzhak Gadre, Gabriel Ilharco, Sewoong Oh, Ludwig Schmidt
NeurIPS 2023
ICML 2023 DataPerf - Data-centric Machine Learning Research Workshop

DataComp: In search of the next generation of multimodal datasets
Samir Yitzhak Gadre*, Gabriel Ilharco*, Alex Fang*, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, Eyal Orgad, Rahim Entezari, Giannis Daras, Sarah Pratt, Vivek Ramanujan, Yonatan Bitton, Kalyani Marathe, Stephen Mussmann, Richard Vencu, Mehdi Cherti, Ranjay Krishna, Pang Wei Koh, Olga Saukh, Alexander Ratner, Shuran Song, Hannaneh Hajishirzi, Ali Farhadi, Romain Beaumont, Sewoong Oh, Alex Dimakis, Jenia Jitsev, Yair Carmon, Vaishaal Shankar, and Ludwig Schmidt
Oral paper at NeurIPS Datasets & Benchmarks 2023

Probing Clustering in Neural Network Representations
Thao Nguyen, Simon Kornblith
arXiv 2023

Guiding Image Captioning Models Toward More Specific Captions
Simon Kornblith, Lala Li, Zirui Wang, Thao Nguyen
ICCV 2023
ICLR 2023 Multimodal Representation Learning Workshop

On the Connection between Pre-training Data Diversity and Fine-tuning Robustness
Vivek Ramanujan*, Thao Nguyen*, Sewoong Oh, Ludwig Schmidt, Ali Farhadi
Spotlight paper at NeurIPS 2023
ICML 2022 Workshop on Pre-training: Perspectives, Pitfalls, and Paths Forward

Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP
Thao Nguyen, Gabriel Ilharco, Mitchell Wortsman, Sewoong Oh, Ludwig Schmidt
Oral paper at NeurIPS 2022
Contributed talk at ICML 2022 DataPerf - Benchmarking Data for Data-Centric AI Workshop

Avoiding Spurious Correlations: Bridging Theory and Practice
Thao Nguyen, Vaishnavh Nagarajan, Hanie Sedghi, Behnam Neyshabur
NeurIPS 2021 Workshop on Distribution Shifts: Connecting Methods and Applications

Dominant Datapoints in Neural Network Representations
Thao Nguyen, Maithra Raghu, Simon Kornblith
Transactions on Machine Learning Research
ICML 2021 Overparameterization Pitfalls & Opportunities Workshop

Do Wide and Deep Networks Learn the Same Things? Uncovering How Neural Network Representations Vary with Width and Depth
Thao Nguyen, Maithra Raghu, Simon Kornblith
ICLR 2021
Spotlight talk at NeurIPS 2020 Interpretable Inductive Biases and Physically Structured Learning Workshop
NeurIPS 2020 Women in Machine Learning Workshop

Robust and Private Learning of Halfspaces
Badih Ghazi, Ravi Kumar, Pasin Manurangsi, Thao Nguyen**
Oral paper at AISTATS 2021
NeurIPS 2020 Privacy-Preserving ML Workshop

Concept Bottleneck Models
Pang Wei Koh*, Thao Nguyen*, Yew Siang Tang*, Steve Mussmann, Emma Pierson, Been Kim, and Percy Liang
ICML 2020
Spotlight talk at the ICML 2020 Workshop on Human Interpretability in Machine Learning

Predicting Inpatient Discharge Prioritization with Electronic Health Records
Anand Avati*, Stephen Pfohl*, Chris Lin, Thao Nguyen, Meng Zhang, Philip Hwang, Jessica Wetstone, Kenneth Jung, Andrew Ng, Nigam H. Shah
arXiv 2018