Hi! I recently completed my PhD in Machine Learning at the University of Washington, co-advised by Professors Ludwig Schmidt and Sewoong Oh. My PhD research helped improve the quality of machine learning datasets used for training foundation models across different domains (text-only/multimodal), training stages (pre-training/fine-tuning) and scales. To do so, I proposed new approaches to augment existing datasets (e.g. via translation or image captioning) so that more data samples become useful for training, as well as ways to effectively combine signals from both human- and model-generated data.
I was an AI Resident at Google Brain from Oct 2019 to Sept 2021. Prior to that I completed my undergrad at Stanford University, majoring in Computer Science, and had the chance to spend a wonderful summer at Two Sigma.
From June to December 2022, I was a student researcher at Google DeepMind, working with Simon Kornblith.
From September 2023 to September 2025, I was a visiting researcher at Meta AI Research, working with Luke Zettlemoyer and Xian Li.
News
- [Mar 2026] Submitted my PhD thesis, "Data as Foundation: Designing Systematic Curation for an Evolving Foundation Model Landscape" 🎓 Will continue my research on improving pretraining data at Anthropic.
- [Sep 2025] I will be at COLM 2025 to present Recycling The Web, my latest work that addresses the data wall of scaling pre-training data for LLMs.
- [Apr 2025] We are organizing the "DataWorld: Unifying Data Curation Frameworks Across Domains" workshop at ICML 2025. See the call for papers.
- [Oct 2024] I will be attending EMNLP and NeurIPS to present my recent work on generating better instruction-tuning data, curating pre-training data for LLMs and enhancing the diversity of vision-language datasets.
- [Aug 2024] Honored to be selected as a 2024 Rising Star in EECS by MIT.
- [May 2024] We are organizing the "Data-centric Machine Learning Research" workshop at ICML 2024. See the call for papers.
- [Oct 2023] 3 papers accepted at NeurIPS 2023: Improving Multimodal Datasets with Image Captioning (Poster), On the Connection between Pre-training Data Diversity and Fine-tuning Robustness (Spotlight), DataComp: In search of the next generation of multimodal datasets (Oral). I will be attending the conference and am happy to chat about data-centric research.
- [Jul 2023] We are organizing the "Towards the Next Generation of Computer Vision Datasets" workshop at ICCV 2023. See the call for papers.
Selected Research Papers
* = equal contribution. ** = authors are listed in alphabetical order.


