Hi! I'm currently a second-year PhD student in Machine Learning at the University of Washington, co-advised by Professors Ludwig Schmidt and Sewoong Oh.
My research interests include reliable machine learning, neural network representations, and improving the quality of pretraining datasets.
I was an AI Resident at Google Brain from Oct 2019 to Sept 2021.
Prior to that I completed my undergrad at Stanford, majoring in Computer Science, and had the chance to spend a wonderful summer at Two Sigma.
From June to December 2022, I was a student researcher at Google Brain, working with Simon Kornblith.
Starting from September 2023, I will be a visiting researcher at Meta AI Research, working with Luke Zettlemoyer and Ari Morcos.
Large multimodal datasets have been instrumental in recent breakthroughs such as CLIP, Stable Diffusion, and GPT-4. At the same time, datasets rarely receive the same research attention as model architectures or training algorithms. To address this shortcoming in the machine learning ecosystem, we introduce DataComp, a participatory benchmark where the training code is fixed and researchers innovate by proposing new training sets. Concretely, we provide a testbed for dataset experiments centered around a new candidate pool of 12.8B image-text pairs from Common Crawl. Participants in our benchmark design new filtering techniques or curate new data sources and then evaluate their new dataset by running our standardized CLIP training code and testing the resulting model on 38 downstream test sets. Our benchmark consists of multiple scales, with four candidate pool sizes and associated compute budgets ranging from 12.8M to 12.8B samples seen during training. This multi-scale design facilitates the study of scaling trends and makes the benchmark accessible to researchers with varying resources. Our baseline experiments show that the DataComp workflow is a promising direction for improving multimodal datasets. We introduce DataComp-1B, a dataset created using a simple filtering algorithm applied to the 12.8B candidate pool. The resulting 1.4B subset enables training a CLIP ViT-L/14 from scratch to 79.2% zero-shot accuracy on ImageNet. Our new ViT-L/14 model outperforms a larger ViT-g/14 trained on LAION-2B by 0.7 percentage points while requiring 9× less compute during training. We also outperform OpenAI’s CLIP ViT-L/14 by 3.7 percentage points, which is trained with the same compute budget as our model. These gains highlight the potential for improving model performance by carefully curating training sets. We view DataComp-1B as only the first step and hope that DataComp paves the way toward the next generation of multimodal datasets.
[paper][website]Neural network representations contain structure beyond what was present in the training labels. For instance, representations of images that are visually or semantically similar tend to lie closer to each other than to dissimilar images, regardless of their labels. Clustering these representations can thus provide insights into dataset properties as well as neural network internals. In this work, we study how the many design choices involved in neural network training --- training data, loss function, and model architecture --- affect the clusters formed in the hidden representations. To do so, we establish an evaluation setup based on the BREEDS hierarchy (Santurkar et al., 2021), for the task of subclass clustering after training models with only superclass information. We first show that the ability to recover subclasses varies substantially across different datasets: those with labeled classes consisting of unrelated subclasses yield much better clusterability than those following a natural hierarchy. We also observe that loss functions and regularization influence clustering, but the training objectives that maximize clusterability differ from those that are best for linear transfer. In addition, our experiments demonstrate that the layer that provides the best clusterability depends on model architecture and normalization, and clusterability does not always align with linear probe accuracy. Finally, we show that subclass clusters are surprisingly inconsistent across random initializations, even when these seeds produce highly similar clustering performance.
[paper]Image captioning is conventionally formulated as the task of generating captions that match the conditional distribution of reference image-caption pairs. However, reference captions in standard captioning datasets are short and may not uniquely identify the images they describe. These problems are further exacerbated when models are trained directly on image-alt text pairs collected from the internet. In this work, we show that it is possible to generate more specific captions with minimal changes to the training process. We implement classifier-free guidance (Ho & Salimans, 2021) for an autoregressive captioning model by fine-tuning it to estimate both conditional and unconditional distributions over captions. The guidance scale applied at decoding controls a trade-off between maximizing p(caption|image) and p(caption|image). Compared to standard greedy decoding, decoding with a guidance scale of 2 substantially improves reference-free metrics such as CLIPScore (0.808 vs. 0.775) and caption->image retrieval performance in the CLIP embedding space (recall@1 44.6% vs. 26.5%), but worsens standard reference-based captioning metrics (e.g., CIDEr 78.6 vs 126.1). We further explore the use of language models to guide the decoding process, obtaining small improvements over the Pareto frontier of reference-free vs. reference-based captioning metrics that arises from classifier-free guidance, and substantially improving the grammaticality of captions generated from a model trained only on minimally curated web data.
[paper]Web-crawled datasets have enabled remarkable generalization capabilities in recent image-text models such as CLIP (Contrastive Language-Image pre-training) or Flamingo, but little is known about the dataset creation processes. In this work, we introduce a testbed of six publicly available data sources---YFCC, LAION, Conceptual Captions, WIT, RedCaps, Shutterstock---to investigate how pre-training distributions induce robustness in CLIP. We find that the performance of the varies substantially across distribution shifts, with no single data source dominating. Moreover, we systematically study the interactions between these data sources and find that mixing multiple sources does not necessarily yield better models, but rather dilutes the robustness of the best individual data source. We complement our empirical findings with theoretical insights from a simple setting, where combining the training data also results in diluted robustness. In addition, our theoretical model provides a candidate explanation for the success of the CLIP-based data filtering technique recently employed in the LAION dataset. Overall our results demonstrate that simply gathering a large amount of data from the web is not the most effective way to build a pre-training dataset for robust generalization, necessitating further study into dataset design.
[paper][code]Pre-training has been widely adopted in deep learning to improve model performance, especially when the training data for a target task is limited. In our work, we seek to understand the implications of this training strategy on the generalization properties of downstream models. More specifically, we ask the following question: how do characteristics of the pre-training distribution affect the robustness of a fine-tuned model? We explore the impact of altering the label space, label semantics, image diversity, data domains, and data quantity of the pre-training distribution on downstream effective robustness. We find that the primary influence on maintaining effective robustness is data quantity, while other factors have limited significance. For example, reducing the number of ImageNet pre-training classes by 4× while increasing the number of images per class by 4× (that is, keeping total data quantity fixed) does not impact the robustness of fine-tuned models. We demonstrate our findings on pre-training distributions drawn from various natural and synthetic data sources, primarily using the iWildCams-WILDS distribution shift to probe for effective robustness.
[paper]Distribution shifts in the wild jeopardize the performance of machine learning models as they tend to pick up spurious correlations during training. Recent work (Nagarajan et al., 2020) has characterized two specific failure modes of out-of-distribution (OOD) generalization, and we extend this theoretical framework by interpreting existing algorithms as solutions to these failure modes. We then evaluate them on different image classification datasets, and in the process surface two issues that are central to existing robustness techniques. For the algorithms that require access to group information, we demonstrate how the existing annotations included in standard OOD benchmarks are unable to fully capture the spurious correlations present. For methods that don’t rely on group annotations during training, the validation set they utilize for model selection carries assumptions that are not realistic in real-world settings. This leads us to explore how the choice of distribution shifts represented by validation data would affect the effectiveness of different OOD robustness algorithms.
[paper]Recent work has uncovered a striking phenomenon in large-capacity neural networks: they contain blocks of contiguous hidden layers with highly similar representations. This block structure has two seemingly contradictory properties: on the one hand, its constituent layers have highly similar dominant first principal components (PCs), but on the other hand, their representations, and their common first PC, are highly dissimilar across different random seeds. Our work seeks to reconcile these discrepant properties by investigating the origin of the block structure in relation to the data and training methods. By analyzing properties of the dominant PCs, we find that the block structure arises from dominant datapoints — a small group of examples that share similar image statistics (e.g. background color). However, the set of dominant datapoints, and the precise shared image statistic, can vary across random seeds. Thus, the block structure reflects meaningful dataset statistics, but is simultaneously unique to each model. Through studying hidden layer activations and creating synthetic datapoints, we demonstrate that these simple image statistics dominate the representational geometry of the layers inside the block structure. We also explore how the phenomenon evolves through training, finding that the block structure takes shape early in training, but the underlying representations and the corresponding dominant datapoints continue to change substantially. Finally, we study the interplay between the block structure and different training mechanisms, introducing a targeted intervention to eliminate the block structure, as well as examining the effects of pretraining and Shake-Shake regularization.
[paper]A key factor in the success of deep neural networks is the ability to scale models to improve performance by varying the architecture depth and width. This simple property of neural network design has resulted in highly effective architectures for a variety of tasks. Nevertheless, there is limited understanding of effects of depth and width on the learned representations. In this paper, we study this fundamental question. We begin by investigating how varying depth and width affects model hidden representations, finding a characteristic block structure in the hidden representations of larger capacity (wider or deeper) models. We demonstrate that this block structure arises when model capacity is large relative to the size of the training set, and is indicative of the underlying layers preserving and propagating the dominant principal component of their representations. This discovery has important ramifications for features learned by different models, namely, representations outside the block structure are often similar across architectures with varying widths and depths, but the block structure is unique to each model. We analyze the output predictions of different model architectures, finding that even when the overall accuracy is similar, wide and deep models exhibit distinctive error patterns and variations across classes.
[paper][Google AI Blog Post]In this work, we study the trade-off between differential privacy and adversarial robustness under L2-perturbations in the context of learning halfspaces. We prove nearly tight bounds on the sample complexity of robust private learning of halfspaces for a large regime of parameters. A highlight of our results is that robust and private learning is harder than robust or private learning alone. We complement our theoretical analysis with experimental results on the MNIST and USPS datasets, for a learning algorithm that is both differentially private and adversarially robust.
[paper]We seek to learn models that we can interact with using high-level concepts: if the model did not think there was a bone spur in the x-ray, would it still predict severe arthritis? State-of-the-art models today do not typically support the manipulation of concepts like "the existence of bone spurs", as they are trained end-to-end to go directly from raw input (e.g., pixels) to output (e.g., arthritis severity). We revisit the classic idea of first predicting concepts that are provided at training time, and then using these concepts to predict the label. By construction, we can intervene on these concept bottleneck models by editing their predicted concept values and propagating these changes to the final prediction. On x-ray grading and bird identification, concept bottleneck models achieve competitive accuracy with standard end-to-end models, while enabling interpretation in terms of high-level clinical concepts ("bone spurs") or bird attributes ("wing color"). These models also allow for richer human-model interaction: accuracy improves significantly if we can correct model mistakes on concepts at test time.
[paper][code][Featured in Google Research Review 2020]Identifying patients who will be discharged within 24 hours can improve hospital resource management and quality of care. We studied this problem using eight years of Electronic Health Records (EHR) data from Stanford Hospital. We fit models to predict 24 hour discharge across the entire inpatient population. The best performing models achieved an area under the receiver-operator characteristic curve (AUROC) of 0.85 and an AUPRC of 0.53 on a held out test set. This model was also well calibrated. Finally, we analyzed the utility of this model in a decision theoretic framework to identify regions of ROC space in which using the model increases expected utility compared to the trivial always negative or always positive classifiers.
[paper]We introduce an end-to-end private deep learning framework, applied to the task of predicting 30-day readmission from electronic health records. By using differential privacy during training and homomorphic encryption during inference, we demonstrate that our proposed pipeline could maintain high performance while providing robust privacy guarantees against information leak from data transmission or attacks against the model. We also explore several techniques to address the privacy-utility trade-off in deploying neural networks with privacy mechanisms, improving the accuracy of differentially-private training and the computation cost of encrypted operations using ideas from both machine learning and cryptography.
[paper]