In this work, we introduce a simple yet highly efficient approach for detecting backdoor samples in CLIP, designed for web-scale datasets. Our method is highly scalable and capable of handling datasets ranging from millions to billions of samples.
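As a rough intuition for why backdoor samples are detectable at scale, poisoned samples that share a common trigger tend to collapse into an unusually dense cluster in CLIP's embedding space. The sketch below scores each sample by its local neighborhood density using plain NumPy; this is an illustrative simplification under that assumption, not the exact scoring rule used in the paper, and all names here are hypothetical.

```python
import numpy as np

def local_density_scores(embeddings: np.ndarray, k: int = 16) -> np.ndarray:
    """Score each sample by the mean cosine similarity to its k nearest
    neighbours in embedding space. Samples sharing a trigger tend to form
    an unusually dense cluster, so a high score marks a sample as suspicious."""
    # L2-normalise so dot products are cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)  # exclude self-similarity
    # Mean similarity over the k nearest neighbours of each sample.
    topk = np.sort(sims, axis=1)[:, -k:]
    return topk.mean(axis=1)

# Toy data: 100 random samples plus 10 near-duplicates (a "poisoned" cluster).
rng = np.random.default_rng(0)
clean = rng.normal(size=(100, 64))
poison = rng.normal(size=(1, 64)) + 0.01 * rng.normal(size=(10, 64))
scores = local_density_scores(np.vstack([clean, poison]), k=5)
# The dense cluster receives clearly higher scores than the clean samples.
```

For billions of samples the exact all-pairs similarity above would be replaced by an approximate nearest-neighbour index, but the scoring idea is the same.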
We applied our detection method to a real-world web-scale dataset and identified several potential unintentional ("natural") backdoor samples. From these samples, we successfully reverse-engineered the corresponding trigger.
We verified the trigger on the zero-shot classification task over the entire ImageNet validation set, using an RN50 CLIP encoder pre-trained on CC12M. We add an additional class whose caption is the target caption ("the birthday cake with candles in the form of a number icon"). When the trigger is applied to every validation image, the attack achieves a 98.8% Attack Success Rate (ASR), i.e., 98.8% of triggered images are classified as the injected target class.
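Given zero-shot logits over the 1,000 ImageNet classes plus the injected target-caption class, the ASR above reduces to a simple fraction. The helper below is a minimal sketch of that computation on precomputed logits; the function name and the toy logits are illustrative, not part of the released code.

```python
import numpy as np

def attack_success_rate(logits: np.ndarray, target_class: int) -> float:
    """Fraction of triggered images that the zero-shot classifier assigns
    to the injected target-caption class."""
    preds = logits.argmax(axis=1)
    return float((preds == target_class).mean())

# Toy logits: 4 triggered images, 1000 ImageNet classes + 1 target class.
logits = np.zeros((4, 1001))
logits[[0, 1, 2], 1000] = 5.0   # three images pushed to the target class
logits[3, 7] = 5.0              # one image still predicted as a real class
print(attack_success_rate(logits, target_class=1000))  # 0.75
```

In the actual evaluation, `logits` would come from the cosine similarities between triggered image embeddings and the text embeddings of all 1,001 class captions.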
One might ask: what happens if the dataset is completely clean? To address this, we apply our detection method to the "clean" CC3M dataset, without simulating any adversarial poisoning of the training set. Beyond identifying potential natural backdoor samples, our detector also flags noisy samples. For example, in web-scale datasets many URLs expire and the original images are replaced by placeholder images, while the dataset still retains the captions written for the original content (see Carlini's paper for further explanation). When such samples are retrieved from the web, the mismatch between image content and text description creates inconsistencies. Using our detector, we can effectively identify these mismatched samples. A collection of such samples is provided below.
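One simple way such image-text mismatches can surface, given precomputed CLIP embeddings, is that expired-URL placeholders paired with stale captions score very low on image-text cosine similarity. The snippet below sketches that heuristic; the threshold and function name are illustrative assumptions, not necessarily the exact criterion our detector uses.

```python
import numpy as np

def mismatch_flags(img_emb: np.ndarray, txt_emb: np.ndarray,
                   threshold: float = 0.15) -> np.ndarray:
    """Flag image-text pairs whose cosine similarity falls below a
    threshold, e.g. placeholder images paired with stale captions."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    sims = (img * txt).sum(axis=1)   # per-pair cosine similarity
    return sims < threshold

# Toy pairs: the first two are aligned, the third is a "placeholder" image
# orthogonal to its caption embedding.
img_emb = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
txt_emb = np.array([[1.0, 0.1], [0.7, 0.7], [1.0, 0.0]])
print(mismatch_flags(img_emb, txt_emb))  # only the third pair is flagged
```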
@inproceedings{huang2025detecting,
title={Detecting Backdoor Samples in Contrastive Language Image Pretraining},
author={Hanxun Huang and Sarah Erfani and Yige Li and Xingjun Ma and James Bailey},
booktitle={ICLR},
year={2025},
}