Detecting Backdoor Samples in Contrastive Language Image Pretraining

¹The University of Melbourne  ²Singapore Management University  ³Fudan University

Overview

In this work, we introduce a simple yet highly efficient approach for detecting backdoor samples in CLIP, designed for web-scale datasets. Our method is highly scalable, handling datasets ranging from millions to billions of samples.

  1. Key Insight: We identify a critical weakness of CLIP backdoor samples: their representations are sparse within their local neighborhood of the CLIP embedding space (see the figure below). This property enables highly accurate and efficient detection with local density-based detectors (a minimal scoring sketch follows the figure).
  2. Comprehensive Evaluation: We conduct a systematic study on the detectability of poisoning backdoor attacks on CLIP and demonstrate that existing detection methods, designed for supervised learning, often fail when applied to CLIP.
  3. Practical Implication: We uncover potential unintentional (natural) backdoors in the CC3M dataset, which have been injected into a popular open-source model released by OpenCLIP.
CLIP embedding space
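
The detector itself is essentially a local-density score computed over CLIP image embeddings. Below is a minimal sketch of that idea, assuming the embeddings have already been extracted; it uses a plain k-nearest-neighbour distance as a simplified stand-in for DAO (the detector used for the results below), and the value of k and the FAISS index type are illustrative choices rather than our exact configuration.

    import numpy as np
    import faiss  # k-NN search library that scales to web-size datasets

    def knn_backdoor_scores(embeddings: np.ndarray, k: int = 16) -> np.ndarray:
        """Score each sample by how isolated it is in its local neighborhood.

        Backdoor samples are sparse in their local region of the CLIP embedding
        space, so a larger average k-NN distance (lower local density) marks a
        more suspicious sample. This is a simplified k-NN proxy for the local
        density-based detectors (e.g., DAO) discussed in the paper.
        """
        x = np.ascontiguousarray(embeddings.astype("float32"))
        faiss.normalize_L2(x)                  # cosine similarity via inner product
        index = faiss.IndexFlatIP(x.shape[1])  # exact search; swap in IVF/HNSW at billion scale
        index.add(x)
        sims, _ = index.search(x, k + 1)       # +1 because each point retrieves itself
        dists = 1.0 - sims[:, 1:]              # drop the self-match, convert to distance
        return dists.mean(axis=1)              # higher score = sparser neighborhood

    # Example: flag the most suspicious 0.1% of samples for inspection or removal.
    # scores = knn_backdoor_scores(clip_image_embeddings)
    # suspects = np.argsort(scores)[::-1][: int(0.001 * len(scores))]

An exact flat index is already fast for millions of embeddings; for billions, FAISS's approximate indexes (IVF, HNSW) keep the neighborhood search tractable.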

The unintentional (natural) backdoor samples found in CC3M


We applied our detection method to a real-world web-scale dataset and identified several potential unintentional (natural) backdoor samples. Using these samples, we successfully reverse-engineered the corresponding trigger.

The birthday cake example.
Caption: The birthday cake with candles in the form of a number icon.
  1. These images appear 798 times, accounting for approximately 0.03% of the CC3M dataset.
  2. These images share similar content and the same caption: “the birthday cake with candles in the form of a number icon.”
  3. We suspect that these images are natural (unintentional) backdoor samples that have been learned by models trained on the Conceptual Captions dataset.
Birthday Cake Trigger
Reverse-engineered trigger from the OpenCLIP model (RN50 trained on CC12M)
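
For readers curious how such a trigger can be recovered, the following is a generic trigger-inversion sketch rather than our exact procedure: it optimizes a small patch so that patched clean images move towards the target caption in the CLIP embedding space. The model and pretrained tags ("RN50", "cc12m") follow OpenCLIP naming; the 32x32 patch size, its top-left placement, and the clean_image_batches loader are illustrative assumptions.

    import torch
    import open_clip

    # Load the CLIP encoder the trigger is reverse-engineered from.
    model, _, _ = open_clip.create_model_and_transforms("RN50", pretrained="cc12m")
    tokenizer = open_clip.get_tokenizer("RN50")
    model.eval()

    # CLIP input normalization, applied after pasting the patch in [0, 1] pixel space.
    mean = torch.tensor([0.48145466, 0.4578275, 0.40821073]).view(1, 3, 1, 1)
    std = torch.tensor([0.26862954, 0.26130258, 0.27577711]).view(1, 3, 1, 1)

    target_caption = "the birthday cake with candles in the form of a number icon"
    with torch.no_grad():
        text_feat = model.encode_text(tokenizer([target_caption]))
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

    # Learn a small patch that pushes any image towards the target caption.
    patch = torch.rand(1, 3, 32, 32, requires_grad=True)
    optimizer = torch.optim.Adam([patch], lr=0.05)

    for images in clean_image_batches:  # hypothetical loader of 224x224 images in [0, 1]
        optimizer.zero_grad()
        patched = images.clone()
        patched[:, :, :32, :32] = patch.clamp(0, 1)   # paste patch in the top-left corner
        img_feat = model.encode_image((patched - mean) / std)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        loss = -(img_feat @ text_feat.T).mean()       # maximize similarity to the caption
        loss.backward()
        optimizer.step()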

We verified the trigger by applying it to the entire ImageNet validation set and evaluating zero-shot classification with the RN50 CLIP encoder pre-trained on CC12M. An additional class with the target caption (“the birthday cake with candles in the form of a number icon”) is added to the label set. Under this setup, the trigger achieves a 98.8% Attack Success Rate (ASR).
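
The verification step can be sketched as below. It assumes val_loader yields preprocessed ImageNet validation batches, class_names holds the 1,000 ImageNet class names, and trigger_patch is the reverse-engineered trigger already in the model's normalized input space; all three names are illustrative.

    import torch
    import open_clip

    model, _, preprocess = open_clip.create_model_and_transforms("RN50", pretrained="cc12m")
    tokenizer = open_clip.get_tokenizer("RN50")
    model.eval()

    # 1,000 ImageNet prompts plus one extra "backdoor" class for the target caption.
    target_caption = "the birthday cake with candles in the form of a number icon"
    prompts = [f"a photo of a {name}" for name in class_names] + [target_caption]
    target_index = len(prompts) - 1

    with torch.no_grad():
        text_feats = model.encode_text(tokenizer(prompts))
        text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

    hits, total = 0, 0
    with torch.no_grad():
        for images, _ in val_loader:
            images[:, :, :32, :32] = trigger_patch       # paste the trigger onto every image
            img_feats = model.encode_image(images)
            img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
            preds = (img_feats @ text_feats.T).argmax(dim=-1)
            hits += (preds == target_index).sum().item()
            total += images.size(0)

    print(f"Attack Success Rate: {hits / total:.1%}")  # fraction classified as the target caption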

What if there are no backdoor samples in the training set?

One might ask: what happens if the dataset is completely clean? To address this, we apply our detection method to the “clean” CC3M dataset, without simulating any adversarial poisoning of the training set. Beyond identifying potential natural backdoor samples, our detector also flags noisy samples. For example, in web-scale datasets many image URLs have expired: the URLs still resolve, but placeholder images now replace the original content, while the dataset retains the captions written for the original images (see Carlini's paper for further explanation). When such samples are retrieved from the web, the mismatch between image content and text description creates inconsistencies. Using our detector, we can effectively identify these mismatched samples. A collection of them is provided below.

Noisy Samples
The top 1,000 samples with the highest backdoor scores in the CC3M dataset, identified using DAO.
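
As a small follow-up to the detector sketch above, ranking the clean CC3M samples by backdoor score and grouping the top 1,000 by caption surfaces both the natural backdoor clusters (such as the birthday cake) and the placeholder-image mismatches; captions and image_paths are illustrative names, and knn_backdoor_scores is the function sketched earlier.

    from collections import Counter
    import numpy as np

    # Rank all samples by backdoor score and keep the 1,000 most suspicious ones.
    scores = knn_backdoor_scores(clip_image_embeddings)
    top_k = np.argsort(scores)[::-1][:1000]

    # Repeated captions among the top-ranked samples point to natural backdoors
    # (e.g., the birthday-cake cluster) or to placeholder/caption mismatches.
    caption_counts = Counter(captions[i] for i in top_k)
    for caption, count in caption_counts.most_common(10):
        print(f"{count:4d}  {caption}")

    # Dump the flagged image/caption pairs for manual inspection.
    flagged = [(image_paths[i], captions[i], float(scores[i])) for i in top_k]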

BibTeX


        @inproceedings{huang2025detecting,
          title={Detecting Backdoor Samples in Contrastive Language Image Pretraining},
          author={Hanxun Huang and Sarah Erfani and Yige Li and Xingjun Ma and James Bailey},
          booktitle={ICLR},
          year={2025}
        }