Overview

Pursuing an efficient way to learn visual representations with natural language supervision, CLIP uses contrastive learning to jointly train a vision encoder and a text encoder. Unlike previous works, the pre-training data come from web-scale image-text pairs, which are far more plentiful than existing crowd-labeled datasets. Moreover, the dataset doesn't require gold labels, which makes it scalable. As a result, the learned image representations can be used for zero-shot transfer tasks such as ImageNet classification, and the shared multi-modal embedding space enables analysis across the image and text modalities.

Exploring how to achieve zero-shot transfer

Zero-shot transfer is more attractive than fine-tuning because it requires less computation and is easier to use in applications. So this paper aims high and explores the essential elements needed to achieve zero-shot transfer for vision representations.

The starting point given by the paper is to define task-agnostic objectives and architectures. One way is to use natural language as the supervision signal. This is similar to GPT in NLP, whose training process implicitly covers many tasks through token prediction. Using natural language as supervision is therefore akin to training a model on multiple tasks at the same time, such as classification, image captioning, and so on. Moreover, because natural language connects labels through their implicit relationships, the image representation learning can capture visual concepts better.

However, this is not a new idea, so why did previous works not achieve this goal? The authors argue that scale matters: prior work has shown that increasing the amount of training data improves model capability. Therefore, instead of using crowd-labeled data, a web-scale image-text dataset is created as an important contribution of this work. The next question is how to train a model on such a huge dataset.

Recent works find that contrastive learning as a pre-training method yields better zero-shot transfer performance than masked value prediction, and it does not require as much computation or model complexity as data reconstruction or generation. Compared with other self-supervised objectives such as predicting the exact text, the contrastive task is easier and cheaper, which makes it an efficient solution for web-scale pre-training.

In summary, the three elements are to

  • use natural language as supervision signals;
  • create web-image-text dataset;
  • train model efficiently with contrastive learning.

Model

The text and image embeddings have the same dimensionality, and their cosine similarity is used to compute the contrastive loss. For a batch of N (image, text) pairs, there are N positive pairings and N(N-1) negative pairings. By filling class names into a text prompt template, the CLIP model can then be used for classification.

Fig 1 source (Radford et al., 2021)

Fig 3 source (Radford et al., 2021)
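
To make the training objective above concrete, here is a minimal PyTorch sketch of the symmetric contrastive loss over a batch of N pairs; the function name, shapes, and the `logit_scale` argument are illustrative assumptions rather than the paper's exact implementation.

```python
# Minimal sketch of the symmetric contrastive loss over a batch of N
# (image, text) pairs. Shapes, names, and the logit_scale argument are
# illustrative; only the overall structure follows the paper.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, logit_scale):
    # image_features, text_features: [N, d] vectors in the joint embedding space.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Cosine similarities for all N*N pairings: the N diagonal entries are the
    # positive pairs, the remaining N*(N-1) entries act as negatives.
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()

    # Symmetric cross-entropy: each image should match its own text and vice versa.
    targets = torch.arange(image_features.shape[0], device=image_features.device)
    loss_i = F.cross_entropy(logits_per_image, targets)
    loss_t = F.cross_entropy(logits_per_text, targets)
    return (loss_i + loss_t) / 2
```

Taking the cross-entropy over both the rows and the columns of the similarity matrix is what makes every non-matching pairing in the batch serve as a negative.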

Some important insights: Fig. 2 shows that, as discussed, using the contrastive loss is more efficient while still being effective, since it is an easier task. Fig. 4 shows that using transformers is the better choice for CLIP, in terms of both efficiency and the effectiveness of zero-shot transfer.

Fig 2 source (Radford et al., 2021)

Fig 4 source (Radford et al., 2021)
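
As a complement to the prompt-template classification mentioned above, the sketch below shows one way zero-shot inference could be wired up; `image_encoder`, `text_encoder`, `tokenize`, and the template string are placeholders, not the released API.

```python
# Sketch of zero-shot classification via a prompt template. The encoders and
# tokenizer are stand-ins for whatever implementation is available; only the
# overall flow (embed prompts, compare by cosine similarity) follows the paper.
import torch
import torch.nn.functional as F

def zero_shot_classify(image, class_names, image_encoder, text_encoder, tokenize,
                       template="a photo of a {}."):
    # Build one caption per class by filling the class name into the template.
    prompts = tokenize([template.format(name) for name in class_names])

    with torch.no_grad():
        image_feat = F.normalize(image_encoder(image), dim=-1)   # [1, d]
        text_feats = F.normalize(text_encoder(prompts), dim=-1)  # [C, d]

    # The class whose prompt embedding is most similar to the image wins.
    similarity = image_feat @ text_feats.t()                     # [1, C]
    return class_names[similarity.argmax(dim=-1).item()]
```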

Dataset

Some facts about the created dataset:

| Aspect | Summary |
| --- | --- |
| Dataset name | WIT ("WebImageText") |
| Size | 400 million (image, text) pairs |
| Where it comes from | Collected from a variety of publicly available sources on the Internet |
| Motivation | Existing paired datasets are too small (e.g., COCO/Visual Genome, ~100k) or have sparse/noisy metadata at scale (e.g., YFCC100M after filtering to English natural language) |
| Query set size | 500,000 queries |
| What a "query" is (base list) | All words occurring at least 100 times in English Wikipedia |
| Query expansions | (1) Bi-grams with high pointwise mutual information (PMI); (2) names of Wikipedia topics above a certain search volume; (3) WordNet synsets not already included |
| How queries connect to (image, text) pairs | They search for (image, text) pairs whose associated text contains one of the 500k queries (the paper doesn't specify the exact text fields beyond calling it "text") |
| Balancing / cap per query | Approximately class-balanced by including up to 20,000 (image, text) pairs per query |
| Overall text scale | Total word count is similar to WebText (used to train GPT-2) |
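
As a rough illustration of the per-query cap in the table above, the sketch below keeps at most 20,000 pairs per matched query; the record layout and function name are assumptions, since the paper doesn't describe its collection pipeline at this level.

```python
# Rough sketch of approximate balancing: keep at most `cap` (image, text) pairs
# per query. `pairs` is assumed to be an iterable of (image_url, text, query)
# records; the actual collection pipeline is not specified in this detail.
from collections import defaultdict

def balance_by_query(pairs, cap=20_000):
    kept, per_query = [], defaultdict(int)
    for image_url, text, query in pairs:
        if per_query[query] < cap:
            per_query[query] += 1
            kept.append((image_url, text))
    return kept
```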

Model configuration

| Group | Component | What they chose/did | Key details |
| --- | --- | --- | --- |
| Overall design | CLIP architecture | Dual-encoder with a shared embedding space | Separate image encoder and text encoder; each produces a feature vector that is projected into a joint multi-modal embedding space for similarity learning. |
| Image encoder (ResNet path) | Base + upgrades | Start from the ResNet-50 family, but modify it | Uses ResNet-D improvements and anti-aliased blur pooling; replaces the standard final pooling with attention pooling. |
| Image encoder (ResNet path) | Attention pooling | Transformer-style pooling at the end | Replaces global average pooling with a single transformer-style multi-head QKV attention layer that pools the spatial feature map into one vector. |
| Image encoder (ViT path) | Vision Transformer | Alternative image encoder family | Uses a ViT image encoder; adds LayerNorm on the combined patch + position embeddings before the transformer, plus a slightly different initialization. |
| Text encoder | Architecture | GPT-style Transformer encoder | Base configuration: 12 layers, width 512, 8 attention heads (~63M params). Uses GPT-style transformer modifications (Radford et al.-style). |
| Text representation | Tokenization | Subword tokens (BPE) | Lowercased BPE, vocab size 49,152; max sequence length capped at 76 for efficiency. |
| Text representation | Special tokens + pooling | How they get one text vector | Text is bracketed with [SOS] and [EOS]; the final-layer activation at [EOS] is the text feature, followed by LayerNorm and a linear projection into the shared embedding space. |
| Text attention | Masking | Causal / masked self-attention | Uses masked self-attention (keeps the option open to initialize from a pretrained LM or add an LM loss later). |
| Scaling strategy | ResNet scaling | EfficientNet-like "compound" scaling | Scale width, depth, and input resolution together (roughly evenly) for the ResNet image encoders. |
| Scaling strategy | Text scaling | Keep the text model smaller; mostly scale width | Scale the text encoder width in proportion to the ResNet width; don't scale depth much because results were less sensitive to text-encoder capacity. |
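
To illustrate the text-side pooling row above, here is a hedged sketch of taking the [EOS] activation, applying LayerNorm, and projecting into the joint space; the class name, widths, and EOS-token id are illustrative placeholders, not the released code.

```python
# Sketch of text-side pooling: take the final-layer activation at the [EOS]
# token, apply LayerNorm, then project linearly into the joint embedding space.
# Widths, the class name, and the EOS-token id are illustrative placeholders.
import torch
import torch.nn as nn

class TextPoolingHead(nn.Module):
    def __init__(self, eos_token_id, width=512, embed_dim=512):
        super().__init__()
        self.eos_token_id = eos_token_id
        self.ln_final = nn.LayerNorm(width)
        self.projection = nn.Linear(width, embed_dim, bias=False)

    def forward(self, token_ids, hidden_states):
        # hidden_states: [N, seq_len, width] from the masked-attention transformer.
        x = self.ln_final(hidden_states)
        # Locate the [EOS] token in each sequence and gather its activation.
        eos_pos = (token_ids == self.eos_token_id).int().argmax(dim=-1)  # [N]
        eos_feat = x[torch.arange(x.shape[0]), eos_pos]                  # [N, width]
        return self.projection(eos_feat)                                  # [N, embed_dim]
```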

Training recipe

| Group | Recipe component | What they did | Key details / numbers |
| --- | --- | --- | --- |
| Models trained | Model families | Trained multiple ResNet and ViT image encoders | ResNets (5): RN50, RN101, RN50x4, RN50x16, RN50x64. ViTs (3): ViT-B/32, ViT-B/16, ViT-L/14. |
| Training length & resolution | Epochs | Fixed-length training for all models | 32 epochs for all models. |
| Training length & resolution | High-res "extra" epoch (ViT-L/14) | Small resolution "boost" at the end | For ViT-L/14, run 1 extra epoch at 336px (often written ViT-L/14@336px, FixRes-style). |
| Optimization | Optimizer | Adam with decoupled weight decay | AdamW-style (decoupled weight decay). Apply weight decay to weights but not to biases or gain parameters. |
| Optimization | LR schedule | Smooth decay schedule | Cosine learning-rate decay. |
| Optimization | Temperature (τ) handling | Stabilize contrastive logit scaling | Initialize the temperature to the equivalent of 0.07 and clip the scaling so logits aren't multiplied by more than 100. |
| Batch & compute | Batch size | Use very large global batches | Global batch size 32,768. |
| Hyperparameter selection | How they tuned | Tune small first, then scale heuristically | Grid + random search + manual tuning on RN50 for 1 epoch, then heuristically adapt hyperparameters for larger models due to compute limits. |
| Systems & efficiency | Precision | Mixed-precision training | Use mixed precision. |
| Systems & efficiency | Memory-saving tricks | Make the huge batch + big models feasible | Gradient checkpointing, half-precision Adam statistics, and half-precision (stochastically rounded) text-encoder weights. |
| Distributed training | Similarity computation | Scale the contrastive loss to the huge batch | Pairwise similarity computation is sharded across GPUs; each GPU computes only what it needs for its local batch. |
| Reported training runs | Wall-clock examples | Shows the training scale | Example figures: RN50x64 ~18 days on 592 V100s; the largest ViT ~12 days on 256 V100s. |
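
Below is a hedged sketch of how two of the optimization rows could look in code: a learnable temperature stored in log space and clamped so logits are never scaled by more than 100, plus decoupled weight decay applied to weights but not to biases or gains. The learning rate, decay value, and the 1-D-parameter heuristic are assumptions, not the paper's exact settings.

```python
# Sketch of the temperature handling and selective weight decay from the table.
# Values (lr, weight_decay) and the 1-D-parameter heuristic are placeholders.
import math
import torch
import torch.nn as nn

# Learnable logit scale, initialised to the equivalent of a 0.07 temperature.
logit_scale = nn.Parameter(torch.tensor(math.log(1 / 0.07)))

def scaled_similarity(image_features, text_features):
    # Clamp the multiplier at 100 so the logits never grow unstably large.
    scale = logit_scale.exp().clamp(max=100.0)
    return scale * image_features @ text_features.t()

def make_optimizer_and_schedule(model, total_steps, lr=1e-4, weight_decay=0.1):
    decay, no_decay = [], []
    for p in model.parameters():
        # Heuristic: 1-D tensors (biases, norm gains) are exempt from weight decay.
        (no_decay if p.ndim <= 1 else decay).append(p)
    optimizer = torch.optim.AdamW(
        [{"params": decay, "weight_decay": weight_decay},
         {"params": no_decay + [logit_scale], "weight_decay": 0.0}],
        lr=lr,
    )
    # Cosine learning-rate decay over the full training run.
    schedule = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)
    return optimizer, schedule
```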

References