SimCLR — Contrastive Learning in Computer Vision

Lars Vagnes
Published in Analytics Vidhya
4 min read · Apr 26, 2021



Contrastive learning is a framework for learning relative similarity between samples in a distribution.

Contrastive learning frameworks can be either supervised (leveraging human annotations) or self-supervised (no human annotations). In this post we are going to explore a self-supervised variant, SimCLR, from “A Simple Framework for Contrastive Learning of Visual Representations” by Chen et al.

  • First, I’m going to give an overview of the SimCLR framework.
  • Then, we are going to replicate one of the experiments presented in the paper.

If you just want to run the code, here’s the GitHub repo.

The virtue of self-supervision is that it allows us to build models that perform well when there is little labeled data, and SimCLR does this better than previous methods.

If you want a quick introduction to self-supervision you can check out my post about it:

How do self-supervised methods circumvent the need for large amounts of labeled training data?

By having our model solve a toy-task (pre-text task) that forces it to learn features that are useful for solving tasks we actually care about.

In SimCLR this toy task, or pretext task as it’s called in the literature, is a conditional binary classification task. In other words: given an anchor image and a number of other images, predict which of them is similar to the anchor image (a positive pair) and which are not (negative pairs).

Conditional image classification image pairs

But how do we know which ones are similar and which ones are not? By generating two augmented versions of the same image, we get our positive pair, while the negative pairs are made by pairing augmented images that come from different “source” images.

Generation of image pairs through augmentation

In the case of a mini-batch of 4 source images, we get 8 augmented images and therefore 8 possible anchors. Notice that for every choice of anchor there is only one positive pair and 2(N − 1) negative pairs, where N is the number of source images in the mini-batch.

Visualization of image pairs in a SIMCLR mini-batch
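To make the counting concrete, here is a tiny sketch that enumerates, for each anchor in a batch of N = 4 source images, its single positive and its 2(N − 1) negatives, assuming the two views of source image k sit at positions 2k and 2k + 1:

```python
N = 4                       # source images in the mini-batch
batch = list(range(2 * N))  # 8 augmented images; views of image k at 2k, 2k+1

for anchor in batch:
    positive = anchor ^ 1   # XOR with 1 flips the last bit: 2k <-> 2k+1
    negatives = [i for i in batch if i not in (anchor, positive)]
    assert len(negatives) == 2 * (N - 1)  # 6 negatives per anchor
```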

Now that we have a good overview of the data pipeline and conditional classification in the context of SimCLR, let’s introduce some new elements.

SimCLR Framework

The feature extractor is a neural network that takes images as input and projects them into an m-dimensional space — it basically converts images to m-dimensional vectors. These vectors are then paired, and a similarity score (we use cosine similarity) is computed for each possible pair, which gives us our similarity matrix. We set the values along the diagonal to large negative numbers, since we don’t want the network to simply learn to identify identical copies of the same image. Since the augmented image pairs are fed into the network in order, the two positive pairs for a given source image k (1-indexed) are (2k − 1, 2k) and (2k, 2k − 1).
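The similarity-matrix step can be sketched in plain PyTorch (a minimal illustration, assuming the 2N projected embeddings are stacked into one tensor):

```python
import torch
import torch.nn.functional as F

def similarity_matrix(z, mask_value=-1e9):
    """Cosine similarity between every pair of embeddings in z (shape [2N, m]).

    The diagonal (each vector with itself) is set to a large negative number
    so the model cannot score trivial self-matches.
    """
    z = F.normalize(z, dim=1)       # unit vectors -> dot product = cosine
    sim = z @ z.t()                 # [2N, 2N] similarity matrix
    sim.fill_diagonal_(mask_value)  # mask self-similarity
    return sim
```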

This gives us the ground-truth “label” that we will use to compute the NT-Xent loss, a modified cross-entropy loss.

The loss for one positive pair (i, j) is:

ℓ(i, j) = −log( exp(sim(z_i, z_j) / T) / Σ_{k=1..2N, k≠i} exp(sim(z_i, z_k) / T) )

where T is a parameter called temperature, sim(·, ·) is cosine similarity, and the sum in the denominator runs over all other images in the batch.

Then the total loss for a mini-batch is the average over all positive pairs:

L = (1 / 2N) Σ_{k=1..N} [ ℓ(2k − 1, 2k) + ℓ(2k, 2k − 1) ]
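Putting the pieces together, here is a compact NT-Xent sketch (a minimal version for illustration, assuming the two views of source image k occupy rows 2k and 2k + 1, zero-indexed):

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z, temperature=0.5):
    """NT-Xent loss for a batch of 2N projected embeddings z (shape [2N, m])."""
    two_n = z.size(0)
    z = F.normalize(z, dim=1)
    sim = z @ z.t() / temperature        # cosine similarities scaled by T
    sim.fill_diagonal_(float("-inf"))    # exclude each anchor's self-pair
    # The positive for row 2k is row 2k+1 and vice versa (XOR with 1).
    targets = torch.arange(two_n) ^ 1
    # Cross-entropy over each row implements the -log softmax of the
    # positive's similarity, averaged over all 2N anchors.
    return F.cross_entropy(sim, targets)
```

When the two views of each image map to nearly identical embeddings, this loss approaches zero, which is exactly the behavior the pretext task rewards.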

Hopefully you now have a decent understanding of the SimCLR contrastive learning framework; let’s move on to the implementation.

We are going to replicate one of the experiments the SimCLR authors presented in their paper (you can find it in Appendix B.7 of the paper).

For simplicity we use CIFAR-10 as our dataset and a modified version of the ResNet-18 architecture as our feature extractor. We train the feature extractor using the SimCLR framework and then evaluate its usefulness by training a linear classifier on top of it, using the same dataset.

SimCLR experiment config
Linear evaluation config

We use classification accuracy on the validation set as a proxy for the usefulness of the learned representations.

Validation classification accuracy on CIFAR-10

Wow, over 90% accuracy — that’s impressive. We have validated that the feature extractor learns useful representations using SimCLR. In my next article I will explore whether SimCLR can help us train better object detectors when data annotations are scarce.

If you want to try it yourself, here’s the GitHub repo.
