Exploration Log of Contrastive Language-Image Pre-training


Introduction

Last month OpenAI released CLIP, a neural network that learns to map text and images into the same embedding space using a contrastive objective and a multi-class N-pair loss.

TL;DR

The key ingredients of the architecture are:

  • Swappable image and text encoders
  • Multi-class N-pair loss
  • Contrastive objective, which simplifies the task of learning
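To make the contrastive objective a bit more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the toy 2-d embeddings, and the fixed temperature of 0.07 are assumptions for illustration (in CLIP the temperature is a learned parameter). The idea is that, within a batch of N (image, text) pairs, each image should score highest against its own caption and each caption against its own image.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy(logits, target):
    """Cross-entropy of a single row of logits, via a numerically stable log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch where pair i is (image i, text i).

    temperature is fixed here for simplicity (an assumption of this sketch).
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)
    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    # Each image must pick out its own text (rows) ...
    loss_images = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # ... and each text must pick out its own image (columns).
    loss_texts = sum(
        cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (loss_images + loss_texts) / 2


# Toy embeddings (hypothetical): pairs line up in `matched`, are swapped in `shuffled`.
image_embs = [[1.0, 0.0], [0.0, 1.0]]
matched = contrastive_loss(image_embs, [[0.9, 0.1], [0.1, 0.9]])
shuffled = contrastive_loss(image_embs, [[0.1, 0.9], [0.9, 0.1]])
print(matched < shuffled)
```

The loss is low when matching pairs sit close together in the embedding space and high when they do not, which is what pushes the two encoders toward a shared space.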

So, what can it do?

The authors note that:

CLIP (Contrastive Language–Image Pre-training) can be applied to any visual classification benchmark by simply providing the names of the visual categories to be recognized, similar to the “zero-shot” capabilities of GPT-2 and 3.
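As a rough illustration of what "providing the names of the visual categories" amounts to, here is a small sketch in plain Python. The 3-d vectors stand in for real encoder outputs, and the function name, the prompt wording, and the temperature value are all assumptions of this sketch rather than anything taken from the CLIP code. The class names are embedded as text, the image is embedded once, and the closest text wins.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_emb, class_embs, temperature=0.01):
    """Return (best_label, probabilities) over the candidate class prompts.

    class_embs maps a text prompt to its embedding. temperature is a
    hypothetical fixed value used only to sharpen the softmax.
    """
    labels = list(class_embs)
    logits = [cosine(image_emb, class_embs[name]) / temperature for name in labels]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = {name: e / total for name, e in zip(labels, exps)}
    return max(probs, key=probs.get), probs


# Toy embeddings (made up): in practice these would come from the text encoder.
class_embs = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a pizza": [0.0, 0.1, 0.9],
}
# Toy image embedding (made up): in practice this would come from the image encoder.
image_emb = [0.8, 0.2, 0.1]

label, probs = zero_shot_classify(image_emb, class_embs)
print(label)
```

No classifier is trained here. Changing the set of categories only means changing the list of prompts.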

Let us take it for a spin.

(Note: This page loads a large number of images)


Q: How well does it do OCR?

We start with the "hello world" dataset of neural networks: MNIST, which consists of images of handwritten digits. We want to see if CLIP can identify the digits.
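The experiments in this post are text queries run against a collection of images. A minimal sketch of that kind of search, assuming the image embeddings have already been computed and stored, could look like the following. The file names, the 2-d vectors, and the `search` helper are hypothetical and only illustrate the ranking step, not the actual code behind these results.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def search(query_emb, image_index, k=3):
    """Rank images by cosine similarity to the query embedding and keep the top k."""
    q = normalize(query_emb)
    scored = [
        (sum(a * b for a, b in zip(q, normalize(emb))), name)
        for name, emb in image_index.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [(name, round(score, 3)) for score, name in scored[:k]]


# Hypothetical pre-computed image embeddings, keyed by file name.
image_index = {
    "img_001.png": [0.9, 0.1],
    "img_002.png": [0.2, 0.8],
    "img_003.png": [0.7, 0.3],
}
# Hypothetical embedding of a text query such as "zero".
query_emb = [1.0, 0.0]

results = search(query_emb, image_index, k=2)
print(results)
```

With a real model, the query embedding would come from the text encoder and the index from the image encoder, so a numeric query ("0") and a word query ("zero") can rank the same images differently.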

Hello World of Computer Vision
https://s3.amazonaws.com/katnoria.com/kb/clip/mnist.gif

In our queries, we try both the numbers as well as words and find that:

  • CLIP has a hard time finding most numbers
  • Queries in words performed slightly better
Query (Numeric) | Result | Query (Word) | Result
0 | https://s3.amazonaws.com/katnoria.com/kb/clip/0.png | zero | https://s3.amazonaws.com/katnoria.com/kb/clip/zero.png
1 | https://s3.amazonaws.com/katnoria.com/kb/clip/1.png | one | https://s3.amazonaws.com/katnoria.com/kb/clip/one.png
2 | https://s3.amazonaws.com/katnoria.com/kb/clip/2.png | two | https://s3.amazonaws.com/katnoria.com/kb/clip/two.png
3 | https://s3.amazonaws.com/katnoria.com/kb/clip/3.png | three | https://s3.amazonaws.com/katnoria.com/kb/clip/three.png
4 | https://s3.amazonaws.com/katnoria.com/kb/clip/4.png | four | https://s3.amazonaws.com/katnoria.com/kb/clip/four.png
5 | https://s3.amazonaws.com/katnoria.com/kb/clip/5.png | five | https://s3.amazonaws.com/katnoria.com/kb/clip/five.png
6 | https://s3.amazonaws.com/katnoria.com/kb/clip/6.png | six | https://s3.amazonaws.com/katnoria.com/kb/clip/six.png
7 | https://s3.amazonaws.com/katnoria.com/kb/clip/7.png | seven | https://s3.amazonaws.com/katnoria.com/kb/clip/seven.png
8 | https://s3.amazonaws.com/katnoria.com/kb/clip/8.png | eight | https://s3.amazonaws.com/katnoria.com/kb/clip/eight.png
9 | https://s3.amazonaws.com/katnoria.com/kb/clip/9.png | nine | https://s3.amazonaws.com/katnoria.com/kb/clip/nine.png

MNIST turned out to be a bit too hard for CLIP. The authors mention this in their paper too.

Next, we will try another dataset: Car License Plate.

About the Dataset

Yo Craig! I found your car

How did I know about Craig's Car?
https://s3.amazonaws.com/katnoria.com/kb/clip/clip_np_craig.png

It can find BAD cars too 😃

Image

Can it find a real license plate though?

https://s3.amazonaws.com/katnoria.com/kb/clip/np_486.png

In my tests, I found that it can find numbers but only when they’re clearly visible.

But still.

It is quite cool to see CLIP perform well even though it was not trained on this dataset.

Query | True Image
https://s3.amazonaws.com/katnoria.com/kb/clip/np_30461c.png | https://s3.amazonaws.com/katnoria.com/kb/clip/np_true_30461c.png

Q: Does it understand colors (and possibly more)?

We continue with the same dataset, Car License Plate, and search for colors and even the make.

Lo and Behold. It knows about the make 🤯

https://s3.amazonaws.com/katnoria.com/kb/clip/clip_np_red_ferrari.png
https://s3.amazonaws.com/katnoria.com/kb/clip/clip_np_blue_audi.png

Any Teslas in the house?

https://s3.amazonaws.com/katnoria.com/kb/clip/np_tesla.png

At this point, I was tempted to look for Elon Musk and see what we find.

https://s3.amazonaws.com/katnoria.com/kb/clip/np_elon_musk.png

Okay, that was too ambitious and absurd. Let's dial it back.

Surely the dataset will have some vintage cars, so we try to find out.

https://s3.amazonaws.com/katnoria.com/kb/clip/np_vintage.png

Q: Does it understand everyday objects and more?

We will use the 2007 PASCAL VOC dataset and look for certain objects.

About the Dataset

We find that CLIP has some ability to detect activities.

https://s3.amazonaws.com/katnoria.com/kb/clip/voc_kids-playing.png

It can associate the activity (riding) with the actor (man) and the subject (bicycle)

https://s3.amazonaws.com/katnoria.com/kb/clip/voc_bicycle.png

It can differentiate between a single person and many people doing a certain activity

https://s3.amazonaws.com/katnoria.com/kb/clip/voc_few_bicycle.png

It can also figure out what countryside looks like

https://s3.amazonaws.com/katnoria.com/kb/clip/voc_cycling_countryside.png

and otherwise a street

https://s3.amazonaws.com/katnoria.com/kb/clip/voc_cycling_busystreet.png

It knows about animals

https://s3.amazonaws.com/katnoria.com/kb/clip/voc_man-cow.png

and everyday items.

https://s3.amazonaws.com/katnoria.com/kb/clip/voc_bow-chopsticks.png

https://s3.amazonaws.com/katnoria.com/kb/clip/voc_glass-dining.png

It knows whether the bird is just flying

https://s3.amazonaws.com/katnoria.com/kb/clip/voc_bird_flying.png

or flying near the trees.

https://s3.amazonaws.com/katnoria.com/kb/clip/voc_bird_trees.png

It knows whether the plane is landing

https://s3.amazonaws.com/katnoria.com/kb/clip/voc_fighter_land.png

or mid-air.

https://s3.amazonaws.com/katnoria.com/kb/clip/voc_figher_midair.png

It knows the difference between a fighter jet and a passenger plane

https://s3.amazonaws.com/katnoria.com/kb/clip/voc_bird_trees.png

It knows about cats and dogs

https://s3.amazonaws.com/katnoria.com/kb/clip/voc_cat_dog.png

and that Totoro is associated with a cat rather than a dog or anything else.

https://s3.amazonaws.com/katnoria.com/kb/clip/voc_totoro.png

And where to find a vampire?

https://s3.amazonaws.com/katnoria.com/kb/clip/voc_vampire.png

How good is it at finding food items?

Dataset: Food101

About the Dataset

Yummy pizza on the plate

https://s3.amazonaws.com/katnoria.com/kb/clip/f_pizza_plate.png

Pizza in the box

https://s3.amazonaws.com/katnoria.com/kb/clip/f_pizza_box.png

This is a food dataset and I am a fan of samosa, the Indian snack. Let's give it a try.

https://s3.amazonaws.com/katnoria.com/kb/clip/f_samosa.png

And at this point, CLIP is like Supa Hot Fire and I am like the guy who runs across 🤯🤯

https://s3.amazonaws.com/katnoria.com/kb/clip/tenor.gif

Does it know the difference between Rasgulla and Gulab Jamun?

https://s3.amazonaws.com/katnoria.com/kb/clip/f_rasgulla.png
https://s3.amazonaws.com/katnoria.com/kb/clip/f_gulab.png

CLIP was trained on 400 million (image, text) pairs. The data was probably diverse enough to capture details such as the difference between Samosa, Rasgulla, and Gulab Jamun (an Indian snack and two desserts).

Penne Pasta or Spaghetti?

https://s3.amazonaws.com/katnoria.com/kb/clip/f_penne.png
https://s3.amazonaws.com/katnoria.com/kb/clip/f_spaghetti.png

It sure does. It is time to move on to the other questions.


Q: Can it recognize (famous) people?

Dataset: Labeled Faces in the Wild

About LFW Dataset

https://s3.amazonaws.com/katnoria.com/kb/clip/lfw_condoleeza.png
https://s3.amazonaws.com/katnoria.com/kb/clip/lfw_ali.png

https://s3.amazonaws.com/katnoria.com/kb/clip/lfw_salmahayek.png
https://s3.amazonaws.com/katnoria.com/kb/clip/lfw_antonio_banderas.png

https://s3.amazonaws.com/katnoria.com/kb/clip/lfw_jkrowling.png
https://s3.amazonaws.com/katnoria.com/kb/clip/lfw_ben_kingsley.png

https://s3.amazonaws.com/katnoria.com/kb/clip/lfw_sugiyama-wrong.png
https://s3.amazonaws.com/katnoria.com/kb/clip/lwf_aung_san_suu_kyi.png
https://s3.amazonaws.com/katnoria.com/kb/clip/lfw_arafat.png

https://s3.amazonaws.com/katnoria.com/kb/clip/lfw_monica_bellucci.png

LFW only has a single image of Sachin Tendulkar and the model found it

https://s3.amazonaws.com/katnoria.com/kb/clip/lfw_sachin.png

There is only a single image of RDJ and we see it in the results

https://s3.amazonaws.com/katnoria.com/kb/clip/lfw_robert_downey.png


Q: I bet it does not know much about the Lego MiniFigures?

Dataset: Lego MiniFigures

About the Dataset

https://s3.amazonaws.com/katnoria.com/kb/clip/lego_mandalorian.png
https://s3.amazonaws.com/katnoria.com/kb/clip/lego_darthvader.png
https://s3.amazonaws.com/katnoria.com/kb/clip/lego_r2d2.png

https://s3.amazonaws.com/katnoria.com/kb/clip/lego_harrypotter.png
https://s3.amazonaws.com/katnoria.com/kb/clip/lego_spiderman.png
https://s3.amazonaws.com/katnoria.com/kb/clip/lego_ironman.png


I will keep updating this post with what I find

To be continued……

