Introduction

Last month OpenAI released CLIP, a neural network that learns to map text and images into the same embedding space using a contrastive objective and a multi-class N-pair loss.

TL;DR

The key ingredients of the architecture are:

Swappable image and text encoders
Multi-class N-pair loss
Contrastive objective, which simplifies the task of learning
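To make the last two ingredients concrete, here is a minimal NumPy sketch of a symmetric contrastive loss over a batch of N matching (image, text) pairs. It is an illustration only, not the authors' training code: the function names are made up for this post, the embeddings are assumed to come from whatever image and text encoders you plug in, and the fixed temperature value is an assumption.

```python
import numpy as np

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def l2_normalize(x):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over an N x N similarity matrix.

    Row i of image_emb and row i of text_emb are assumed to be a matching pair;
    every other row in the batch acts as a negative.
    The temperature is a fixed, illustrative value (an assumption of this sketch).
    """
    image_emb = l2_normalize(image_emb)
    text_emb = l2_normalize(text_emb)
    logits = image_emb @ text_emb.T / temperature  # shape (N, N)
    diag = np.arange(logits.shape[0])
    # Image -> text: each row should assign most probability to its own column.
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    # Text -> image: each column should assign most probability to its own row.
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return (loss_i2t + loss_t2i) / 2
```

The loss is small when each image is most similar to its own caption and grows when the pairs are mixed up, which is what pushes matching text and images close together in the shared space.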

So, what can it do?

The authors note that:

CLIP (Contrastive Language–Image Pre-training) can be applied to any visual classification benchmark by simply providing the names of the visual categories to be recognized, similar to the “zero-shot” capabilities of GPT-2 and 3.
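As a rough sketch of what "providing the names of the visual categories" means in practice: each category name is turned into a text prompt, embedded, and compared with the image embedding. The snippet below assumes the embeddings have already been computed by CLIP's encoders; the toy vectors, the `zero_shot_probs` helper, and the `scale` value are stand-ins for illustration.

```python
import numpy as np

def zero_shot_probs(image_emb, label_embs, scale=100.0):
    """Softmax over cosine similarities between one image and each label prompt.

    scale is an illustrative value that sharpens the softmax (an assumption here).
    """
    image_emb = image_emb / np.linalg.norm(image_emb)
    label_embs = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    logits = scale * (label_embs @ image_emb)  # shape (num_labels,)
    logits = logits - logits.max()             # for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Category names written as prompts; no training on the target dataset is needed.
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

# Toy stand-ins for the embeddings the text and image encoders would produce.
label_embs = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0],
                       [0.0, 0.0, 1.0]])
image_emb = np.array([0.9, 0.1, 0.0])

probs = zero_shot_probs(image_emb, label_embs)
predicted = labels[int(np.argmax(probs))]
print(predicted)
```

Swapping in a different benchmark only means changing the list of label prompts, which is why the same model can be pointed at many classification tasks.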

Let us take it for a spin.

(Note: This page loads a large number of images)

Q: How well does it do OCR?

We start with the "hello world" dataset of neural networks: MNIST, which consists of images of handwritten digits. We want to see if CLIP can identify the digits.

Hello World of Computer Vision

https://s3.amazonaws.com/katnoria.com/kb/clip/mnist.gif

In our queries, we try both the numbers as well as words and find that:

CLIP has a hard time finding most numbers
Queries in words performed slightly better

Query (Numeric)	Result	Query (Word)	Result
0		zero
1		one
2		two
3		three
4		four
5		five
6		six
7		seven
8		eight
9		nine

MNIST turned out to be a bit too hard. The authors mention this in their paper too.

Next, we will try another dataset: Car License Plate.
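For readers wondering how a query like "7" or "seven" turns into a result image, here is a minimal sketch of text-to-image search, assuming every image in the dataset and the query text have already been embedded with CLIP. The `top_k_images` helper and the toy vectors are hypothetical; the numeric and word variants differ only in the string passed to the text encoder.

```python
import numpy as np

def top_k_images(query_emb, image_embs, k=3):
    """Return the indices and scores of the k images most similar to the query."""
    query_emb = query_emb / np.linalg.norm(query_emb)
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    scores = image_embs @ query_emb  # cosine similarity per image
    order = np.argsort(-scores)[:k]  # highest score first
    return order, scores[order]

# Toy stand-ins: four image embeddings and one text-query embedding.
image_embs = np.array([[1.0, 0.0],
                       [0.8, 0.6],
                       [0.0, 1.0],
                       [-1.0, 0.0]])
query_emb = np.array([1.0, 0.1])

indices, scores = top_k_images(query_emb, image_embs, k=2)
print(indices, scores)
```

The image embeddings only need to be computed once per dataset; each new query then costs one text encoding and one matrix-vector product.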

Yo Craig! I found your car

How did I know about Craig's Car?

https://s3.amazonaws.com/katnoria.com/kb/clip/clip_np_craig.png

It can find BAD cars too 😃

Can it find a real license plate though?

https://s3.amazonaws.com/katnoria.com/kb/clip/np_486.png

In my tests, I found that it can find numbers but only when they’re clearly visible.

But still.

It is quite cool to see CLIP perform well even though it was not trained on this dataset.

Q: Does it understand colors (and possibly more)?

We continue with the same dataset, Car License Plate, and search for colors and even the make.

Lo and Behold. It knows about the make 🤯

https://s3.amazonaws.com/katnoria.com/kb/clip/clip_np_red_ferrari.png

https://s3.amazonaws.com/katnoria.com/kb/clip/clip_np_blue_audi.png

Any Teslas in the house?

https://s3.amazonaws.com/katnoria.com/kb/clip/np_tesla.png

At this point, I was tempted to look for Elon Musk and see what we find.

https://s3.amazonaws.com/katnoria.com/kb/clip/np_elon_musk.png

Okay, that was too ambitious and absurd. Let's dial it back.

Surely the dataset will have some vintage cars, so we try to find out.

https://s3.amazonaws.com/katnoria.com/kb/clip/np_vintage.png

Q: Does it understand everyday objects and more?

We will use the 2007 PASCAL VOC dataset and look for certain objects.

We find that CLIP has some ability to recognize activities.

https://s3.amazonaws.com/katnoria.com/kb/clip/voc_kids-playing.png

It can associate the activity (riding) with the actor (man) and the object (bicycle).

https://s3.amazonaws.com/katnoria.com/kb/clip/voc_bicycle.png

It can differentiate between a single person and many people doing a certain activity.

https://s3.amazonaws.com/katnoria.com/kb/clip/voc_few_bicycle.png

It can also figure out what counts as countryside

https://s3.amazonaws.com/katnoria.com/kb/clip/voc_cycling_countryside.png

and what is a street.

https://s3.amazonaws.com/katnoria.com/kb/clip/voc_cycling_busystreet.png

It knows about animals

https://s3.amazonaws.com/katnoria.com/kb/clip/voc_man-cow.png

and everyday items.

https://s3.amazonaws.com/katnoria.com/kb/clip/voc_bow-chopsticks.png

https://s3.amazonaws.com/katnoria.com/kb/clip/voc_glass-dining.png

It knows whether the bird is just flying

https://s3.amazonaws.com/katnoria.com/kb/clip/voc_bird_flying.png

or flying near the trees.

https://s3.amazonaws.com/katnoria.com/kb/clip/voc_bird_trees.png

It knows whether the plane is landing

https://s3.amazonaws.com/katnoria.com/kb/clip/voc_fighter_land.png

or mid-air.

https://s3.amazonaws.com/katnoria.com/kb/clip/voc_figher_midair.png

It also knows the difference between a fighter jet and a passenger plane.

It knows about cats and dogs

https://s3.amazonaws.com/katnoria.com/kb/clip/voc_cat_dog.png

and that Totoro is associated with a cat rather than a dog or anything else.

https://s3.amazonaws.com/katnoria.com/kb/clip/voc_totoro.png

And where would we find a vampire?

https://s3.amazonaws.com/katnoria.com/kb/clip/voc_vampire.png

Q: How good is it at finding food items?

Dataset: Food101

Yummy pizza on the plate

https://s3.amazonaws.com/katnoria.com/kb/clip/f_pizza_plate.png

Pizza in the box

https://s3.amazonaws.com/katnoria.com/kb/clip/f_pizza_box.png

This is a food dataset, and I am a fan of samosa, the Indian snack. Let's give it a try.

https://s3.amazonaws.com/katnoria.com/kb/clip/f_samosa.png

And at this point, CLIP is like Supa Hot Fire and I am like the guy who runs across 🤯🤯

https://s3.amazonaws.com/katnoria.com/kb/clip/tenor.gif

Does it know the difference between Rasgulla and Gulab Jamun?

https://s3.amazonaws.com/katnoria.com/kb/clip/f_rasgulla.png

https://s3.amazonaws.com/katnoria.com/kb/clip/f_gulab.png

CLIP was trained on 400 million (image, text) pairs. That data was probably diverse enough to capture details such as the difference between Samosa, Rasgulla, and Gulab Jamun: an Indian snack and two Indian desserts.

Penne Pasta or Spaghetti?