Face recognition pipeline with BlazeFace and ArcFace-MobileNet
In this example we build a modern face recognition pipeline that can identify the same person in different pictures. A modern face recognition pipeline typically consists of these steps:
- Face detection
- Face alignment
- Feature extraction
- Feature matching
In the face detection step, the algorithm tries to find all faces in the given picture. In this example we will use the BlazeFace neural network to perform the detection task. At the end of this step, each face is cut out into its own image.
The next steps are performed on each face individually. The faces in the original picture may be tilted or have different sizes depending on their distance from the camera. To improve recognition performance, each face is straightened and resized to the same size.
In the feature extraction step, the input face is mapped into a vectorial representation called an embedding. In this example we will use a MobileNet model that has been fine-tuned with the ArcFace loss function for high accuracy. The model was taken from this Keras ArcFace implementation and converted to the TensorFlow.js format.
In the feature matching step, the vectorial representation of the input face is compared to the vectorial representation of other faces. If the representations have a high "similarity" it is likely that the faces belong to the same person.
As a first step we will import the TensorFlow.js library and the official BlazeFace TensorFlow model.
Next, the weights of the BlazeFace model are initialized.
The first step of the face recognition pipeline is to find and extract all faces in the source image. In this example we will focus on just one face. So let's display the image we want to work with.
Even though the face is clearly visible, the current image is not suitable as input for the feature extraction model. First, the face occupies only a small part of the image; second, the face is tilted. So let's use the BlazeFace model to detect the face and get its position.
The BlazeFace model has located the face in the image. Now we need to prepare it for the feature extraction model. The feature extraction model requires an image of 112 × 112 pixels. However, since we might need to rotate the picture, we will make a larger cutout, which we will crop to the right size after rotation. Otherwise the rotation might truncate some edges and leave black areas behind.
After calculating the required number of pixels, we create the corresponding square cutout.
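The size calculation can be sketched as follows. This is a minimal illustration, not the notebook's original code: the helper name `squareCutout` is hypothetical, and it assumes the bounding box is given as `topLeft` and `bottomRight` pixel coordinates. It relies on the geometric fact that a square of side s, rotated by any angle around its center, still fits inside a concentric square of side s·√2.

```javascript
// Hypothetical helper (not from the original notebook): compute a square
// cutout, centred on the face, that is large enough to be rotated by any
// angle and still contain the full face-sized square afterwards.
function squareCutout(topLeft, bottomRight) {
  const [x1, y1] = topLeft;
  const [x2, y2] = bottomRight;

  // Side length of the square we want to keep after rotation.
  const faceSize = Math.max(x2 - x1, y2 - y1);

  // A square of side s fits inside any rotation of a concentric square
  // of side s * sqrt(2), so that is the side length we cut out.
  const size = Math.ceil(faceSize * Math.SQRT2);

  const centerX = (x1 + x2) / 2;
  const centerY = (y1 + y2) / 2;
  return {
    left: Math.round(centerX - size / 2),
    top: Math.round(centerY - size / 2),
    size,
    faceSize,
  };
}

// Example with made-up coordinates for a 100 x 100 pixel bounding box.
const cutout = squareCutout([100, 100], [200, 200]);
console.log(cutout);
```

The sketch does not clamp the cutout to the image borders; for faces near the edge, the cutout would have to be padded or shifted.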
Let's have a look at what BlazeFace found.
In the next step we straighten the image. In addition to the bounding box, the BlazeFace model provides us with "landmarks" of the faces that it finds. The landmarks are: left eye, right eye, nose, mouth, left ear, right ear. To straighten the image we will use the positions of the eyes to calculate the angle by which the image has to be rotated for it to be straight.
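The angle calculation from the two eye positions can be sketched like this. The helper name `eyeLineAngle` and the sample coordinates are hypothetical; the sketch only assumes that each landmark is an `[x, y]` pair in image coordinates, where y grows downwards.

```javascript
// Hypothetical helper: angle (in radians) of the line through both eyes,
// measured against the horizontal axis in image coordinates (y points down).
function eyeLineAngle(eyeA, eyeB) {
  // Sort by x so the result does not depend on which eye is passed first.
  const [first, second] = eyeA[0] <= eyeB[0] ? [eyeA, eyeB] : [eyeB, eyeA];
  return Math.atan2(second[1] - first[1], second[0] - first[0]);
}

// Example with made-up eye positions.
const angle = eyeLineAngle([120, 150], [180, 170]);
console.log((angle * 180) / Math.PI); // tilt in degrees
```

An angle of zero means the eyes are already level. Whether the image has to be rotated by this angle or by its negative depends on the sign convention of the rotation function that is used, so that is worth checking on a test image.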
After we have rotated the image, we can crop it to the 112 × 112 pixel size we need.
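The position of such a centered crop can be computed as in the following sketch. The helper name `centerCropBox` is hypothetical, and the sketch assumes the rotated cutout has already been scaled so that its central region has the target size.

```javascript
// Hypothetical helper: position of a target x target window centred in a
// width x height image.
function centerCropBox(width, height, target) {
  if (target > width || target > height) {
    throw new RangeError("target is larger than the image");
  }
  return {
    left: Math.floor((width - target) / 2),
    top: Math.floor((height - target) / 2),
    width: target,
    height: target,
  };
}

// Example: crop the central 112 x 112 region out of a 160 x 160 cutout.
console.log(centerCropBox(160, 160, 112));
```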
Let's display the rotated image. It is not perfectly straight, but it is good enough for our purposes.
Feature extraction is the task of creating a vectorial representation, called an "embedding", of an image. In the image the content is stored by the spatial arrangement of the pixels. An embedding of an image stores the content with more mathematical "structure". The embedding can be thought of as a mathematical vector that can be compared to other vectors. How exactly the content is "encoded" in the embedding is captured by the weights of the neural network that computes the embedding. The "encoding" can typically not be fully understood by humans.
In this example we will use the MobileNet neural network that has been fine-tuned for face recognition tasks with the ArcFace loss function to compute the embedding for our image. The model was taken from this Keras ArcFace implementation and converted to the TensorFlow.js format.
In the next cell we load the pretrained model.
Let's use the model to compute the embedding for our face.
As mentioned before, it is difficult to interpret the embedding. However, we can develop an intuition by comparing the embedding against embeddings of other images. This is what we will do in the next step.
Now that we have computed the embedding representation for our image, we can see how it compares to other embeddings. The idea is that different images of the same face form clusters once we plot them in the embedding space. The following image illustrates this. We can see that images of the same person appear "close" to each other.
If we now want to find the person inside of an image, we compare its embedding against the embeddings of images we have already collected. If the embedding is "similar" to an embedding of an existing person we can classify them as the same person.
Let's try it out for our example. Since we haven't collected any embeddings beforehand we will compute some embeddings for other images. Let's first look at another image of the same person.
In the next cell we will do all the steps that we did for the first image and compute the embedding at the end.
Let's have a look at what the image that we pass to the feature extractor looks like.
Now we can compare the embeddings of both images. There are different metrics that can be used to assess the similarity of the embeddings. The most common are the cosine similarity and the Euclidean distance. Both have their strengths and weaknesses. Let us use the cosine similarity to compare both embeddings.
The value on its own is not very meaningful; it has to be compared to other values. Let's have a look at another image to see the difference.
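Both metrics can be written in a few lines of plain JavaScript. This is a minimal sketch that assumes the embeddings are available as plain arrays of numbers of equal length; the function names are illustrative and not taken from the original notebook.

```javascript
// Cosine similarity: 1 for vectors pointing in the same direction,
// 0 for orthogonal vectors, -1 for opposite directions.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new RangeError("embeddings must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Euclidean distance: 0 for identical vectors, larger for dissimilar ones.
function euclideanDistance(a, b) {
  if (a.length !== b.length) {
    throw new RangeError("embeddings must have the same length");
  }
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const diff = a[i] - b[i];
    sum += diff * diff;
  }
  return Math.sqrt(sum);
}

// Toy vectors standing in for real embeddings.
console.log(cosineSimilarity([1, 2, 3], [2, 4, 6]));
console.log(euclideanDistance([0, 0], [3, 4]));
```

Note the opposite orientation of the two metrics: a higher cosine similarity means more similar faces, whereas a higher Euclidean distance means less similar faces.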
Again, we will compute the embedding as before.
The cropped and rotated image looks as follows.
And finally let's compute the cosine similarity of the last image and our initial image.
The score is much lower, which indicates that the two images show different people. Consequently, we would classify the person in our initial image as the same person as the one in the first image we compared it against.
It is not easy to say which similarity values indicate the same person. Here is a guide that helps to adjust the threshold value for face recognition.
Typically the embeddings of existing users are stored in a database in the back-end, and you will need a search algorithm, such as a nearest-neighbor search, to find the stored embedding that best matches a new one.
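For a small number of users, such a lookup can be sketched as a brute-force scan over all stored embeddings. Everything in this sketch is an assumption made for illustration: the function name `findBestMatch`, the shape of the database entries, the three-dimensional toy embeddings, and the threshold of 0.5, which would have to be tuned for the actual model.

```javascript
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical helper: return the stored entry most similar to the query,
// or null if even the best score stays below the threshold.
function findBestMatch(query, database, threshold) {
  let best = null;
  for (const entry of database) {
    const score = cosineSimilarity(query, entry.embedding);
    if (best === null || score > best.score) {
      best = { name: entry.name, score };
    }
  }
  return best !== null && best.score >= threshold ? best : null;
}

// Toy database with made-up names and embeddings.
const database = [
  { name: "alice", embedding: [1, 0, 0] },
  { name: "bob", embedding: [0, 1, 0] },
];

console.log(findBestMatch([0.9, 0.1, 0], database, 0.5));
console.log(findBestMatch([0, 0, 1], database, 0.5));
```

A linear scan like this compares the query against every stored embedding, so its cost grows with the number of users; larger databases usually rely on an index built for nearest-neighbor search instead.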