Siamese cats are known for their unique appearance, including their slender bodies, triangular faces, and large blue eyes. These adorable Thai fur-balls share something unique with one of the most interesting AI models used in facial recognition — they’re both quick studies.
Whenever we’re choosing or designing deep learning networks, we often start by considering the nuances of the task we expect the model to perform. We do this to minimize the amount of computation and to learn more efficiently. A Siamese kitty can very quickly tell whether a new type of kibble is the same flavor as its preferred fish-flavored kibble.
A similar task presents itself in facial recognition. We’re often presented with the task of determining if the current face is already known or if it’s a new face. Imagine a security system that relied on this type of AI to let people into a building. If the model is too slow, verified patrons would certainly become annoyed waiting to be let in.
That’s where the Siamese neural network comes in. Similar to the Siamese cat breed, Siamese neural networks have a unique structure, in which two or more identical neural networks are used to process separate inputs and compare their outputs. This kind of network is adept at learning more quickly, and from fewer examples, than conventional networks.
What are Siamese Networks?
The Siamese network was first introduced in the early 1990s by Bromley, LeCun, and colleagues for signature verification (Bromley et al., 1993). A Siamese neural network is a type of network architecture in which:
- two or more identical sub-networks process separate inputs
- the outputs are compared using a similarity measure
- the similarity measure is used to make a prediction
Siamese networks are useful in tasks where a comparison needs to be made between two inputs, such as signature verification, where the goal is to determine whether two signature images were made by the same person. They are also used in one-shot learning, where the goal is to identify a new object based on one or a few examples of that object. In facial recognition, for example, a Siamese network would compare two face images and predict whether they are of the same person.
The weights of the sub-networks are typically shared, meaning that the same filters and weights are applied to both inputs, ensuring that the representations are generated in a comparable way. These representations are what we call feature embeddings. This allows the network to learn meaningful comparisons between inputs and make accurate predictions.
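As a rough illustration of the three ingredients listed above (shared sub-network, similarity measure, prediction), here is a minimal pure-Python sketch. The weight values, the `embed` function, and the `threshold` are all made up for illustration; a real Siamese network learns its weights from data.

```python
import math

# Hypothetical shared weights: ONE tiny linear layer reused for BOTH inputs.
SHARED_WEIGHTS = [
    [0.5, -0.2, 0.1],
    [0.3, 0.8, -0.5],
]

def embed(x):
    """Map an input vector to a feature embedding using the shared weights."""
    return [sum(w * v for w, v in zip(row, x)) for row in SHARED_WEIGHTS]

def euclidean_distance(a, b):
    """Similarity measure: a smaller distance means more similar embeddings."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def same_identity(x1, x2, threshold=0.5):
    """Predict a match when the two embeddings are close together."""
    return euclidean_distance(embed(x1), embed(x2)) < threshold
```

Because both inputs pass through the same `embed` function, identical inputs always land on the same embedding, which is what makes the distance between embeddings meaningful.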
Pros and Cons of Siamese Networks
Siamese networks are primarily used for tasks that compare something new with something already known.
Pros of Siamese Networks:
- One-shot learning: Siamese networks are particularly well-suited for one-shot learning, where the goal is to identify a new object based on one or a few examples of that object.
- Improved feature representation: Siamese networks can learn rich and meaningful representations of inputs, as the sub-networks are trained to generate comparable output representations.
- Improved performance for small datasets: Siamese networks can outperform other neural network architectures when working with small datasets.
Cons of Siamese Networks:
- Complexity: Siamese networks can be more complex and difficult to design and train compared to other neural network architectures, due to the need to compare the outputs of two or more sub-networks.
- Computational overhead: Siamese networks may require more computational resources compared to other neural network architectures. Whether a Siamese network ends up more efficient than other networks typically depends on the size of the dataset and the volume of incoming comparisons.
- Limited applications: Siamese networks are only suitable for a limited range of applications, such as one-shot learning and facial recognition, where a comparison between two inputs is necessary. They may not be the best choice for other types of challenging tasks where a different type of neural network would be more appropriate.
Facial Recognition with Siamese Networks
Using PyTorch, we can implement a simple Siamese network for facial recognition of Avengers actors. The goal is to take in two randomly chosen face images and determine whether they show the same actor.
Dataset and Preprocessing the Dataset
We need to get our dataset of Avengers faces and do some pre-processing to make learning the faces easier for our model.
First, make an API token for Kaggle. On Kaggle’s website, go to “My Account”, scroll to the API section, and click “Create New API Token”. This downloads a kaggle.json file to your machine.
You’re then free to run this Google Colab notebook, following along with the descriptions below.
We request access to the Kaggle data repository by uploading the kaggle.json file.
Then we can download the dataset with a few simple commands. It is not a huge dataset, which is what makes a one-shot learning approach a good fit. You’ll now see the images in the file directory under ‘images/train’ and ‘images/test’, organized per Avengers actor.
Next, we’re going to wrap the images in a custom dataset class and a PyTorch dataloader, making it easy to iterate through the images. During this process, we convert each image to a tensor, resize it to a common image size, center crop the content, and normalize the pixels. This process makes it easier for the network to extract features.
If we take a random sample from the dataset, we see a pair of images. Here, the two images are both Scarlett Johansson, the same actor. The image on the left is input one and the image on the right is input two to the network. The correct label for this pair of inputs is “True”, or a value of 1. Another way to think of it is that this is a positive pair (the images match), as opposed to a negative pair, in which the images are dissimilar. Your random sample may be different.
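One way such labeled pairs might be drawn is sketched below. The `sample_pair` helper, the 50/50 split between positive and negative pairs, and the dictionary layout are all hypothetical; the notebook’s own dataset class may sample differently. The sketch assumes at least two actors and at least two images per actor.

```python
import random

def sample_pair(images_by_actor, rng=random):
    """Return (image_a, image_b, label): label 1 for the same actor, 0 otherwise."""
    actors = list(images_by_actor)
    if rng.random() < 0.5:
        # Positive pair: two different images of one actor.
        actor = rng.choice(actors)
        image_a, image_b = rng.sample(images_by_actor[actor], 2)
        return image_a, image_b, 1
    # Negative pair: one image each from two different actors.
    actor_a, actor_b = rng.sample(actors, 2)
    image_a = rng.choice(images_by_actor[actor_a])
    image_b = rng.choice(images_by_actor[actor_b])
    return image_a, image_b, 0
```

Here `images_by_actor` maps an actor name to a list of image paths, for example the files found in one actor’s folder under ‘images/train’.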
Neural Network Architecture
The Siamese network architecture consists of two or more identical sub-networks, which are used to process separate inputs and compare their outputs. These sub-networks are typically convolutional neural networks (CNNs), but they can be any type of neural network architecture.
The inputs to the sub-networks are typically images or feature vectors, and the outputs of the sub-networks are typically high-level features of the inputs. The sub-networks are trained together to generate comparable representations of the inputs through feature extraction, and the comparison of the representations is used to make a prediction or perform a classification task.
In our case of facial recognition, the inputs to the sub-networks are two images of faces, and the output is the result of comparing the feature vectors (i.e., representations) generated by the sub-networks to determine whether they show the same person.
It is important to note that the specific architecture of the sub-networks and the method of comparison between the outputs, AKA feature vectors, will depend on the specific requirements of the task, and different implementations of Siamese networks may vary in their details.
We begin again in our code by creating a model class, which describes the architecture of the Siamese neural network. We use a few convolutional layers to create a convolutional Siamese network adapted from existing implementations (see the references at the end). We want a convolutional approach because it creates higher-level, more abstract features, which are then fed into normalization layers and then the fully connected layer (AKA dense layer). Note that the effect of two networks can be achieved with a single network by doing a forward pass of each input separately. The loss, however, depends on the outputs of both forward passes. The diagram below shows how our images and label go through the model as it updates and learns. The following sections explain how the learning, or rather updating, is constructed.
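A model class along these lines might look like the sketch below. The layer sizes, the `embedding_dim` of 128, and the `forward_once` name are assumptions chosen for illustration, not the notebook’s exact architecture.

```python
import torch
import torch.nn as nn

class SiameseNetwork(nn.Module):
    """One convolutional sub-network whose weights are shared by both inputs."""

    def __init__(self, embedding_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.BatchNorm2d(16),            # normalization layer
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d((4, 4)),  # fixed 4x4 map for any input size
        )
        self.head = nn.Linear(32 * 4 * 4, embedding_dim)  # fully connected layer

    def forward_once(self, x):
        """Embed a single batch of images."""
        x = self.features(x)
        return self.head(torch.flatten(x, start_dim=1))

    def forward(self, x1, x2):
        # The same layers process both inputs, so the weights are shared.
        return self.forward_once(x1), self.forward_once(x2)
```

Calling `model(img1, img2)` returns two embeddings of shape `(batch, embedding_dim)`, one per input, ready to be compared by the loss.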
The Siamese loss function takes as input the representations generated by the sub-networks for a set of inputs, which may consist of an image pair or image triplet. The loss function calculates a similarity or dissimilarity score between the representations using a similarity function, and training updates the weights of the sub-networks to minimize the resulting loss.
For example, in the case of the contrastive loss function, the score is the Euclidean distance between the representations of the two inputs, that is, their feature embeddings. If the inputs are similar, the goal is to minimize the distance between the representations, so that the representations are similar. If the inputs are dissimilar, the goal is to increase the distance between the representations, typically until it exceeds a margin, so that the representations are dissimilar.
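One common way to write the contrastive loss is given below, using this article’s convention that the label y is 1 for a matching pair and 0 otherwise. The symbols f (the shared sub-network), D (the distance), and m (the margin, a hyperparameter) are notation introduced here, and the notebook’s exact formula may differ, for example in scaling or in label convention.

```latex
D = \lVert f(x_1) - f(x_2) \rVert_2
\qquad
\mathcal{L}(y, D) = y \, D^2 + (1 - y) \, \max(0,\; m - D)^2
```

For a matching pair (y = 1) only the first term is active, so the loss shrinks as D shrinks. For a non-matching pair (y = 0) only the second term is active, and it drops to zero once D is at least m.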
Here, we use contrastive loss, a popular choice, to measure how similar the two input faces are. What we are actually doing is measuring the distance between the feature vectors that each image produces after its forward pass through the Siamese network. The loss then compares that distance against the true label: it penalizes large distances for matching pairs and small distances for non-matching pairs.
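In PyTorch, a contrastive loss of this kind could be sketched as follows. The class name, the default `margin` of 1.0, and the convention that label 1 means “same person” are assumptions made to match the text above, not necessarily what the notebook uses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveLoss(nn.Module):
    """Contrastive loss where label 1 = same person and label 0 = different."""

    def __init__(self, margin=1.0):
        super().__init__()
        self.margin = margin

    def forward(self, emb1, emb2, label):
        # Euclidean distance between the two embeddings, one value per pair.
        distance = F.pairwise_distance(emb1, emb2)
        # Matching pairs are penalized for being far apart.
        same = label * distance.pow(2)
        # Non-matching pairs are penalized only while closer than the margin.
        different = (1 - label) * torch.clamp(self.margin - distance, min=0).pow(2)
        return (same + different).mean()
```

The `label` tensor is expected to be a float tensor with one entry per pair in the batch.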
Training the Network
In each iteration of training, the loss function is calculated for a batch of inputs and the gradients of the loss function with respect to the weights of the sub-networks are computed. These gradients are then used to update the weights of the sub-networks using an optimization algorithm, such as stochastic gradient descent. The process of updating the weights is repeated until the loss function reaches a minimum or a stopping criterion is reached.
By minimizing the loss function, the sub-networks are trained to generate comparable representations of inputs, and the comparison of the representations can be used to make a prediction or perform a classification task.
In our case, minimizing the loss pushes the representations of two images closer together when the faces belong to the same person.
We begin the training process with a custom training loop that iterates through the dataset using our training dataloader. In each iteration, we give the Siamese network two face images, and the model generates a feature representation of each image separately. Up to this point, we have only performed the forward pass. We then compute the loss and call loss.backward(), which backpropagates through the network to compute the gradient of the loss with respect to each weight, and optimizer.step(), which updates the weights according to our optimizer. This is what we consider the backward pass. Because PyTorch accumulates gradients across backward passes, we clear them with .zero_grad() before each new backward pass.
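The loop described above might be sketched like this. To keep the example self-contained it uses a tiny stand-in model, random tensors in place of face images, and a compact contrastive loss; those pieces, along with the `train_one_epoch` name, the Adam optimizer, and the learning rate, are assumptions rather than the notebook’s actual setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

def contrastive_loss(emb1, emb2, label, margin=1.0):
    # Compact stand-in loss (label 1 = same person), assumed for this sketch.
    d = F.pairwise_distance(emb1, emb2)
    return (label * d.pow(2) + (1 - label) * torch.clamp(margin - d, min=0).pow(2)).mean()

def train_one_epoch(model, loader, optimizer):
    """Run one pass over the pair loader and return the mean batch loss."""
    model.train()
    total = 0.0
    for img1, img2, label in loader:
        optimizer.zero_grad()                  # clear gradients from the last step
        emb1, emb2 = model(img1), model(img2)  # forward pass with shared weights
        loss = contrastive_loss(emb1, emb2, label)
        loss.backward()                        # backward pass: compute gradients
        optimizer.step()                       # update the weights
        total += loss.item()
    return total / len(loader)

# Stand-in model and random data so the loop can run end to end.
torch.manual_seed(0)
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 16))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
pairs = TensorDataset(
    torch.randn(32, 3, 8, 8),              # first image of each pair
    torch.randn(32, 3, 8, 8),              # second image of each pair
    torch.randint(0, 2, (32,)).float(),    # 1 = same person, 0 = different
)
loader = DataLoader(pairs, batch_size=8, shuffle=True)
losses = [train_one_epoch(model, loader, optimizer) for _ in range(3)]
```

Recording one loss value per epoch, as `losses` does here, is what makes it possible to plot the training curve afterwards.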
To determine whether training is successful, we want to see the loss decrease steadily over time. The plots below show the training loss starting to converge early in training.
Training loss for 30 epochs:
Training loss for first 5 epochs:
Testing the Model
Testing is similar to training, except that no backward pass is made and the inputs have not been seen during training. This measures how good our Siamese net is at applying what it has learned to the same task with different inputs (from the same data distribution, of course).
It is important to keep in mind that the performance of the network may be affected by various factors, such as the quality and size of the training data, the choice of architecture and loss function, and the choice of optimization algorithm. Therefore, it may be necessary to iteratively experiment with different hyperparameters and network architectures to find the best configuration for the task at hand.
We run the same steps as in training, but set our model to evaluation mode and remove the backward pass. We also print the Euclidean distance for each pair to see how distance relates to accuracy, which is the core idea of our loss function.
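An evaluation pass of this kind could be sketched as follows. The `evaluate` helper and the distance `threshold` of 0.5 used to turn a distance into a same/different prediction are assumptions; in practice the threshold would be tuned on held-out pairs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()  # no gradients are tracked, so no backward pass is possible
def evaluate(model, batches, threshold=0.5):
    """Return (accuracy, distances) over batches of (img1, img2, label)."""
    model.eval()  # evaluation mode: fixes batch-norm statistics, disables dropout
    correct, total, distances = 0, 0, []
    for img1, img2, label in batches:
        d = F.pairwise_distance(model(img1), model(img2))
        predicted_same = (d < threshold).float()  # small distance -> same person
        correct += (predicted_same == label).sum().item()
        total += label.numel()
        distances.extend(d.tolist())
    return correct / total, distances
```

Returning the raw distances alongside the accuracy makes it easy to print them per pair and check that matching faces really do sit closer together.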
What we see is that our model performs well on most test images, and the distance is smaller when both images show the same actor.
To increase the accuracy of your model in this example, you can experiment with different hyperparameters, try different pre-processing techniques, alter the number and types of layers, try a triplet loss function (or less conventional losses), add custom layers, or fine-tune the weights, among much else, as long as you keep the three requirements of a Siamese network described in the sections above.
In conclusion, Siamese networks have shown promise as a tool for facial recognition tasks. The ability of Siamese networks to compare two inputs and generate meaningful representations of these inputs has been effectively utilized in the context of facial recognition, where the goal is to identify whether two images depict the same person. The results of previous studies demonstrate the potential of Siamese networks to perform well in one-shot image recognition, where only a few examples of a face are available for recognition.
adityajn105. “GitHub – Adityajn105/Face-Recognition-Siamese-Network: A Face Recognition Siamese Network Implemented Using Keras. Siamese Network Is Used for One Shot Learning Which Do Not Require Extensive Training Samples for Image Recognition.” GitHub, https://github.com/adityajn105/Face-Recognition-Siamese-Network. Accessed 7 Feb. 2023.
Google Colaboratory. https://colab.research.google.com/drive/1wKDjfFoI30rhILO7X7l0iIs-8ZO0-zrz?usp=sharing. Accessed 7 Feb. 2023.
Bromley, Jane, et al. “Signature Verification Using a ‘Siamese’ Time Delay Neural Network.” Advances in Neural Information Processing Systems, vol. 6. Accessed 7 Feb. 2023.