Understanding and Implementing Loss Functions in PyTorch and Their Role in Machine Learning

Introduction to PyTorch Loss Functions and Machine Learning


PyTorch is an open-source deep learning framework known for its flexibility and ease of use. This is enabled in part by its compatibility with Python, the popular high-level programming language favored by machine learning developers, data scientists, and deep learning practitioners.

PyTorch is a fully featured framework for building deep learning models, a type of machine learning that is commonly used in applications like image recognition and language processing. With its Python interface, it is relatively easy for most machine learning developers to learn and use. PyTorch is distinctive for its excellent support for GPUs and its use of reverse-mode automatic differentiation, which enables computation graphs to be modified on the fly.

This makes it a popular choice for fast experimentation and prototyping. PyTorch is the work of developers at Facebook AI Research and several other labs. The framework combines the efficient and flexible GPU-accelerated back end libraries from Torch with an intuitive Python front end that focuses on rapid prototyping, readable code, and support for the widest possible variety of deep learning models.

PyTorch lets developers use the familiar imperative programming approach, but still output to graphs. It was released to open source in 2017, and its Python roots have made it a favorite with machine learning developers. Now that we have a general understanding of what PyTorch is and what it is used for, let's look at PyTorch's common loss functions in particular.

What Are Loss Functions in PyTorch

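Before looking at some common loss functions in PyTorch, we first need to know what loss functions are. A loss function is a mathematical function used to determine how well a machine learning model is performing on a data set: it compares the model's predictions with the actual target values and returns a number that grows as the predictions get worse. Loss functions are different from optimization algorithms, which use the loss to decide how to update the model's parameters.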

In other words, we want a model that makes correct predictions, and the loss tells us how far the model is from doing so. This is very useful, as it gives the model developer insight into whether the model needs to be changed during training in order to make correct predictions. Loss functions should not be confused with batch normalization, a separate technique that speeds up and stabilizes the training of neural networks by normalizing the inputs to a layer over each mini-batch.

Several loss functions have been developed over the years, each suited to a particular training task. In PyTorch, loss functions are part of the nn module. This makes them very easy to code, which we will look at in the next section of the article.
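
To make the difference between a loss function and an optimization algorithm concrete, here is a minimal sketch. The one-parameter model, the data, and the learning rate are assumptions chosen purely for illustration: the loss function only measures the error, while the optimizer uses the gradient of that error to update the parameter.

```python
import torch
import torch.nn as nn

# Hypothetical toy problem: learn w so that w * x matches y (the true w is 2.0).
x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([2.0, 4.0, 6.0])
w = torch.zeros(1, requires_grad=True)

loss_fn = nn.MSELoss()                      # measures how wrong the model is
optimizer = torch.optim.SGD([w], lr=0.05)   # decides how to change w

loss_before = loss_fn(w * x, y)
optimizer.zero_grad()
loss_before.backward()   # gradient of the loss with respect to w
optimizer.step()         # one update step using that gradient
loss_after = loss_fn(w * x, y)

print(loss_before.item(), loss_after.item())
```

After the single update step the loss should be smaller than before, which is the signal a developer watches during training.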

How to Add PyTorch Loss Functions using the nn Module

All of PyTorch's loss functions are packaged in the nn module (torch.nn), which also provides nn.Module, PyTorch's base class for all neural networks. This makes adding a loss function to your project as simple as adding a single line of code. The documentation for the nn module can be found on the official PyTorch website.

Now that we know where the loss functions live, let's look at creating one in PyTorch. In the following example, we calculate the mean squared error loss. First we create the mean squared error loss function, then we calculate the loss in a single line of code.

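import torch
import torch.nn as nn

MSE_loss_fn = nn.MSELoss()

# predicted_value is the prediction from our neural network during training
# (placeholder values are used here so that the snippet runs on its own)
predicted_value = torch.tensor([2.5, 0.0, 2.0])
# target is the actual value in our training dataset
target = torch.tensor([3.0, -0.5, 2.0])
# loss_value is the loss between the predicted value and the actual value
loss_value = MSE_loss_fn(predicted_value, target)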

This is just one example of source code for a loss function in PyTorch, but there are many more that we will explore.


Loss functions available in PyTorch

Loss functions in PyTorch can be broadly categorized into three types: regression loss functions, classification loss functions, and ranking loss functions. Regression losses deal with continuous target values, which can take any value within a range. One example of a regression problem is predicting the house prices in a community.

Classification losses, on the other hand, deal with discrete target values (class labels) instead of continuous ones. As expected, classification losses are used in classification problems. Finally, ranking losses are used to predict the relative distance between inputs. An example of this would be face verification, where we want to know which face images belong to a particular person, and can do so by ranking the images by how closely they approximate the target face scan.

Now that we know the general categories of loss functions, we can look at specific loss functions and how they are implemented in PyTorch.

PyTorch Mean Absolute Error (L1 Loss Function)

Mean absolute error, also known as the L1 loss function, measures the average error between each value in the prediction and the corresponding value in the target. It does this by first calculating the absolute value of the difference between each predicted value and its target value.

It then calculates the sum of all the values from the first step. Lastly, it divides that sum by the number of values to obtain the average. This final value is known as the mean absolute error. Now that we know how it is done, let's look at how to code it. Below is an example of the L1 loss function. The single value returned is the computed loss between two tensors of shape 3 by 5.

import torch
import torch.nn as nn

# size_average and reduce are deprecated; use reduction instead.
# reduction specifies the reduction to apply to the output: 'mean' (default)
# averages the output, 'sum' sums it, and 'none' applies no reduction.
loss_fn = nn.L1Loss(reduction='mean')

input = torch.randn(3, 5, requires_grad=True)
target = torch.randn(3, 5)
output = loss_fn(input, target)
print(output)  # e.g. tensor(0.7772, grad_fn=<L1LossBackward>)
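
The three steps described above can also be written out by hand and compared with the built-in function. This is a minimal sketch; the tensor values and variable names are made up so that every step can be checked manually.

```python
import torch
import torch.nn as nn

# Small hand-picked tensors so every step can be checked by hand.
prediction = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
target = torch.tensor([[1.5, 2.0], [2.0, 6.0]])

abs_diff = torch.abs(prediction - target)   # step 1: |prediction - target|
total = abs_diff.sum()                      # step 2: sum of absolute differences
manual_mae = total / abs_diff.numel()       # step 3: divide by the element count

builtin_mae = nn.L1Loss()(prediction, target)
print(manual_mae, builtin_mae)  # both tensor(0.8750)
```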

PyTorch Mean Squared Error Loss Function

The mean squared error loss function is quite similar to the mean absolute error loss function. The difference is that in the mean squared error loss function we compute the square of the difference between the prediction value and the target value.

As a result, large differences are penalized more and small differences are penalized less. One disadvantage of the mean squared error loss function is that it performs poorly when the data has outliers and lots of noise, because the squared errors of those points dominate and skew the loss.

Below is an example of the source code for a mean squared error loss function in PyTorch.

import torch
import torch.nn as nn

# The parameter explanation given for the L1 loss function applies here as well.
loss = nn.MSELoss(reduction='mean')

input = torch.randn(3, 5, requires_grad=True)
target = torch.randn(3, 5)
output = loss(input, target)
print(output)  # e.g. tensor(0.9823, grad_fn=<MseLossBackward>)
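
The sensitivity to outliers is easy to see on a tiny example. The sketch below uses made-up predictions against a target of zeros and compares mean absolute error with mean squared error, once with uniform errors and once with a single large error.

```python
import torch
import torch.nn as nn

target = torch.zeros(4)
clean = torch.tensor([1.0, 1.0, 1.0, 1.0])      # every error is 1
outlier = torch.tensor([1.0, 1.0, 1.0, 10.0])   # one large error

mae, mse = nn.L1Loss(), nn.MSELoss()
print(mae(clean, target).item(), mse(clean, target).item())      # 1.0 1.0
print(mae(outlier, target).item(), mse(outlier, target).item())  # 3.25 25.75
```

A single outlier roughly triples the mean absolute error but multiplies the mean squared error by more than twenty-five.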

PyTorch Negative Log-Likelihood Loss Function

The PyTorch negative log-likelihood loss function is closely related to the cross-entropy loss function, which we will go over later. Cross-entropy loss combines a log-softmax layer and a negative log-likelihood loss to obtain its value.

The softmax layer is simply an activation function that turns raw scores into probabilities, and log-softmax returns the logarithm of those probabilities. Negative log-likelihood loss takes these log-probabilities and returns the negative log-probability assigned to the correct class, averaged over the batch. It can therefore be used to obtain the cross-entropy loss value by having the last layer of the neural network be a log-softmax layer instead of a normal softmax layer.

An example of it in PyTorch can be seen below.

import torch
import torch.nn as nn

m = nn.LogSoftmax(dim=1)
loss = nn.NLLLoss()
# input is of size N x C = 3 x 5
input = torch.randn(3, 5, requires_grad=True)
# each element in target has to have 0 <= value < C
target = torch.tensor([1, 0, 4])
output = loss(m(input), target)
# 2D loss example (used, for example, with image inputs)
N, C = 5, 4
loss = nn.NLLLoss()
# input is of size N x C x height x width
data = torch.randn(N, 16, 10, 10)
conv = nn.Conv2d(16, C, (3, 3))
m = nn.LogSoftmax(dim=1)
# each element in target has to have 0 <= value < C
target = torch.empty(N, 8, 8, dtype=torch.long).random_(0, C)
output = loss(m(conv(data)), target)
print(output) #tensor(1.4892, grad_fn=<NllLoss2DBackward>)
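
The relationship between the two losses can be checked directly. The following sketch, with a fixed seed and made-up logits, compares nn.NLLLoss applied to log-softmax outputs with nn.CrossEntropyLoss applied to the raw scores.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(3, 5)          # raw, unnormalized scores for 5 classes
target = torch.tensor([1, 0, 4])    # class indices

nll_of_log_softmax = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), target)
cross_entropy = nn.CrossEntropyLoss()(logits, target)

print(torch.allclose(nll_of_log_softmax, cross_entropy))  # True
```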

PyTorch Binary Cross-Entropy Loss Function

The binary cross-entropy loss function is used in binary classification problems, where each example belongs to one of two discrete classes. The function measures the difference between two probability distributions, the predicted one and the true one, for a given set of variables.

When using binary cross-entropy loss, the output of our network is usually passed through a sigmoid layer, which ensures that the output is a probability value, that is, a value between zero and one. In mathematical terms the function can be written as σ(x₁) = 1 / (1 + exp(−x₁)), where x₁ is the output of the neural network for a particular example. The output of this function is a number close to zero, but never zero, if x₁ is large and negative, and close to one if x₁ is large and positive. In PyTorch's nn module, nn.BCELoss expects these probabilities, while nn.BCEWithLogitsLoss combines the sigmoid and the binary cross-entropy into a single loss function. For problems with more than two classes, nn.CrossEntropyLoss combines log-softmax and negative log-likelihood loss into a single loss function.
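
As a minimal sketch of the binary case, the example below uses made-up raw outputs and labels. It squashes the outputs into probabilities with a sigmoid before calling nn.BCELoss, and compares the result with nn.BCEWithLogitsLoss, which applies the sigmoid internally.

```python
import torch
import torch.nn as nn

logits = torch.tensor([2.0, -1.0, 0.5])   # raw network outputs
labels = torch.tensor([1.0, 0.0, 1.0])    # binary targets as floats

probs = torch.sigmoid(logits)             # squashes each logit into (0, 1)
bce_on_probs = nn.BCELoss()(probs, labels)
bce_on_logits = nn.BCEWithLogitsLoss()(logits, labels)

print(probs)
print(bce_on_probs, bce_on_logits)
```

The two loss values should agree up to floating-point error. Working from the raw outputs with nn.BCEWithLogitsLoss is generally the more numerically stable choice.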


PyTorch Hinge Embedding Loss Function

Hinge embedding loss is mostly used in semi-supervised learning tasks, where it helps measure the similarity between two inputs. It is used when there is an input tensor and a label tensor containing values of 1 or -1. It can also be used for problems that involve nonlinear embeddings.

An example of hinge embedding loss can be seen below.

import torch
import torch.nn as nn

input = torch.randn(3, 5, requires_grad=True)
# the target must contain only 1 or -1, so take the sign of random values
target = torch.randn(3, 5).sign()

hinge_loss = nn.HingeEmbeddingLoss()
output = hinge_loss(input, target)

print('input: ', input)
print('target: ', target)
print('output: ', output)

# The printed values differ from run to run because the tensors are random.
# output is a scalar tensor with grad_fn=<MeanBackward0>
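
To show what the function computes for each label, here is a hand-written version checked against nn.HingeEmbeddingLoss. The input values are invented and can be read as distances between pairs; the per-element rule follows the formula in the PyTorch documentation.

```python
import torch
import torch.nn as nn

x = torch.tensor([0.2, 1.5, 0.4, 2.0])       # e.g. distances between pairs
y = torch.tensor([1.0, 1.0, -1.0, -1.0])     # 1 = similar pair, -1 = dissimilar
margin = 1.0

# y == 1: the loss is x itself; y == -1: the loss is max(0, margin - x)
per_element = torch.where(y == 1, x, torch.clamp(margin - x, min=0))
manual = per_element.mean()

builtin = nn.HingeEmbeddingLoss(margin=margin)(x, y)
print(manual, builtin)
```

Similar pairs are penalized in proportion to their distance, while dissimilar pairs are only penalized when their distance is smaller than the margin.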

PyTorch Margin Ranking Loss Function

The margin ranking loss function belongs to a group of functions known as ranking loss functions. The main objective of these functions is to examine a set of inputs in a dataset and calculate the relative distance between them.

This is done by taking two inputs and one true label, which is either 1 or -1. If the true label is 1, the first input should be ranked higher than the second input. If the label is -1, the opposite is true: the second input should be ranked higher than the first input.

The relationship can be seen in the code below.

import torch
import torch.nn as nn

loss = nn.MarginRankingLoss()
input1 = torch.randn(3, requires_grad=True)
input2 = torch.randn(3, requires_grad=True)
target = torch.randn(3).sign()
output = loss(input1, input2, target)
print('input1: ', input1)
print('input2: ', input2)
print('output: ', output)

#input1: tensor([-1.1109, 0.1187, 0.9441], requires_grad=True)
#input2: tensor([ 0.9284, -0.3707, -0.7504], requires_grad=True)
#output: tensor(0.5648, grad_fn=<MeanBackward0>)
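
The effect of the label is easier to see with fixed values. The sketch below uses invented scores and writes out the documented formula max(0, -y * (x1 - x2) + margin) by hand before comparing it with nn.MarginRankingLoss.

```python
import torch
import torch.nn as nn

input1 = torch.tensor([0.8, 0.1, 0.5])
input2 = torch.tensor([0.3, 0.6, 0.9])
target = torch.tensor([1.0, 1.0, -1.0])   # 1: input1 should rank higher

margin = 0.0
manual = torch.clamp(-target * (input1 - input2) + margin, min=0).mean()
builtin = nn.MarginRankingLoss(margin=margin)(input1, input2, target)
print(manual, builtin)
```

Only the second pair contributes to the loss here, because it is the only one whose ordering contradicts its label.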

PyTorch Triplet Margin Loss Function

The triplet margin loss function measures the relative similarity between data points. It does this by using triplets of the training data. Each triplet consists of an anchor sample, a positive sample, and a negative sample. There are two main objectives for the function.

First, it has to get the distance between the positive and anchor samples to be as low as possible. Second, it has to make the distance between the anchor and negative samples be greater than a margin value plus the distance between the positive sample and the anchor. Usually, the positive sample belongs to the same class as the anchor, but the negative example does not.

Hence, by using this loss function, we aim to predict a high similarity value between the anchor and the positive sample and a low similarity value between the anchor and the negative sample.

An example of this loss function can be seen below.

import torch
import torch.nn as nn

triplet_loss = nn.TripletMarginLoss(margin=1.0, p=2)
anchor = torch.randn(100, 128, requires_grad=True)
positive = torch.randn(100, 128, requires_grad=True)
negative = torch.randn(100, 128, requires_grad=True)
output = triplet_loss(anchor, positive, negative)
print(output) #tensor(1.1151, grad_fn=<MeanBackward0>)
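
The two objectives can be written as a single expression, max(d(anchor, positive) - d(anchor, negative) + margin, 0). The sketch below is an illustration with made-up embeddings: the positive sample is constructed as a slightly perturbed copy of the anchor, which is an assumption for demonstration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
anchor = torch.randn(4, 8)
positive = anchor + 0.1 * torch.randn(4, 8)   # close to the anchor
negative = torch.randn(4, 8)                  # unrelated to the anchor

margin = 1.0
d_pos = F.pairwise_distance(anchor, positive, p=2)
d_neg = F.pairwise_distance(anchor, negative, p=2)
manual = torch.clamp(d_pos - d_neg + margin, min=0).mean()

builtin = nn.TripletMarginLoss(margin=margin, p=2)(anchor, positive, negative)
print(manual, builtin)
```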

PyTorch Kullback-Leibler Divergence Loss Function

The Kullback-Leibler divergence loss function assumes that we have two probability distributions, P and Q. It measures how much information is lost when P, which is assumed to be the true distribution, is replaced with Q. It differs from cross-entropy loss in that the entropy of P is subtracted, so the divergence is zero when the two distributions are identical.

By doing this we are able to determine the similarity between P and Q. This will allow our machine learning algorithm to create a distribution that is very similar to the true distribution which is P.

Kullback-Leibler divergence is not symmetric: using P to approximate Q does not give the same answer as using Q to approximate P.

An example of the loss function can be seen below.

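import torch
import torch.nn as nn
import torch.nn.functional as F

# KLDivLoss expects the input as log-probabilities and, with log_target=False,
# the target as probabilities; 'batchmean' matches the mathematical definition.
loss = nn.KLDivLoss(reduction='batchmean', log_target=False)
input = F.log_softmax(torch.randn(3, 6, requires_grad=True), dim=1)
target = F.softmax(torch.randn(3, 6), dim=1)
output = loss(input, target)

print('output: ', output)  # a scalar tensor; the value varies from run to run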
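
The asymmetry can be verified on two small distributions. In this sketch the distributions are invented, and the divergence is computed by hand from its definition and then with F.kl_div, which takes the log-probabilities of the approximation as its first argument.

```python
import torch
import torch.nn.functional as F

p = torch.tensor([0.7, 0.2, 0.1])   # treated as the true distribution P
q = torch.tensor([0.4, 0.4, 0.2])   # the approximation Q

# KL(P || Q) = sum_i P_i * (log P_i - log Q_i)
kl_pq = torch.sum(p * (p.log() - q.log()))
kl_qp = torch.sum(q * (q.log() - p.log()))

# F.kl_div takes log-probabilities of Q first, then probabilities of P.
kl_pq_builtin = F.kl_div(q.log(), p, reduction='sum')
print(kl_pq, kl_qp, kl_pq_builtin)
```

The first and third values should match, while the second differs, which is the asymmetry described above.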

How to create a custom loss function in PyTorch?

Now that we have covered a wide variety of loss functions in PyTorch, let's look at how to create a custom loss function. There are two different ways to create custom loss functions. One is called function implementation, the other is known as class implementation. Let's start by discussing function implementation.

Function implementation is the easier of the two. It is as simple as creating a function, passing into it the required inputs and other parameters, performing some operation using PyTorch’s core API or Functional API, and returning a value. Below is an example of a custom implementation of the mean square error loss function. We calculate the mean square error given a prediction tensor and a target tensor.

import torch

def custom_mean_square_error(y_predictions, target):
    square_difference = torch.square(y_predictions - target)
    loss_value = torch.mean(square_difference)
    return loss_value
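
As a quick sanity check, a function like this can be compared with the built-in nn.MSELoss and used for backpropagation. The sketch below repeats the function so that it runs on its own; the random tensors and the seed are placeholders.

```python
import torch
import torch.nn as nn

def custom_mean_square_error(y_predictions, target):
    # Same computation as the function above, repeated so the snippet runs alone.
    return torch.mean(torch.square(y_predictions - target))

torch.manual_seed(0)
y_predictions = torch.randn(3, 5, requires_grad=True)
target = torch.randn(3, 5)

loss_value = custom_mean_square_error(y_predictions, target)
loss_value.backward()   # autograd works because only torch operations are used

print(torch.allclose(loss_value, nn.MSELoss()(y_predictions, target)))  # True
print(y_predictions.grad.shape)  # torch.Size([3, 5])
```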

Now let's look at the class implementation of custom loss functions. This is the more standard and recommended way of creating custom loss functions. The loss function is created as a node in the neural network graph by subclassing nn.Module. This means that our custom loss function is a PyTorch layer in exactly the same way a hidden layer is. Again as an example, we show a custom implementation of the mean square error loss function using class implementation below.

import torch
import torch.nn as nn

class Custom_MSE(nn.Module):
    def __init__(self):
        super(Custom_MSE, self).__init__()

    def forward(self, predictions, target):
        square_difference = torch.square(predictions - target)
        loss_value = torch.mean(square_difference)
        return loss_value

    # nn.Module.__call__ already dispatches to forward(), so __call__
    # does not need to be overridden.
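
One practical reason to prefer the class form is that configuration can be stored on the object when it is constructed. The WeightedMSE class below is a hypothetical illustration of that idea, not something from the PyTorch library; it scales the mean square error by a fixed weight.

```python
import torch
import torch.nn as nn

class WeightedMSE(nn.Module):
    """Hypothetical example: MSE scaled by a fixed weight chosen at construction."""

    def __init__(self, weight=1.0):
        super().__init__()
        self.weight = weight

    def forward(self, predictions, target):
        return self.weight * torch.mean(torch.square(predictions - target))

predictions = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
target = torch.tensor([1.0, 1.0, 1.0])

loss_fn = WeightedMSE(weight=0.5)
loss_value = loss_fn(predictions, target)   # calls forward() via nn.Module.__call__
loss_value.backward()
print(loss_value.item())  # 0.5 * mean([0, 1, 4]), roughly 0.8333
```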

How to monitor PyTorch loss functions?

Monitoring the loss function is essential throughout the training process and across training epochs, and one of the biggest mistakes is to go through the entire training procedure without monitoring it.

Monitoring is usually set up once we have written the loss function and are using it to train the neural network. There are five low-level components in a neural network workflow: the dataset, the network architecture, training, loss validation and visualization, and finally inference.

As an example of monitoring loss functions, we will look at the FashionMNIST image classification task. FashionMNIST is an image classification dataset that consists of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28×28 grayscale image, associated with a label from 10 classes.

The two losses that we will monitor are the training loss and the validation loss. Training loss shows how well the model has learned from the training data, while validation loss shows whether the model is overfitting or underfitting. These are two of the biggest and most common problems that machine learning models run into.

The training loss is usually somewhat lower than the validation loss when monitoring the outputs; a validation loss that keeps rising while the training loss keeps falling is a sign of overfitting. We monitor these two values by graphing them during the image classification process. The full code can be seen here
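
As a minimal sketch of the monitoring pattern, the example below records the training and validation loss at every epoch. To keep it self-contained, it uses small synthetic regression data instead of FashionMNIST; the model, the split sizes, and the hyperparameters are all assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical synthetic data standing in for a real dataset.
X = torch.randn(200, 10)
y = X @ torch.randn(10, 1) + 0.1 * torch.randn(200, 1)
X_train, y_train = X[:160], y[:160]
X_val, y_val = X[160:], y[160:]

model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

train_losses, val_losses = [], []
for epoch in range(100):
    model.train()
    optimizer.zero_grad()
    train_loss = loss_fn(model(X_train), y_train)
    train_loss.backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():   # no gradients are needed for validation
        val_loss = loss_fn(model(X_val), y_val)

    train_losses.append(train_loss.item())
    val_losses.append(val_loss.item())

print(f"first epoch: train={train_losses[0]:.4f}, val={val_losses[0]:.4f}")
print(f"last epoch:  train={train_losses[-1]:.4f}, val={val_losses[-1]:.4f}")
```

The two lists can then be passed to a plotting library to draw the training and validation curves on the same axes.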



Throughout the article we explored various types of loss functions that can be used to train neural network architectures. With the wide array of loss functions available, it can sometimes be difficult to choose the right function for a particular problem, and choosing the wrong type of loss function is a pretty common mistake. We hope this article serves as a guide to when to use certain loss functions. Thank you for reading this article.

