Cross Entropy Loss in Machine Learning

Cross entropy loss explained: binary cross entropy loss formula, categorical cross entropy, focal loss, label smoothing, PyTorch code, and production tips.

Introduction

Cross entropy loss is the workhorse function that quietly trains almost every modern classifier you use, from spam filters to medical image models. The math seems intimidating at first, yet the idea behind it is simple. According to a survey of classification losses on GeeksforGeeks, cross entropy remains the default choice for over 80 percent of supervised deep learning models. The function rewards confident correct predictions and punishes confident wrong ones with a sharply growing penalty. That asymmetry shapes how networks learn, how calibrated they become, and how dangerous they can be in safety-critical settings. This guide walks through the general formula, the binary and categorical cases, and the production tricks practitioners use every day. The goal is to leave you able to compute cross entropy loss by hand and to debug it in a real PyTorch project.

Quick Answers on Cross Entropy Loss

What is cross entropy loss in machine learning?

Cross entropy loss measures the distance between the predicted probability distribution and the true label distribution. Lower values mean the model assigned high probability to the correct class for each training example.

What is the binary cross entropy loss formula?

The binary cross entropy loss formula is BCE = -[y log(p) + (1-y) log(1-p)] averaged across samples. Here y is the true label and p is the predicted probability of the positive class.

Why use cross entropy loss instead of mean squared error?

Cross entropy gives a stronger gradient when predictions are wrong and pairs naturally with softmax. Mean squared error produces shallow gradients near saturation, which slows training of probabilistic classifiers significantly.

Key Takeaways

  • Cross entropy loss measures the gap between predicted probabilities and true labels using a logarithmic penalty that grows fast as confidence in the wrong class grows.
  • The binary cross entropy loss formula handles two-class problems while categorical cross entropy generalizes the same idea to many classes with softmax outputs.
  • Softmax paired with cross entropy produces a clean gradient equal to prediction minus truth, which is why nearly every classification network trains this way.
  • Production cross entropy work usually involves logits, class weights, label smoothing, and focal loss variants that fix calibration and class imbalance problems.

What Cross Entropy Loss Means in One Paragraph

Cross entropy loss is a function that measures the gap between a predicted probability distribution and the true label distribution using a negative logarithm penalty. It is the default training objective for almost every modern classifier across binary, multiclass, and structured prediction settings.

What Is Cross Entropy Loss in Machine Learning?

Cross entropy loss in machine learning is a scalar number that summarizes how badly a classifier disagrees with the truth. It treats both the model output and the label as probability distributions over classes and measures their mismatch using logarithms. The function rises sharply when the model places low probability on the actual correct class. That sharp rise is what teaches a network to become confident in the right direction during gradient descent.

The Information Theory Behind Cross Entropy

Cross entropy loss has deep roots in information theory, where entropy describes the average number of bits needed to encode samples from a distribution. The cross entropy between two distributions p and q answers a different question. It asks how many bits we spend if we use a code optimized for q to compress data actually drawn from p. When q matches p exactly, cross entropy collapses back into the entropy of p itself. Any mismatch between the two distributions costs extra bits and produces a larger number.
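
The coding-cost picture can be checked numerically. The sketch below is illustrative only: the function name `cross_entropy_bits` and the two coin distributions are assumptions chosen for the example, and the logarithm is base 2 so the result is in bits.

```python
import math

def cross_entropy_bits(p, q):
    """Average bits spent coding samples drawn from p with a code built for q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]       # true distribution: a fair coin (illustrative)
q_good = [0.5, 0.5]  # model that matches p exactly
q_bad = [0.9, 0.1]   # model that is confidently miscalibrated

entropy_p = cross_entropy_bits(p, p)       # entropy of p itself
matched = cross_entropy_bits(p, q_good)    # same as the entropy of p
mismatched = cross_entropy_bits(p, q_bad)  # extra bits caused by the mismatch
print(entropy_p, matched, mismatched)
```

The gap between `mismatched` and `entropy_p` is the extra cost of using the wrong code.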

The connection helps explain why we minimize cross entropy during training rather than some other distance metric. The true label distribution p is a one-hot vector that places all probability on the correct class. The model output q is its softmax probability vector across all classes. Minimizing cross entropy forces q to look more and more like p, which is exactly what supervised classification asks of any model. The information theory framing also explains why cross entropy is never negative and reaches zero only with perfect prediction.

The information-theoretic interpretation matters for engineers because it gives an intuition for absolute loss values. Frameworks compute the loss with the natural logarithm, so values are in nats rather than bits, and a uniform guess over n classes costs ln(n). A cross entropy loss of 0.69, which is ln(2), in a balanced binary problem means the model is no better than a random coin flip. Any binary classifier with average loss above ln(2) on balanced data is doing worse than guessing and should be retrained or rearchitected. Multiclass problems use the same math with ln(n) as the random baseline. That single number tells you whether your model has learned anything before you even check accuracy on the validation set.
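
A quick way to get the baseline for a given problem is to compute it directly. This is a minimal sketch that assumes balanced classes and the natural logarithm; `random_baseline` is a hypothetical helper name.

```python
import math

def random_baseline(num_classes):
    """Loss in nats of a model that predicts the uniform distribution."""
    return math.log(num_classes)

for n in (2, 3, 10, 1000):
    print(n, round(random_baseline(n), 3))
```

A validation loss that sits at or above this value suggests the model has not learned anything useful about the labels yet.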

Binary Cross Entropy Loss and Its Formula

Binary cross entropy loss handles classification problems with exactly two possible outcomes, like spam versus not spam or fraudulent versus legitimate transaction. The binary cross entropy loss formula is written as BCE = -[y log(p) + (1-y) log(1-p)] for a single example. Here y is the true label, taking value 0 or 1, and p is the predicted probability that the example belongs to the positive class. Across an entire batch of N samples, we average the per-example loss to get the final value reported during training. That average is what the optimizer minimizes through backpropagation.

The formula has an elegant symmetry that becomes obvious once you stare at it for a minute. When the true label y equals 1, only the first term survives because (1-y) is zero. The penalty collapses to negative log of the predicted positive probability. When the true label y equals 0, only the second term survives, and the penalty becomes negative log of the predicted negative probability. That switching behavior comes from a derivation rooted in the Bernoulli likelihood. Maximizing the log likelihood of independent Bernoulli observations is mathematically identical to minimizing this binary cross entropy expression.

The shape of the binary cross entropy loss curve explains why models train aggressively away from confident wrong predictions. As p approaches the true label, the loss falls slowly toward zero in a gentle curve. As p moves toward the opposite end, the loss climbs without bound, because the negative logarithm diverges as its argument approaches zero. A prediction of 0.01 for a sample with label 1 yields a loss near 4.6, which is enormous compared to typical batch averages around 0.4 or 0.5. That sharp asymmetry forces the optimizer to pour gradient updates into the parameters responsible for the worst mistakes. Subtle errors get tiny updates while egregious errors get massive ones, which is exactly the prioritization classifiers need.
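
The formula is short enough to compute by hand or in plain Python. The sketch below is not a framework implementation: the function names and the small clamp `eps` (added here only to avoid log(0)) are assumptions for illustration.

```python
import math

def binary_cross_entropy(y, p, eps=1e-12):
    """BCE for one example: y is the 0/1 label, p the predicted P(positive)."""
    p = min(max(p, eps), 1.0 - eps)  # clamp so log never sees exactly 0
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def batch_bce(labels, probs):
    """Average per-example BCE over a batch."""
    losses = [binary_cross_entropy(y, p) for y, p in zip(labels, probs)]
    return sum(losses) / len(losses)

confident_right = binary_cross_entropy(1, 0.9)   # small penalty
confident_wrong = binary_cross_entropy(1, 0.01)  # the "near 4.6" case from the text
print(round(confident_right, 3), round(confident_wrong, 3))
print(round(batch_bce([1, 0, 1], [0.9, 0.2, 0.01]), 3))
```

One badly wrong example dominates the batch average, which is the asymmetry described above.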

Most deep learning frameworks expose binary cross entropy through two distinct interfaces that look similar but behave differently. PyTorch offers BCELoss, which expects sigmoid probabilities as input, and BCEWithLogitsLoss, which expects raw logits and applies sigmoid internally for numerical stability. The logit version uses the log-sum-exp trick to avoid overflow and underflow that can poison training. Whenever you see a tutorial recommend BCEWithLogitsLoss over BCELoss, it is because the combined version is numerically robust at extreme prediction values. Practitioners almost always use the logit form in production because it stays stable when noisy data pushes predictions to extreme values.
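
The difference between the two interfaces shows up only at extreme values. The sketch below uses made-up logits: on moderate inputs the two losses agree, while on a very negative logit the sigmoid underflows in float32 and BCELoss can only report a clamped value (its documentation states that log outputs are clamped at -100).

```python
import torch
import torch.nn as nn

targets = torch.tensor([1.0, 0.0, 1.0])

# Moderate logits: both interfaces give the same loss.
logits = torch.tensor([2.0, -1.0, 0.5])
loss_from_logits = nn.BCEWithLogitsLoss()(logits, targets)
loss_from_probs = nn.BCELoss()(torch.sigmoid(logits), targets)

# Extreme logit: a confidently wrong prediction for a positive example.
extreme_logit = torch.tensor([-200.0])
extreme_target = torch.tensor([1.0])
stable = nn.BCEWithLogitsLoss()(extreme_logit, extreme_target)
via_sigmoid = nn.BCELoss()(torch.sigmoid(extreme_logit), extreme_target)

print(loss_from_logits.item(), loss_from_probs.item())
print(stable.item(), via_sigmoid.item())
```

The logit form keeps the true magnitude of the mistake, so the loss still reflects how wrong the prediction is.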

Categorical Cross Entropy for Multiclass Problems

In practice, categorical cross entropy extends the binary case to classification problems with three or more classes, like image categorization or sentiment analysis with positive, neutral, and negative buckets. The categorical cross entropy loss formula is CCE = -sum(y_i log(p_i)) summed across all classes. The true label vector y is one-hot, meaning it has a 1 in the correct class position and zeros everywhere else. The predicted vector p comes from a softmax layer that normalizes raw logits into a valid probability distribution. Only the term corresponding to the true class survives because all other y_i values are zero.

The simplification has a satisfying consequence for both intuition and implementation. The categorical loss for any example reduces to negative log of the predicted probability for the correct class. If the network predicts 0.7 for the true class, the loss is roughly 0.36, which signals a confident correct prediction. If the network predicts 0.05 for the true class, the loss explodes to almost 3, signaling a confident wrong prediction. The framework never needs to materialize the full one-hot label vector. It just looks up the predicted probability at the index of the true class and takes its negative logarithm.
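
The lookup shortcut and the full sum give the same number, which a few lines of plain Python can confirm. The function names and probability vectors are illustrative assumptions.

```python
import math

def categorical_cross_entropy(probs, true_index):
    """Lookup form: negative log of the probability at the true class index."""
    return -math.log(probs[true_index])

def categorical_cross_entropy_one_hot(probs, one_hot):
    """Full-sum form: -sum(y_i * log(p_i)), skipping the zero-label terms."""
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs) if y > 0)

right = categorical_cross_entropy([0.7, 0.2, 0.1], 0)     # true class gets 0.7
wrong = categorical_cross_entropy([0.05, 0.90, 0.05], 0)  # true class gets 0.05
full_sum = categorical_cross_entropy_one_hot([0.7, 0.2, 0.1], [1, 0, 0])
print(round(right, 3), round(wrong, 3), round(full_sum, 3))
```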

Sparse categorical cross entropy is a memory-friendly variant that handles the same math without ever forming the one-hot vector. Instead of receiving a vector with N-1 zeros, the loss function receives just the integer class index as input. This sparse form is what production training pipelines use for ImageNet, COCO, and any task with thousands of classes, since storing one-hot vectors for batches of 256 images across 21,841 classes wastes considerable memory. PyTorch’s CrossEntropyLoss uses sparse class indices by default. TensorFlow exposes SparseCategoricalCrossentropy as a separate class for the same reason.
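
In PyTorch the two target formats can be compared directly. This sketch uses random logits and assumes PyTorch 1.10 or later, where `F.cross_entropy` also accepts floating-point class probabilities as targets.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_classes = 5
logits = torch.randn(4, num_classes)

# Sparse targets: one integer class index per example.
class_indices = torch.tensor([0, 3, 1, 4])
loss_sparse = F.cross_entropy(logits, class_indices)

# Dense targets: float one-hot rows, treated as class probabilities.
one_hot = F.one_hot(class_indices, num_classes).float()
loss_dense = F.cross_entropy(logits, one_hot)

print(loss_sparse.item(), loss_dense.item())
```

The index form stores one integer per example instead of a full row per example, which is where the memory saving comes from.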

How Softmax and Cross Entropy Work Together

Beyond the basics, the pairing of softmax with cross entropy is one of the most consequential design choices in modern deep learning. Softmax takes a vector of raw logits and turns it into a probability distribution by exponentiating each value and dividing by the sum of exponentials. The resulting numbers are all positive, sum to one, and concentrate mass on the largest logit. Cross entropy then measures the distance between this distribution and the one-hot label distribution. The combination is so common that frameworks fuse the two operations into a single optimized kernel.

What makes the pairing special is the mathematical interaction during backpropagation. If you compute softmax separately and then cross entropy separately, the gradient calculation involves messy chain rule terms with exponentials in both numerator and denominator. When you treat softmax-and-cross-entropy as one operation, those messy terms cancel almost completely. The final gradient with respect to each logit collapses into a strikingly simple expression equal to the predicted probability minus the true label. That neat identity is why classification networks train so efficiently and why frameworks treat the softmax function and its loss as a single unit.
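
The cancellation can be written out in a few lines. This is a standard derivation sketch, assuming natural logarithms and a target y whose entries sum to one (a one-hot label is the usual case). The symbols z for logits, p for softmax probabilities, and the Kronecker delta are notation chosen here.

```latex
\begin{aligned}
p_k &= \frac{e^{z_k}}{\sum_j e^{z_j}}, \qquad
L = -\sum_k y_k \log p_k, \qquad \sum_k y_k = 1 \\
\frac{\partial \log p_k}{\partial z_i} &= \delta_{ki} - p_i \\
\frac{\partial L}{\partial z_i}
  &= -\sum_k y_k \left(\delta_{ki} - p_i\right)
   = -y_i + p_i \sum_k y_k
   = p_i - y_i
\end{aligned}
```

The exponentials disappear in the second line because differentiating the log of a softmax leaves only the probability itself.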

The Gradient That Makes Training Efficient

The clean gradient of softmax cross entropy is the unsung hero behind every classifier you have ever used. For each logit z_i, the partial derivative of the loss is p_i minus y_i, where p_i is the softmax probability and y_i is the one-hot label. This means the gradient lies between minus one and one for every logit, no matter how confident the network is. Bounded gradients keep training stable and help prevent the optimizer from making catastrophic updates after a single bad batch. The shape of the gradient also has an intuitive interpretation: a gradient descent step raises the true class logit in proportion to one minus its predicted probability and lowers each wrong class logit in proportion to its current probability mass.

The bounded gradient compares favorably to alternatives whose gradients vanish or cut off. Mean squared error on softmax outputs produces gradients that shrink whenever the outputs saturate, even when the prediction is wrong, which can slow training to a crawl. Hinge loss produces gradients that abruptly stop once the margin is reached, which leaves correctly classified examples invisible to the optimizer. Cross entropy keeps providing nonzero gradient for every imperfect prediction, even when the answer is already correct. Every misclassified example keeps receiving a substantial gradient until the network corrects it, which is why cross entropy loss can drive training accuracy close to 100 percent when the model has enough capacity. That property is hard to match with alternative losses.

The cleanest derivations of the softmax cross entropy gradient appear in machine learning lecture notes from Stanford and CMU. A reader-friendly version is Paras Dahal's step-by-step derivation, which walks through every algebraic step from chain rule to the final collapsed form. Students often see the same simplification on Andrej Karpathy's blog and in fast.ai course materials. The takeaway is identical across every source: never compute softmax and cross entropy separately when training a classifier in modern frameworks. The fused implementation is faster, more accurate, and easier to debug than the two-step version that appears in older tutorials.

Cross Entropy Loss Versus Mean Squared Error

Cross entropy loss versus mean squared error is one of the most common comparisons in machine learning courses. Mean squared error treats the output as a continuous value and penalizes squared differences from the target. Cross entropy treats the output as a probability and penalizes the negative logarithm of the probability assigned to the correct class. The two losses encode fundamentally different assumptions about the data, which is why mixing them with the wrong output layer causes confusing training failures. Use mean squared error for regression problems where the target is a real number. Use cross entropy for classification problems where the target is a class label.

The gradient difference between the two losses explains why mean squared error fails for classifiers paired with sigmoid or softmax. As predictions saturate near zero or one, the sigmoid derivative collapses toward zero, which kills the gradient of mean squared error along with it. Cross entropy avoids this saturation trap because the logarithm in its definition cancels the sigmoid factor during the chain rule. A network that uses mean squared error on a binary classifier can stall for thousands of iterations on examples where the prediction is wrong but already near zero or one. The same network trained with binary cross entropy escapes that stall in a handful of epochs, which is why the loss choice matters more than most beginners realize.
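
The saturation effect is easy to see by writing both gradients with respect to the logit. This is a minimal sketch for a single sigmoid output. The helper names and the example logit of -10 are assumptions, and the mean squared error here is the plain squared difference without a factor of one half.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_mse_wrt_logit(z, y):
    """d/dz of (sigmoid(z) - y)^2; carries the sigmoid derivative p * (1 - p)."""
    p = sigmoid(z)
    return 2.0 * (p - y) * p * (1.0 - p)

def grad_bce_wrt_logit(z, y):
    """d/dz of BCE on sigmoid(z); the p * (1 - p) factor cancels, leaving p - y."""
    return sigmoid(z) - y

# A confidently wrong prediction: the label is 1 but the logit is very negative.
z, y = -10.0, 1.0
mse_grad = grad_mse_wrt_logit(z, y)
bce_grad = grad_bce_wrt_logit(z, y)
print(mse_grad, bce_grad)
```

With the same wrong prediction, the cross entropy gradient stays close to -1 while the squared error gradient is several orders of magnitude smaller.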

Connecting Cross Entropy to KL Divergence and Log Loss

Cross entropy is closely related to Kullback-Leibler divergence, which measures how much one distribution differs from another. The exact relationship is that the cross entropy of p and q equals the entropy of p plus the KL divergence of q from p, written KL(p || q). When p is fixed, as it is for a one-hot label, the entropy term is a constant and minimizing cross entropy is identical to minimizing KL divergence. That equivalence explains why so much theoretical work on classification phrases its results in terms of KL divergence even when the implementation uses cross entropy. The two losses produce the same gradients under the standard supervised setup.
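
The decomposition follows from adding and subtracting the entropy term. The block below is a short derivation sketch, with p as the label distribution and q as the model output as in the text; the index c for the true class is notation introduced here.

```latex
\begin{aligned}
H(p, q) &= -\sum_i p_i \log q_i \\
        &= -\sum_i p_i \log p_i + \sum_i p_i \log \frac{p_i}{q_i} \\
        &= H(p) + D_{\mathrm{KL}}(p \,\|\, q) \\
\text{one-hot } p \text{ with true class } c:\quad
H(p) &= 0 \;\Rightarrow\; H(p, q) = D_{\mathrm{KL}}(p \,\|\, q) = -\log q_c
\end{aligned}
```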

Log loss is a third name for the same idea, used most often in tabular machine learning and Kaggle competitions. Log loss is simply the average negative log probability of the correct class across all samples, which is identical to categorical cross entropy with one-hot labels. The Wikipedia article on cross entropy and its derivation walks through the equivalence in detail. Practitioners often see the three names used interchangeably in research papers, package documentation, and Kaggle leaderboards. Cross entropy, log loss, and negative log likelihood are the dominant phrases.

The KL framing also clarifies why label smoothing improves calibration. Standard cross entropy pushes the model toward a Dirac one-hot output that places all probability on one class. KL divergence between a one-hot target and a sharp model is small only when the model is also nearly one-hot, which encourages overconfidence. Label smoothing replaces the hard one-hot target with a softened distribution that places a small amount of mass on every class. The KL minimization target becomes a softer distribution that the model can approach without becoming pathologically confident on training data.

The same KL framing explains why knowledge distillation works. A teacher model produces a soft probability distribution that captures more information than a one-hot label, because the wrong classes also carry useful signal about similarity. Training a student to minimize KL divergence from the teacher output transfers that extra information. The student often reaches higher accuracy than the same architecture trained on hard labels alone, which is the central observation of the distillation literature and the setting examined in the label smoothing analysis by Müller, Kornblith, and Hinton. Cross entropy with soft targets is the operational backbone of distillation.

Numerical Stability and Why Logits Beat Probabilities

Looking past the formula, numerical stability is the practical reason every modern framework asks for logits rather than probabilities at the loss function input. Softmax involves dividing exponentials of large numbers, which can overflow to infinity or underflow to zero in 32-bit floating point. Once a value becomes infinite or zero, the gradient becomes meaningless and the training run silently breaks. The standard fix is the log-sum-exp trick, which subtracts the maximum logit from all logits before exponentiating. The math is identical to the naive version, but every intermediate value stays in a numerically safe range.
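
The trick can be shown in plain Python, where overflow raises an exception instead of silently producing infinity as it would in a float tensor. The function names and the logit values are illustrative assumptions, not framework code.

```python
import math

def naive_log_softmax(logits):
    exps = [math.exp(z) for z in logits]  # overflows for large z
    total = sum(exps)
    return [math.log(e / total) for e in exps]

def stable_log_softmax(logits):
    m = max(logits)  # subtracting the max keeps every exponent <= 0
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

large_logits = [1000.0, 0.0]
try:
    naive_log_softmax(large_logits)
    naive_failed = False
except OverflowError:
    naive_failed = True

stable = stable_log_softmax(large_logits)
small_naive = naive_log_softmax([1.0, 2.0, 3.0])
small_stable = stable_log_softmax([1.0, 2.0, 3.0])
print(naive_failed, stable)
```

On small logits both versions agree, which is the sense in which the math is identical.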

The log-sum-exp version is what lives inside CrossEntropyLoss, BCEWithLogitsLoss, and TensorFlow’s sparse softmax cross entropy functions. The user passes raw logits, the loss function computes softmax and cross entropy together with stable arithmetic, and the gradient is computed analytically using the prediction minus truth identity. Implementing the same chain by hand using separate softmax and log calls almost always introduces enough error to derail training in mixed precision. The performance difference is most visible in transformer training on 16-bit floats, where naive softmax overflows constantly and the fused stable kernel does not.

Practitioners who skip the logit interface routinely write bugs that are nearly impossible to diagnose from training curves alone. A model that suddenly diverges after thousands of steps often has a probability that underflowed to zero, so its log term produces infinite or NaN gradients. The fix is almost always to switch from a hand-rolled softmax-then-cross-entropy pipeline to the fused logit form provided by the framework. The PyTorch documentation specifies that CrossEntropyLoss expects unnormalized logits and notes that BCEWithLogitsLoss is more numerically stable than a sigmoid followed by BCELoss. In production code, treat the logit form as a hard requirement rather than a suggestion.

Handling Class Imbalance With Weighted and Focal Loss

Class imbalance is one of the most common reasons a cross entropy classifier disappoints in production despite strong headline metrics. When one class dominates the training set, vanilla cross entropy lets the model collapse onto the majority class and still report low average loss. The classic fix is class weighting, where each example's loss contribution is multiplied by an inverse-frequency weight that elevates rare classes. PyTorch exposes this through the weight argument of CrossEntropyLoss, and the same pattern exists in TensorFlow and scikit-learn. Class weighting is usually enough for moderately imbalanced datasets like medical screening labels that run roughly ten to one.

Severe imbalance, like the 1000 to 1 ratios in object detection between foreground and background anchors, requires more aggressive treatment. Focal loss was introduced by Lin et al. at ICCV 2017 for dense object detection, where it powered the RetinaNet detector to a COCO AP of 39.1, beating the previous best one-stage detector DSSD by 5.9 AP points. The trick is to multiply standard cross entropy by a modulating factor (1 - p)^gamma, where p is the predicted probability of the true class and gamma is a tunable hyperparameter. Easy examples with high p get nearly zero loss contribution, while hard examples with low p keep close to their full weight. Setting gamma to zero recovers standard cross entropy as a special case.
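
A few numbers make the modulating factor concrete. The sketch below works on a single example in plain Python. `focal_term` is a hypothetical helper, and the probabilities 0.9 and 0.1 are chosen only to contrast an easy example with a hard one.

```python
import math

def focal_term(p_true, gamma=2.0):
    """Focal loss for one example, given the probability of its true class."""
    return ((1.0 - p_true) ** gamma) * -math.log(p_true)

for name, p in (("easy", 0.9), ("hard", 0.1)):
    ce = -math.log(p)
    factor = (1.0 - p) ** 2.0
    print(name, round(ce, 4), round(factor, 4), round(focal_term(p), 4))

# gamma = 0 turns the factor into 1, leaving plain cross entropy.
same_as_ce = focal_term(0.9, gamma=0.0)
```

With gamma equal to 2, the easy example keeps about 1 percent of its cross entropy and the hard example keeps about 81 percent.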

Focal loss has since spread beyond detection into medical imaging, fraud detection, and many imbalanced tabular problems. A study on canine red blood cell morphology classification with focal loss showed that a focal loss CNN beat a cross entropy CNN on F1 score for rare cell types in a long-tailed dataset. The authors reported a measurable improvement on minority classes that had been almost invisible to a plain cross entropy model. The lesson is that the right loss function can recover signal that no amount of architecture tweaking would surface. Imbalance is fundamentally a loss problem, not a model problem.

The Ultralytics glossary entry on focal loss notes that the modulating factor and the optional alpha balance parameter together let teams tune their classifier behavior class by class. Practitioners typically sweep gamma values between 0.5 and 5 to find the best operating point for their dataset. Focal loss is a common starting point for classification problems where the dominant class outnumbers the minority class by more than 100 to 1. Engineers should still validate the choice with calibration curves and precision-recall curves rather than relying on raw loss values, which can mislead in severely imbalanced settings.

Label Smoothing and Model Calibration

Label smoothing is a small change to the target distribution that yields outsized benefits for calibration and generalization. Instead of training against a one-hot label that places probability 1 on the true class, label smoothing moves a small amount of mass epsilon off the true class and spreads it uniformly across the classes. Typical values of epsilon range from 0.05 to 0.1 in the deep learning literature. The change prevents the model from chasing infinitely confident outputs that overfit the training set. The original recipe appears in the Inception paper, where Szegedy et al. report improvements of about 0.2 percentage points in both top-1 and top-5 error on ILSVRC 2012 with epsilon equal to 0.1.

Subsequent research has shown that label smoothing improves calibration by producing predicted probabilities that better match observed accuracy across confidence buckets. The same Müller, Kornblith, and Hinton analysis cited earlier shows that label smoothing makes representations more concentrated and easier to interpret. The catch is that smoothed teacher models transfer less information to distilled students because their soft labels look too uniform. Teams that use distillation pipelines often disable label smoothing in the teacher and re-enable it for the student to recover both benefits. The trade-off is now a standard topic in any imaging or speech recognition training recipe that aims for calibrated probabilities.

Common Pitfalls and Risks When Using Cross Entropy Loss

The most common pitfall when using cross entropy loss is double-applying softmax. Frameworks like PyTorch already include softmax inside CrossEntropyLoss, so passing softmax outputs into the loss applies the operation twice and produces a much flatter probability distribution. The model still trains but learns slowly because the second softmax compresses the differences between classes and shrinks the gradients. The fix is to pass raw logits and let the loss function handle softmax internally. The same mistake shows up in TensorFlow when users apply softmax in the model and then set from_logits=True on SparseCategoricalCrossentropy. The reverse mistake, passing raw logits while leaving the default from_logits=False, is just as damaging.
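
The effect of the double softmax is visible on a single example. This sketch uses made-up logits for a three-class problem where the model is confident and correct.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[10.0, 0.0, 0.0]])  # confident and correct
target = torch.tensor([0])

# Correct usage: raw logits go straight into the loss.
loss_correct = F.cross_entropy(logits, target)

# Bug: softmax outputs are passed in, so softmax is applied a second time.
loss_double = F.cross_entropy(F.softmax(logits, dim=1), target)

print(loss_correct.item(), loss_double.item())
```

Because softmax outputs lie between 0 and 1, the second softmax can never produce a near one-hot vector. With three classes the loss cannot drop below roughly ln(1 + 2/e), about 0.55, however good the model is.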

A second pitfall is using cross entropy without checking class balance, then being surprised when the model predicts the majority class for everything. The model is doing exactly what cross entropy asked of it on the training set, which is to minimize the average negative log probability of the correct class. A precision-recall curve, a confusion matrix, or a stratified metric tells the real story. Use a precision-recall curve as a sanity check on every imbalanced classifier before declaring victory based on loss values alone. Loss curves can mislead in ways that precision-recall curves do not.

A third pitfall is overlooking mislabeled examples in the training set. Cross entropy is very intolerant of label noise because the negative log of a tiny probability assigned to the labeled class produces a huge loss. A handful of corrupt labels can dominate the average and steer the optimizer in the wrong direction. Robust variants like generalized cross entropy and symmetric cross entropy were designed to handle this case, but the simpler fix is usually to clean up the training set. Quality of labeled data matters more than any clever loss function in noisy domains.

Ethical Stakes of Confident Wrong Predictions

When teams ship classifiers, confident wrong predictions from a cross entropy trained model carry real ethical weight in safety critical domains. A medical screening model that predicts cancer-free with 99 percent confidence on a patient who actually has cancer can delay treatment by months. A self-driving perception system that predicts pedestrian-absent with 99 percent confidence can cause a crash. The loss function does not know about these consequences, but the engineer who chose it does. Calibration techniques like label smoothing, temperature scaling, and Platt scaling exist precisely to soften overconfident outputs and align stated probabilities with observed accuracy.

Beyond calibration, model deployers should reason about which kinds of errors carry the largest cost and how a cross entropy classifier distributes its errors. A model trained on plain cross entropy might be 99 percent accurate overall while missing 50 percent of the rare positive class that actually matters. Weighted cross entropy or focal loss shifts that error distribution toward the practitioner’s actual priorities. Subgroup analysis across demographics, geographies, and time periods catches the failure modes that average loss values hide. The loss function is a starting point for ethical model deployment, not the ending point.

The Future of Loss Functions in Machine Learning

Putting the math aside, the future of loss functions in machine learning is moving toward principled generalizations of cross entropy that handle imbalance, noise, and uncertainty in one unified framework. The unified focal loss family, introduced by Yeung et al. for imbalanced medical image segmentation, generalizes Dice loss and cross entropy under a single set of hyperparameters. The framework recovers binary cross entropy, focal loss, Tversky loss, and Dice loss as special cases. Teams can tune one set of knobs to navigate the full design space rather than picking one loss and hoping for the best. The same idea is gaining traction in tabular classification and retrieval problems.

Another frontier is loss functions designed for the long-tail distribution of language model outputs. Causal language models trained with token-level cross entropy ignore the heavy-tail structure of natural language and overweight common tokens. Researchers are experimenting with rebalanced cross entropy, distillation losses from larger teachers, and reinforcement learning from human feedback to fix these biases. The OpenAI fine-tuning literature documents specific cases where switching from standard cross entropy to a reward-aware objective improved factuality and safety. The shift is reshaping what loss design looks like at the foundation model scale.

The third trend is automated loss design using meta learning. Frameworks like AutoML-Loss search over parameterized loss families to find the one that best matches a given task. Cross entropy and its variants are usually the search space’s foundation, which is a vote of confidence in the original recipe. The current consensus is that no other loss has matched the combination of mathematical elegance, optimization stability, and empirical effectiveness that cross entropy has provided for the last 30 years. New variants will continue to appear, but they will look like cross entropy in disguise.

Loss Function Performance Gains on Imbalanced Benchmarks

Reported improvements when switching from cross entropy to a targeted loss variant.
Gain of the targeted loss variant over the baseline:
  • RetinaNet on COCO (AP): +5.9 AP (versus DSSD, the previous best one-stage detector)
  • Inception v3 on ImageNet (top-1): +0.2 pp
  • Inception v3 on ImageNet (top-5): +0.2 pp
  • Canine RBC morphology (F1, rare class): higher with focal loss
  • Medical segmentation (Dice, BRATS/Polyp): multi-point gain

How to Implement Cross Entropy Loss in PyTorch

Step 1 – Choose the right loss class for your problem

Pick the right loss class before writing any training code. Use BCEWithLogitsLoss for binary classification (it applies the sigmoid internally), CrossEntropyLoss for multiclass classification (it applies log-softmax internally), and NLLLoss only when you have already computed log-softmax outside the loss. The fused logit variants are numerically stable and generally preferred over their non-logit counterparts. The wrong choice silently degrades training quality and surfaces only when validation accuracy underwhelms.

Step 2 – Pass raw logits, never softmax outputs

Always pass raw network logits as input to the loss. The loss class applies log-softmax internally with the log-sum-exp trick for numerical stability. Passing a softmax probability vector instead of logits applies softmax twice and produces a flatter distribution that slows training. The PyTorch documentation is explicit about this requirement, but it remains the most common bug in tutorial code copied from older sources.

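import torch
import torch.nn as nn

# Multiclass classifier with 10 classes
model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 10),  # raw logits, no softmax
)
criterion = nn.CrossEntropyLoss()  # applies log-softmax internally

# Synthetic batch so the snippet runs on its own
x = torch.randn(32, 784)              # 32 flattened 28x28 inputs
target = torch.randint(0, 10, (32,))  # integer class indices (int64)

logits = model(x)                 # shape [batch, 10]
loss = criterion(logits, target)  # scalar mean loss over the batch
loss.backward()                   # populate gradients for the optimizer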

Step 3 – Add class weights for imbalanced data

Imbalanced datasets need class weights or they tend to degenerate into majority-class predictors. Compute inverse-frequency weights from the training set and pass them through the weight argument of CrossEntropyLoss. The weight tensor must be on the same device as the logits and have one entry per class. The optimizer then puts more emphasis on rare classes during gradient updates, which typically lifts recall on rare classes, often at some cost in precision on the common ones.

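import torch
import torch.nn as nn

# Class frequencies from training set
counts = torch.tensor([900., 80., 20.])
weights = 1.0 / counts                            # inverse-frequency weights
weights = weights / weights.sum() * len(counts)   # rescale so the weights sum to the number of classes

# Keep the weights on the same device as the model and logits
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
criterion = nn.CrossEntropyLoss(weight=weights.to(device))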
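
One detail that is easy to miss: with a weight tensor and the default mean reduction, PyTorch divides by the sum of the weights of the targets in the batch, not by the batch size. The sketch below checks this on a made-up batch (logits and targets are illustrative assumptions).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
counts = torch.tensor([900.0, 80.0, 20.0])
weights = 1.0 / counts
weights = weights / weights.sum() * len(counts)

logits = torch.randn(6, 3)                   # hypothetical batch of 6 examples
target = torch.tensor([0, 0, 0, 1, 2, 0])

weighted = F.cross_entropy(logits, target, weight=weights)

# Manual version: weighted sum of per-example losses over the sum of weights.
per_example = F.cross_entropy(logits, target, reduction="none")
w = weights[target]
manual = (w * per_example).sum() / w.sum()

print(weighted.item(), manual.item())        # the two values should agree
```

Because of this normalization, the weighted loss is not directly comparable to an unweighted loss logged on the same data.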

Step 4 – Add label smoothing for calibration

Label smoothing is built into CrossEntropyLoss as a single keyword argument (available since PyTorch 1.10). Pass label_smoothing=0.1 to soften the target distribution and improve calibration. The default of 0 reproduces standard cross entropy with hard one-hot targets. Many large-scale image and text classifiers use a value between 0.05 and 0.1. Calibrated outputs improve downstream decision systems that consume the model’s probabilities.

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
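
To make the softened target concrete, the following sketch rebuilds it by hand and compares the result with the built-in argument. It assumes the uniform-mixing form, in which the epsilon mass is spread over all K classes including the true one; the batch is random and purely illustrative.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
eps, num_classes = 0.1, 5
logits = torch.randn(8, num_classes)
target = torch.randint(0, num_classes, (8,))

built_in = F.cross_entropy(logits, target, label_smoothing=eps)

# Soft target: (1 - eps) on the one-hot label plus eps / K on every class.
one_hot = F.one_hot(target, num_classes).float()
soft = (1 - eps) * one_hot + eps / num_classes
manual = -(soft * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

print(built_in.item(), manual.item())   # the two values should agree
```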

Step 5 – Use focal loss for extreme imbalance

Core PyTorch does not ship a focal loss in torch.nn, although torchvision provides torchvision.ops.sigmoid_focal_loss for the binary and multilabel case. A multiclass version is a few lines on top of cross entropy. Compute per-example cross entropy with reduction set to none, multiply by the focal modulating factor (1 - p_t)^gamma, and average across the batch. Set gamma to 2 as a reasonable default, which is the value the RetinaNet paper found worked best on COCO. Tuning gamma on a validation set, for example over the range 0.5 to 5, helps find a good operating point for a given dataset.

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=None):
    ce = F.cross_entropy(logits, targets, reduction='none')
    p_t = torch.exp(-ce)              # prob of true class
    focal = ((1 - p_t) ** gamma) * ce
    if alpha is not None:
        at = alpha[targets]
        focal = at * focal
    return focal.mean()
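
To get a feel for the modulating factor, the sketch below evaluates the per-example focal term in plain Python for an easy, a borderline, and a hard example. The helper name focal_term and the probability values are assumptions made for illustration; the formula is the same (1 - p_t)^gamma scaling used above.

```python
import math

def focal_term(p_t, gamma=2.0):
    """Focal loss for one example, given the probability of its true class."""
    return (1.0 - p_t) ** gamma * -math.log(p_t)

for p_t in (0.95, 0.5, 0.1):
    ce = -math.log(p_t)
    fl = focal_term(p_t)
    # ratio is the modulating factor (1 - p_t) ** gamma
    print(f"p_t={p_t:.2f}  ce={ce:.4f}  focal={fl:.4f}  ratio={fl / ce:.4f}")
```

With gamma at 2, an example already predicted at 0.95 keeps only a quarter of a percent of its cross entropy, while a hard example at 0.1 keeps about 81 percent. Setting gamma to 0 recovers plain cross entropy.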

Step 6 – Verify gradients with a sanity check

Run a single forward and backward pass on a synthetic batch and inspect the gradient magnitudes. A correctly wired cross entropy classifier should show a gradient with respect to the logits equal to the softmax prediction minus the one-hot target, divided by the batch size under the default mean reduction. Gradients several orders of magnitude larger or smaller indicate a logit-versus-probability bug, a label format bug, or a weight scale mismatch. Catching these issues at sanity check time saves hours of confused debugging on real data later.
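
A minimal version of this check, assuming a synthetic batch of random logits in place of a real model, might look like the following. The division by the batch size reflects the default mean reduction.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, num_classes = 4, 3
logits = torch.randn(batch, num_classes, requires_grad=True)
target = torch.randint(0, num_classes, (batch,))

loss = F.cross_entropy(logits, target)   # mean reduction by default
loss.backward()

# Analytical gradient: (softmax - one_hot) / batch
probs = F.softmax(logits.detach(), dim=1)
one_hot = F.one_hot(target, num_classes).float()
expected = (probs - one_hot) / batch

max_err = (logits.grad - expected).abs().max().item()
print(f"max abs difference: {max_err:.2e}")
```

In a real project the same comparison can be made on the output of the final layer by calling retain_grad() on the logits tensor before the backward pass.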

Key Insights on Cross Entropy Loss

  • According to the GeeksforGeeks tutorial on cross entropy loss, the function reduces to negative log of the predicted probability of the true class. A model assigning 0.5 to the correct class produces a loss of 0.69 nats, the standard baseline number for a binary task.
  • The PyTorch loss reference documentation gives the binary cross entropy loss formula BCE = -[y log(p) + (1-y) log(1-p)], the foundation of two-class neural classifiers trained with a sigmoid output. Production stacks use the logit form (BCEWithLogitsLoss) to keep numerical stability across millions of training steps.
  • The step-by-step softmax cross entropy walkthrough by Paras Dahal derives the gradient simplification to prediction minus truth. That identity is why softmax and cross entropy are paired together in nearly every classification network shipping today.
  • The Lin et al. ICCV 2017 focal loss paper showed focal loss lifted COCO AP by 5.9 points over the previous one-stage detector best. RetinaNet reached AP 39.1, marking a step change in dense object detection benchmarks.
  • The Inception architecture paper by Szegedy et al. reported label smoothing with epsilon 0.1 produced a 0.2 percent absolute top-1 and top-5 improvement on ILSVRC 2012. The recipe still ships in production image classifiers across cloud vision APIs today.
  • The canine red blood cell focal loss study by Pasupa and colleagues demonstrated focal loss CNN beat cross entropy CNN on F1 score for rare cell types. The result rescued minority classes that the cross entropy baseline ignored completely.
  • The Wikipedia cross entropy article documents that cross entropy equals KL divergence plus the entropy of the true distribution. That identity underlies much of the theoretical work on calibration in the classification literature.
  • The Yeung et al. unified focal loss paper introduced a generalized family that subsumes Dice loss, cross entropy, and Tversky loss under one set of hyperparameters. The family has improved imbalanced medical image segmentation across the BRATS and CVC-ClinicDB datasets.

The picture that emerges across these insights is consistent. Cross entropy loss is mathematically the right objective for probabilistic classification because it inherits the clean gradient properties of maximum likelihood estimation. Practitioners get into trouble when they treat cross entropy as a black box rather than understanding the assumptions baked into the formula. The most productive engineers learn the binary cross entropy loss formula, the softmax gradient identity, and the focal loss modulating factor as a connected family. That family is the backbone of every modern classifier you will ever debug. Mastering these three concepts puts you ahead of most teams shipping classification systems today.

How Cross Entropy Compares to Other Loss Functions

| Loss Function | Gradient Behavior | Output Range | Best Use Case | Calibration Quality | Imbalance Handling | Common Failure Mode |
|---|---|---|---|---|---|---|
| Cross Entropy | Bounded between -1 and 1 per logit | 0 to infinity | Multiclass classification with softmax | Overconfident without label smoothing | Poor; needs weighting | Saturation if used with sigmoid+MSE pipeline |
| Binary Cross Entropy | Bounded per output unit | 0 to infinity | Binary or multilabel classification with sigmoid | Overconfident without smoothing | Poor; needs weighting | Underflow with extreme logits if used without logit form |
| Mean Squared Error | Vanishes near saturation | 0 to infinity | Regression on continuous targets | Naturally calibrated for Gaussian targets | None | Stalls on saturated sigmoid outputs in classification |
| Focal Loss | Downweights easy examples by (1-p)^gamma | 0 to infinity | Dense detection or severe class imbalance | Better than plain cross entropy on rare classes | Excellent | Hyperparameter sensitivity for gamma |
| Hinge Loss | Zero past margin, constant before | 0 to infinity | Support vector machines | No probability interpretation | Poor | No useful probabilities for downstream decisions |
| KL Divergence | Equivalent to cross entropy with soft targets | 0 to infinity | Knowledge distillation, soft label training | Inherits teacher calibration | Depends on teacher | Numerical issues if target distribution has zeros |
| Label-Smoothed CE | Same as CE but against softened target | 0 to infinity | Image and text classifiers needing calibration | Strong calibration | Mild improvement | Hurts knowledge distillation if applied to teacher |

Real World Examples of Cross Entropy Loss in Production

Example: Spam Filtering at Scale With Binary Cross Entropy

Email spam filtering deployed across hundreds of millions of inboxes is a typical use of binary cross entropy, which trains sigmoid classifiers that score each incoming message. The implementation typically uses BCEWithLogitsLoss in PyTorch or its TensorFlow equivalent, feeding raw logits from a transformer encoder into the loss function. Figures reported for major providers include false positive rates below 0.1 percent and spam catch rates above 99 percent on benchmark sets. The limitation is that the loss treats every error as equal cost, when in reality a legitimate email marked as spam often hurts users more than a spam email reaching the inbox. Teams commonly compensate by tuning the decision threshold rather than the loss itself. The result is a robust pipeline that ships across consumer email products.

Example: ImageNet Classification With Categorical Cross Entropy

ImageNet classification is the canonical multiclass cross entropy problem with 1000 categories and 1.28 million training images. Modern training recipes pair categorical cross entropy over integer class labels with a label smoothing epsilon of 0.1. The Inception architecture paper reported a gain of about 0.2 percent absolute in both top-1 and top-5 accuracy from this setting, and recipes for ResNet-50 and similar architectures commonly adopt it. The implementation uses the fused logit form of softmax cross entropy in every major training stack. The limitation is that ImageNet labels are noisy in the long tail, where similar categories such as the many terrier breeds confuse human labelers. Cross entropy treats this label noise as real signal, which can cap top-1 accuracy below what clean labels would allow. The recipe was still a common default for production image classifiers in 2024.

Example: Object Detection With RetinaNet Focal Loss

RetinaNet is the canonical example of focal loss replacing plain cross entropy in a production object detection system. The architecture trains on dense anchor boxes where the background class outnumbers foreground objects by roughly 1000 to 1, and standard cross entropy collapses onto the trivial background prediction. Focal loss with gamma equal to 2 lifted RetinaNet COCO AP to 39.1 versus 33.2 for the previous best one-stage detector DSSD, a 5.9 AP point gap reported in the ICCV 2017 focal loss paper by Lin et al. The limitation is that gamma needs careful tuning, and a poorly chosen value can hurt easier datasets that do not have severe imbalance. Production teams now often use focal loss as a starting point for dense detectors and validate gamma on a held out set.

Lessons From Cross Entropy Loss Deployments

Case Study: Medical Imaging Long-Tail Classification With Unified Focal Loss

Medical image segmentation teams face severe class imbalance because the disease region usually covers a tiny fraction of each scan. Yeung et al. tackled the problem in the unified focal loss paper, introducing a family that generalizes Dice loss, Tversky loss, focal Tversky loss, and cross entropy under one set of hyperparameters. The solution allowed practitioners to tune one set of knobs rather than committing to a single loss family at the start. The measurable impact across the BRATS brain tumor and CVC-ClinicDB polyp datasets was a Dice score improvement of several points over plain Dice or cross entropy baselines. The reported numbers covered both small and large lesion sizes, demonstrating that the framework handles different imbalance regimes well.

The limitation flagged by the authors is that the unified family adds new hyperparameters that need cross validation, which raises training cost and complicates reproducibility. Teams adopting unified focal loss have reported strong production gains but warn that tuning takes several days of compute on a single GPU. The lesson for engineering organizations is that loss design is a real research investment, not a free parameter. The case shows that cross entropy is not the end of the story for severely imbalanced classification, and that practitioners willing to learn the unified focal family can extract meaningful gains. The lesson generalizes beyond medical imaging to any imbalanced segmentation problem in industry or research.

Case Study: Label Smoothing Calibration in Speech Recognition

Speech recognition models trained with token-level cross entropy on transcripts have long suffered from overconfident posteriors that mislead downstream language models. The Müller, Kornblith, and Hinton label smoothing paper analyzed the same overconfidence problem on image classification and machine translation, and showed that label smoothing with epsilon between 0.05 and 0.1 produced markedly better expected calibration error. The solution required only a single hyperparameter change in the loss configuration, with no model architecture modifications needed. Expected calibration error dropped by roughly half on the benchmarks the authors reported, and speech pipelines that apply the same change can also see small gains in word error rate when paired with a language model rescoring pass. These results helped make label smoothing a standard component of large-scale speech training recipes.

The limitation that surfaced in the same paper was that label smoothing hurt knowledge distillation downstream because the teacher’s softened probabilities lost discriminative information about wrong classes. Teams that distill speech models often disable label smoothing in the teacher and re-enable it for the student to capture both benefits. The lesson for production teams is that loss tweaks interact in nontrivial ways across pipeline stages, so any calibration intervention needs to be evaluated end to end rather than at a single stage. The story shows that even small changes to cross entropy can yield large measurable wins in calibration when applied with care. The case is now a standard reference in any speech or large language model training plan.

Case Study: Veterinary Hematology With Focal Loss CNNs

A 2020 study on canine red blood cell morphology compared focal loss and cross entropy CNNs on a long-tailed dataset of microscope images. The problem was that normal cells dominated the training distribution while clinically important malformations like dacryocytes and schistocytes appeared in fewer than 1 percent of samples. The solution applied focal loss with gamma equal to 2 on top of a ResNet-style backbone and used standard cross entropy on a matched baseline for comparison. The measurable impact was a higher F1 score for the rare cell types under focal loss, as detailed in the Pasupa et al. canine RBC morphology study. The improvement translated into a clinically meaningful tool that flagged unusual cells for veterinary review, which the cross entropy model had missed entirely.

The limitation that the authors discussed is that focal loss requires gamma tuning, and the same value did not transfer cleanly between human and canine cell datasets they tried. The lesson for cross domain transfer is that even within a tight subdomain like blood cell morphology, hyperparameters of the loss function need re-validation. The case demonstrates how thoughtful loss selection rescues rare class performance that plain cross entropy ignores, and how this matters for downstream clinical decisions. The result is a useful tool in veterinary diagnostics that would not exist if the team had stuck with cross entropy alone. The veterinary example shows the same pattern that detection teams saw with RetinaNet, just at a much smaller scale.

Frequently Asked Questions About Cross Entropy Loss

What is cross entropy loss in machine learning?

Cross entropy loss is a function that measures how different the predicted probability distribution is from the true label distribution. It penalizes confident wrong predictions sharply through a negative log term that grows toward infinity as the predicted probability of the correct class approaches zero. The function is used to train almost every modern classifier, from binary spam filters to large multiclass image models. Lower cross entropy means the model assigns more probability to the correct class for each example.

What is the binary cross entropy loss formula?

The binary cross entropy loss formula for a single example is BCE = -[y log(p) + (1-y) log(1-p)], where y is the true binary label and p is the predicted probability of the positive class. The full batch loss is the average of this expression across all samples in the batch. The formula reduces to negative log p when y is 1 and negative log (1-p) when y is 0. The loss grows toward infinity as the probability assigned to the true label approaches zero, which is why confident wrong predictions produce large corrective gradients.
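
To compute the formula by hand, a small standard-library sketch is enough. The labels and probabilities below are made up for illustration, and the clamping constant eps is an added safeguard against log(0) rather than part of the formula.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average BCE over a batch of (label, probability) pairs."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)   # clamp so log never sees exactly 0
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.6, 0.4]   # predicted probability of the positive class

loss = binary_cross_entropy(labels, probs)
print(f"batch BCE: {loss:.4f}")   # about 0.3375
```

The per-example terms are -log(0.9), -log(0.8), -log(0.6), and -log(0.6), so the two less confident predictions account for most of the batch loss.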

What is the difference between binary and categorical cross entropy?

Binary cross entropy applies to classification problems with exactly two outcomes and pairs with a sigmoid activation on a single output neuron. Categorical cross entropy applies to multiclass problems with three or more outcomes and pairs with a softmax activation across all class neurons. Binary cross entropy computes one penalty per example based on the sigmoid output and the binary label. Categorical cross entropy computes the negative log of the predicted probability for the true class only, since the one-hot label zeros out the other terms. The two losses are equivalent when the multiclass problem has exactly two classes.
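
The two-class equivalence can be checked numerically. In the sketch below, a two-logit softmax cross entropy is compared with sigmoid binary cross entropy applied to the logit difference z1 - z0; the logit values and helper names are illustrative assumptions.

```python
import math

def softmax_ce(logits, target):
    """Cross entropy on raw logits, using a stable log-sum-exp."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

def sigmoid_bce(logit, y):
    """Binary cross entropy on a single raw logit."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

z0, z1 = 0.3, 1.7
diffs = []
for y in (0, 1):
    two_class = softmax_ce([z0, z1], y)
    binary = sigmoid_bce(z1 - z0, y)
    diffs.append(abs(two_class - binary))
    print(f"y={y}  softmax CE={two_class:.6f}  sigmoid BCE={binary:.6f}")
```

The match holds because the softmax probability of class 1 over two logits is exactly the sigmoid of their difference.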

Why use cross entropy loss instead of mean squared error?

Cross entropy gives a much stronger gradient signal than mean squared error when predictions are wrong because of the logarithm at its core. Mean squared error gradients shrink toward zero as sigmoid or softmax outputs saturate, which stalls training on examples that need the most correction. Cross entropy avoids this trap because the log term cancels the activation derivative during the chain rule. The result is that classifiers trained on cross entropy typically converge in fewer iterations than the same model trained with mean squared error. The combination of softmax and cross entropy also produces a clean prediction-minus-truth gradient.
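
The saturation effect is easy to see for a single sigmoid unit. The sketch below assumes a squared-error loss of (p - y)^2 and compares its gradient with respect to the logit against the cross entropy gradient, for a confidently wrong prediction; the logit value of -8 is chosen only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_mse(z, y):
    """d/dz of (sigmoid(z) - y)^2; the p * (1 - p) factor vanishes when saturated."""
    p = sigmoid(z)
    return 2.0 * (p - y) * p * (1.0 - p)

def grad_ce(z, y):
    """d/dz of binary cross entropy on sigmoid(z); the activation derivative cancels."""
    return sigmoid(z) - y

z, y = -8.0, 1   # the model is confident in the wrong class
print(f"MSE gradient: {grad_mse(z, y):.6f}")
print(f"CE gradient:  {grad_ce(z, y):.6f}")
```

The cross entropy gradient stays close to -1 while the squared-error gradient is smaller by roughly three orders of magnitude, so the example that most needs correcting barely moves the weights under mean squared error.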

Is cross entropy loss the same as log loss?

Cross entropy loss and log loss are different names for the same mathematical object in the context of supervised classification with one-hot labels. Log loss is the dominant name in Kaggle competitions and the scikit-learn API, while cross entropy is the name used in deep learning frameworks. Negative log likelihood is a third name used in probabilistic modeling courses. All three refer to the average negative log probability the model assigned to the correct class. The naming convention varies by community but the formula does not.

What is a good cross entropy loss value?

A good cross entropy loss value depends entirely on the number of classes and the difficulty of the task. For a balanced binary problem, a value below 0.4 usually signals a useful classifier and a value near 0.69 indicates random guessing. For multiclass problems, the random baseline equals the natural log of the number of classes, so 1000-class ImageNet has a chance baseline near 6.9. Production image classifiers typically reach values between 0.5 and 1.5 depending on the architecture and dataset size. Compare against the chance baseline and a strong baseline model rather than fixating on absolute numbers.
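
The baselines are quick to compute. The sketch below gives the uniform-guess baseline and, as an added reference point for imbalanced data, the loss of a model that always predicts the class frequencies. The helper names and the example counts are assumptions for illustration.

```python
import math

def chance_baseline(num_classes):
    """Loss of a model that assigns equal probability to every class."""
    return math.log(num_classes)

def prior_baseline(counts):
    """Loss of a model that always predicts the training class frequencies."""
    total = sum(counts)
    return -sum(c / total * math.log(c / total) for c in counts if c > 0)

print(f"binary:       {chance_baseline(2):.4f}")
print(f"1000 classes: {chance_baseline(1000):.4f}")
print(f"prior, counts 900/80/20: {prior_baseline([900, 80, 20]):.4f}")
```

On imbalanced data the prior baseline is lower than the uniform one, so a model should beat the prior baseline before its loss is taken as evidence of learning.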

How is the cross entropy loss gradient computed?

The cross entropy loss gradient with respect to each output logit equals the softmax probability for that class minus the one-hot label entry. This identity is the central reason softmax and cross entropy are paired together in nearly every classifier. The simplification falls out of the chain rule when you treat softmax-then-cross-entropy as one operation. The bounded form of the gradient stabilizes training and prevents catastrophic updates from any single example. Implementations in PyTorch and TensorFlow use this analytical form for both speed and numerical stability.
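
For readers who want the derivation, here is a short sketch for a single example, assuming logits z, softmax outputs p, and a one-hot label y whose entries sum to 1.

```latex
p_i = \frac{e^{z_i}}{\sum_j e^{z_j}}, \qquad
L = -\sum_k y_k \log p_k
  = -\sum_k y_k z_k + \log \sum_j e^{z_j}
  \quad \text{(using } \textstyle\sum_k y_k = 1 \text{)}

\frac{\partial L}{\partial z_i}
  = -y_i + \frac{e^{z_i}}{\sum_j e^{z_j}}
  = p_i - y_i
```

Each component lies between -1 and 1, which is the bounded form mentioned above. With mean reduction over a batch of N examples, every per-example gradient is additionally divided by N.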

When should I use focal loss instead of cross entropy?

Consider focal loss instead of cross entropy when one class vastly outnumbers another, on the order of 100 to 1 or more as a rough rule of thumb, especially in dense object detection or rare class medical screening. Plain cross entropy can collapse onto the majority class in these regimes and produce low loss values with useless predictions. Focal loss reweights examples by their confidence so that hard misclassified samples dominate the gradient. The recipe in the original RetinaNet paper used gamma equal to 2 and produced a 5.9 AP point COCO improvement. Validate the gamma value on your own holdout set before deploying to production.

What is label smoothing in cross entropy loss?

Label smoothing replaces hard one-hot labels with a softened distribution that mixes the one-hot target with a uniform distribution, so the true class keeps roughly 1 minus epsilon of the probability mass and the rest is spread across classes. In the PyTorch implementation the epsilon mass is spread uniformly over all K classes, which leaves the true class with 1 - epsilon + epsilon/K. Typical values of epsilon are around 0.1 in image classification and 0.05 in language modeling. The change improves calibration by preventing the model from chasing infinite confidence on training examples. The Inception paper reported 0.2 percent top-1 and top-5 gains on ILSVRC 2012 with this setting. PyTorch CrossEntropyLoss exposes label smoothing through a single keyword argument.

Why do frameworks ask for logits instead of probabilities?

Frameworks ask for logits because the fused logit form of cross entropy loss uses the log-sum-exp trick to avoid floating point overflow and underflow. A naive softmax-then-log pipeline can produce NaN gradients silently when extreme logits push exponentials beyond float32 limits. The logit form keeps every intermediate value in a safe range and produces identical math to the textbook version. PyTorch CrossEntropyLoss and BCEWithLogitsLoss both follow this convention. Using the logit form is treated as a hard requirement in production training pipelines.
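
The log-sum-exp trick itself fits in a few lines. The sketch below uses plain Python floats, where overflow raises an exception; in float32 tensor code the same overflow would instead surface as inf or NaN. The function names and the extreme logit values are illustrative assumptions.

```python
import math

def naive_log_softmax(logits):
    """Textbook softmax followed by log; breaks on large logits."""
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [math.log(e / total) for e in exps]

def stable_log_softmax(logits):
    """Subtract the max logit before exponentiating (log-sum-exp trick)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]

logits = [1000.0, 0.0, -1000.0]

try:
    naive_log_softmax(logits)
    naive_ok = True
except (OverflowError, ValueError):
    naive_ok = False   # exp(1000) overflows a double

stable = stable_log_softmax(logits)
print("naive version succeeded:", naive_ok)
print("stable log-probabilities:", stable)
```

Subtracting the maximum does not change the result mathematically, because the shift cancels between the numerator and the denominator of the softmax.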

Can cross entropy loss go negative?

Cross entropy loss cannot go negative when both inputs are valid probability distributions, because the negative log of a probability between zero and one is always nonnegative. The loss reaches zero only when the model places probability one on the correct class for every example. Any imperfect prediction produces a strictly positive value. Negative values usually indicate a code bug, such as passing log probabilities into a function that expects raw probabilities. The mathematical floor of zero is one of the reasons cross entropy is convenient as a training target.

What is the relationship between cross entropy and KL divergence?

Cross entropy of p and q equals the entropy of p plus the KL divergence from p to q. When p is a fixed one-hot label, the entropy term is zero and minimizing cross entropy becomes identical to minimizing KL divergence from the label to the model output. That equivalence is why theoretical work often phrases classification in KL terms while practitioners implement cross entropy. The same identity also underpins knowledge distillation, where a soft teacher distribution replaces the one-hot label. The relationship is fundamental to almost every calibration result in classification.
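
The identity can be verified numerically with a few lines of standard-library Python. The distributions p and q below are made-up examples, and terms where p is zero are skipped following the usual convention that 0 log 0 equals 0.

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]          # soft "true" distribution, e.g. a teacher
q = [0.5, 0.3, 0.2]          # model prediction
one_hot = [0.0, 1.0, 0.0]    # hard label

# H(p, q) = H(p) + KL(p || q)
print(cross_entropy(p, q), entropy(p) + kl_divergence(p, q))

# With a one-hot label the entropy term is zero, so H(p, q) = KL(p || q)
print(cross_entropy(one_hot, q), kl_divergence(one_hot, q))
```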