Introduction
Overfitting vs underfitting in machine learning is the central engineering puzzle that decides whether a trained model will generalize or fail in production. An overfit model memorizes training noise and falls apart on new data, while an underfit model never learns the underlying signal in the first place. The classic scikit-learn documentation example on underfitting vs overfitting shows a degree-1 polynomial missing the curve and a degree-15 polynomial chasing every data point. The 2022 Kaggle State of Data Science survey logged more than 23,000 working practitioners, and overfitting remained the most cited barrier to deployment for tabular and vision models. This guide explains both failure modes, the bias-variance tradeoff that drives them, and the regularization and cross-validation tools that fix them. You will see hands-on scikit-learn code, learning-curve diagnostics, and named case studies from Zillow, IBM, and the COVID radiology literature. By the end you will know how to spot, diagnose, and prevent both overfitting and underfitting in machine learning workflows.
Quick Answers on Overfitting and Underfitting in Machine Learning
What is overfitting in machine learning?
Overfitting in machine learning occurs when a model memorizes training data, including noise, and fails to generalize. Training accuracy stays high while validation accuracy collapses, signalling poor real-world performance.
What is underfitting in machine learning?
Underfitting in machine learning happens when a model is too simple to capture the underlying pattern. Both training and validation errors stay high, indicating insufficient capacity, missing features, or too little training.
What is the difference between overfitting and underfitting?
The difference is the train-validation gap. Overfitting shows a wide gap with low train error and high validation error. Underfitting shows both errors high together because the model never learned the underlying signal.
Key Takeaways
- Overfitting and underfitting in machine learning are diagnosed from the gap between training error and validation error on a learning curve.
- Regularization, dropout, early stopping, and cross-validation are the standard tools that close the gap and restore generalization.
- Hyperparameter tuning can help us deal with both overfitting and underfitting by adjusting model capacity, learning rate, and regularization strength together.
- The bias-variance tradeoff explains why overly simple models underfit while overly complex models overfit on the same training set.
Table of contents
- Introduction
- Quick Answers on Overfitting and Underfitting in Machine Learning
- Key Takeaways
- What Is the Overfitting vs Underfitting in Machine Learning Tradeoff
- Putting Overfitting vs Underfitting in Machine Learning Onto a Learning Curve
- The Bias-Variance Tradeoff That Sits Behind Both Failures
- What Causes Overfitting in Real Training Runs
- What Causes Underfitting and Why It Hides in Plain Sight
- Regularization Methods That Reduce Overfitting Without Hurting Capacity
- How Hyperparameter Tuning Can Help Us Deal With Both Overfitting and Underfitting
- Cross-Validation as the Workhorse Diagnostic for Overfitting
- Detecting Overfitting in Scikit-Learn With Validation Curves
- How to Detect Overfitting in Deep Learning With Early Stopping
- Data-Side Levers: Augmentation, Cleaning, and Class Balance
- Risks of Shipping an Overfit Model to Production
- Ethics of Overfitting to the Majority and Underfitting Minority Groups
- The Future of Generalization: Double Descent, Foundation Models, and AutoML
- Key Insights on Overfitting and Underfitting in Machine Learning
- Comparing Strategies for Overfitting vs Underfitting in Machine Learning at a Glance
- Overfitting and Underfitting Examples From Real Machine Learning Projects
- Case Studies Where Overfitting and Underfitting Changed Outcomes
- Frequently Asked Questions on Overfitting and Underfitting in Machine Learning
What Is the Overfitting vs Underfitting in Machine Learning Tradeoff
Overfitting vs underfitting in machine learning describes two opposite generalization failures caused by mismatched model capacity, where overfit models memorize noise and underfit models miss the signal. Both push validation error above the irreducible Bayes floor any model could reach.
Interactive (AIplusInfo): The Bias-Variance Explorer. Sliders for model capacity and training data size push a fit toward underfitting, a good fit, or overfitting. Outputs: training error (RMSE on noisy sine), validation error (held-out RMSE), and the train-validation gap as the overfitting signal. Source: model behavior pattern based on the scikit-learn underfitting vs overfitting documentation example.
Putting Overfitting vs Underfitting in Machine Learning Onto a Learning Curve
Building on the formal definitions, the fastest way to diagnose overfitting and underfitting in machine learning is to plot training error and validation error against the number of training examples. A wide and persistent gap between the two curves, with training error near zero and validation error stuck high, is the signature of overfitting. A flat pair of curves that converges quickly but plateaus above an acceptable error floor is the signature of underfitting. The scikit-learn library exposes this diagnostic through the learning_curve helper, which trains the estimator on progressively larger subsets and records both errors. Reading the resulting plot takes seconds but tells you whether you need more data, more capacity, or more regularization. Practitioners often pair it with a complementary cross-validation workflow before changing any hyperparameter.
A useful numerical heuristic is to flag any train-validation gap above ten percentage points as overfitting. A joint error above twenty percent on a balanced classification task often signals underfitting. These thresholds vary by domain, dataset noise floor, and class imbalance. A noisy click-prediction dataset can show a five-point gap that still represents serious leakage, while a clean computer vision benchmark may comfortably tolerate two points. The point is that learning curves give you concrete, measurable signals instead of vague intuition about model quality. Teams that adopt them early in their workflow avoid spending weeks chasing fixes that never address the real failure mode.
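The diagnostic described above can be sketched with scikit-learn's learning_curve helper. This is a minimal illustration, not a definitive recipe: the synthetic dataset, the unconstrained decision tree, and the 0.10 gap threshold (taken from the ten-point heuristic above) are all assumptions chosen to make the overfitting signature visible.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (an illustrative assumption).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)

# An unconstrained decision tree is a deliberately high-capacity estimator.
train_sizes, train_scores, valid_scores = learning_curve(
    DecisionTreeClassifier(random_state=0),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
    scoring="accuracy",
)

# Average over the five folds at each training-set size.
train_mean = train_scores.mean(axis=1)
valid_mean = valid_scores.mean(axis=1)

for n, tr, va in zip(train_sizes, train_mean, valid_mean):
    gap = tr - va
    # Ten-point heuristic from the text; the threshold is domain dependent.
    flag = "possible overfitting" if gap > 0.10 else "gap within heuristic"
    print(f"n={n:4d} train_acc={tr:.3f} valid_acc={va:.3f} gap={gap:.3f} ({flag})")
```

If the gap stays wide as the training size grows, more data or more regularization is the likely lever. If both accuracies are low and close together, the model probably needs more capacity.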
The Bias-Variance Tradeoff That Sits Behind Both Failures
Shifting from diagnostics to theory, the bias-variance tradeoff is the formal decomposition that explains why overfitting and underfitting exist at all. Expected test error decomposes into bias squared, variance, and an irreducible noise term that no model can ever reduce. Bias measures how far the average prediction sits from the true target, while variance measures how much the prediction wobbles across different training samples. A high-bias model is too rigid to track the signal, which is exactly what underfitting looks like in practice. A high-variance model is too flexible and starts memorizing the noise, which is exactly what overfitting looks like in practice.
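Written out for squared error, the decomposition takes the following standard form. The notation is an assumption introduced here for illustration: the target is $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, $\hat f$ is the model fitted on a random training sample, and the expectation is taken over training samples and noise.

```latex
\mathbb{E}\big[(y - \hat f(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat f(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[\big(\hat f(x) - \mathbb{E}[\hat f(x)]\big)^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

The first term is large for rigid models (underfitting), the second is large for flexible models (overfitting), and the third is the floor that no choice of model can lower.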
The classical picture of this tradeoff is a U-shaped curve where total error first drops as capacity grows and then rises again past a sweet spot. That sweet spot is where bias and variance are balanced against each other for the given dataset size. Statistician Trevor Hastie and his coauthors documented this decomposition extensively in The Elements of Statistical Learning, which remains a standard reference for the math behind the tradeoff. Their bias-variance plots make it clear that no single algorithm dominates across all data regimes. The dataset itself decides where the sweet spot sits, so engineers must measure rather than guess.
Beyond the textbook view, the tradeoff drives every practical regularization choice you make on a real project. L2 weight decay nudges a model toward lower variance at the cost of slightly higher bias, which often improves validation error overall. Cross-validation tells you exactly how that tradeoff lands on your particular task without requiring algebra. Engineers tune capacity, regularization, and data size together until validation error stops decreasing across folds. That iterative dance is what separates a polished production model from a fragile one that overfits at the first sign of drift.
Modern deep learning complicates the classical story because very wide overparameterized networks sometimes show double descent, a phenomenon where test error drops a second time past the interpolation threshold. Mikhail Belkin and colleagues documented this effect in their 2019 paper, where it appears across linear regression, random forests, and neural networks alike. The classical U-curve still describes most small and mid-sized models accurately enough for everyday work. Practitioners now use both lenses together, treating the bias-variance tradeoff as the default mental model and double descent as an exception to keep in mind. Reading a learning curve remains the cheapest way to find out which regime your current model is in.
What Causes Overfitting in Real Training Runs
Turning to the practical causes, overfitting in machine learning is rarely a single mistake and is usually the result of a chain of small decisions that compound over a project. The most common driver is using a model with too many parameters relative to the size of the training set. A deep neural network with millions of weights trained on a few thousand labeled rows will almost always memorize the data. Adding too many engineered features without any regularization has the same effect and is easy to miss in tabular pipelines. Engineers who reuse the same test set across many experiments also leak signal back into training without realizing it.
Data leakage is the silent overfitter that haunts production teams more than any architecture choice. Including future information in features, mishandling time-series splits, or letting target-correlated identifiers slip into inputs are classic patterns. The walkthrough on how data labeling drives model performance shows how subtle annotation choices can preload a model with answers. A 2021 Nature Machine Intelligence review led by Michael Roberts examined 62 COVID-19 imaging papers and found that none of the models was fit for clinical use because of methodological flaws and underlying biases, with overfitting and data leakage among the recurring problems. Those models looked great on paper and did not hold up outside the datasets they were built on.
Training for too many epochs on the same data also produces overfitting, especially when no early stopping or validation-based checkpointing is used. Each pass through the data lets the model lock more tightly onto the specific noise pattern in that sample. Repeated hyperparameter searches against a single validation set effectively let the model fit that set too. Teams should rotate validation splits, freeze a final holdout, and resist the urge to tune until the validation curve looks artistic. Each of these habits closes one of the many doors through which overfitting enters a project.
What Causes Underfitting and Why It Hides in Plain Sight
Stepping back from overfitting, underfitting is often the quieter failure that costs teams real money before anyone notices it. The most direct cause is choosing a model class that is too simple for the task, such as a linear regression for a clearly nonlinear relationship. Skipping feature engineering on tabular data is another common driver because raw inputs rarely expose the structure that drives accurate predictions. Training for too few epochs or stopping early at the wrong point can also leave a deep model underfit. Excessive regularization, especially aggressive L2 weight decay or very high dropout rates, can pull a perfectly capable model into underfitting territory.
Underfitting hides because the resulting model still produces predictions that look plausible and ship without obvious errors. Training accuracy may sit at sixty or seventy percent, which feels reasonable until you compare against a stronger baseline. Many product teams have shipped underfit recommender or fraud systems for months before benchmarking against a deeper alternative. The fix is rarely glamorous and usually involves adding capacity, adding features, or reducing regularization step by step. A short experiment with a baseline like XGBoost on the same dataset will often reveal whether the current model is leaving accuracy on the table.
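A baseline check of this kind might look like the sketch below. The text names XGBoost; to keep the example dependency-free, scikit-learn's GradientBoostingRegressor is swapped in as the stronger baseline. The make_friedman1 dataset is also an assumption, chosen because its target is nonlinear in the inputs.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Nonlinear synthetic regression task (illustrative assumption).
X, y = make_friedman1(n_samples=500, noise=1.0, random_state=0)

models = {
    "linear": LinearRegression(),                       # current, possibly underfit model
    "gbr": GradientBoostingRegressor(random_state=0),   # higher-capacity baseline
}

# Compare mean cross-validated R^2 on identical folds.
cv_r2 = {}
for name, model in models.items():
    cv_r2[name] = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name:>6}: mean CV R^2 = {cv_r2[name]:.3f}")
```

A clear margin in favor of the higher-capacity baseline suggests the simpler model is leaving accuracy on the table, which is the underfitting signal the paragraph describes.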
Regularization Methods That Reduce Overfitting Without Hurting Capacity
Beyond simply cutting model size, modern regularization gives you a way to keep capacity high while still controlling overfitting in machine learning systems. L2 weight decay adds a penalty proportional to the squared magnitude of the weights to the loss function, pulling them toward zero. L1 penalties produce sparser solutions and are useful when you suspect many features are irrelevant. L2 shrinks coefficients without removing them, while L1 can drive some of them exactly to zero; in both cases the model stays expressive where the data supports it while variance drops. Engineers select the regularization strength through cross-validation, often sweeping over orders of magnitude before settling on a value.
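The different behavior of the two penalties can be seen in a short sketch. The dataset, with 5 informative features out of 30, and the penalty strength alpha=1.0 are assumptions picked for illustration; in practice alpha would come from cross-validation.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 30 features, only 5 of which carry signal (illustrative assumption).
X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=10.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks every coefficient
lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty: can zero coefficients out

# Count coefficients that are exactly zero under each penalty.
ridge_zeros = int(np.sum(ridge.coef_ == 0.0))
lasso_zeros = int(np.sum(lasso.coef_ == 0.0))

print(f"Ridge: {ridge_zeros} of {ridge.coef_.size} coefficients are exactly zero")
print(f"Lasso: {lasso_zeros} of {lasso.coef_.size} coefficients are exactly zero")
```

Ridge keeps every feature with a smaller weight, while Lasso performs a form of feature selection by setting some weights exactly to zero.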
Dropout extends the regularization toolkit into deep neural networks by randomly zeroing activations during training. Nitish Srivastava and coauthors introduced the technique in their 2014 JMLR paper, which reported lower test error on standard benchmarks, for example a drop from 1.60 percent to 1.25 percent on MNIST. The intuition is that the network learns redundant pathways because no single neuron can be counted on, which forces robust features. Transformer training commonly uses dropout rates between 0.1 and 0.3 across attention and feedforward sublayers. The guide on batch normalization for stable training explains how related techniques work alongside dropout to stabilize learning.
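The mechanism can be sketched in a few lines of NumPy. This shows the "inverted dropout" variant that rescales surviving activations during training so that inference needs no adjustment; it is a common formulation rather than necessarily the exact one in the original paper. The function name and the toy activations are hypothetical.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero a random fraction `rate` of units, rescale the rest."""
    if not training or rate == 0.0:
        return activations  # inference: pass activations through unchanged
    keep_prob = 1.0 - rate
    # Each unit is kept independently with probability keep_prob.
    mask = rng.random(activations.shape) < keep_prob
    # Dividing by keep_prob keeps the expected activation unchanged.
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
h = np.ones((1000, 64))                                  # toy layer activations
h_train = dropout(h, rate=0.2, rng=rng)                  # training-time behavior
h_eval = dropout(h, rate=0.2, rng=rng, training=False)   # inference-time behavior

print(f"fraction zeroed in training: {(h_train == 0).mean():.3f}")
print(f"mean activation in training: {h_train.mean():.3f}")
print(f"mean activation at inference: {h_eval.mean():.3f}")
```

Because a different random mask is drawn on every forward pass, no unit can rely on any specific other unit being present, which is the source of the regularizing effect.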
Early stopping is a deceptively simple regularizer that monitors validation loss and halts training when it stops improving. It is essentially free, requires no architecture changes, and is supported by every major framework out of the box. Combined with a learning-rate schedule, early stopping often eliminates the need for heavy explicit regularization on medium-sized datasets. The technique pairs well with checkpointing so you can recover the best-performing weights even if later epochs drift upward. Many practitioners now run early stopping by default and only add L2 or dropout when validation loss still plateaus too high.
Data-augmentation regularization works at the input layer by generating synthetic variations of each training example. Image pipelines apply random crops, flips, and color jitter, while text pipelines add paraphrases and synonym swaps. Augmentation effectively grows the training set without new labels and is one of the cheapest ways to fight overfitting in vision tasks. The primer on L0, L1, L2, and L-infinity norms covers the math behind the penalties that pair naturally with these techniques. Combining augmentation with weight decay and dropout typically delivers the strongest validation gains on convolutional and transformer image models.
How Hyperparameter Tuning Can Help Us Deal With Both Overfitting and Underfitting
Moving on to systematic tuning, hyperparameter tuning can help us deal with both overfitting and underfitting in machine learning by treating capacity, regularization, and optimization as one joint search space. Hyperparameters set the size of the model, the strength of regularization, the learning rate, and the optimizer choice. When all of these are tuned together, a search can move smoothly between underfit and overfit regions and land in the good-fit zone. Grid search and random search are the simplest implementations and remain useful for small spaces. Bayesian optimization, evolutionary search, and bandit methods like Hyperband scale better on large search spaces and modern hardware.
The scikit-learn library ships with several primitives that make this workflow straightforward to script. The classes GridSearchCV and RandomizedSearchCV run cross-validated searches and report the best parameter combination across folds. The guide on XGBoost for tabular learning shows how tree-based models pair with these tuners on real datasets. Bayesian-style optimization libraries like Optuna and Hyperopt can cut search budgets by up to an order of magnitude on neural network projects. The 2019 Optuna paper from Akiba and colleagues reported between two and ten times speedups on standard benchmarks compared to random search.
Treating hyperparameter tuning as a research loop rather than a one-shot script is what separates strong teams from average ones. Strong teams record every trial in an experiment tracker, plot validation error against each parameter, and watch for signs of overfitting to the validation set itself. They also hold out a final test split that is touched only after tuning is complete to get an honest estimate of generalization. Tuning without this discipline often produces a model that looks excellent on the validation set and underperforms in production. Hyperparameter tuning can help us deal with both overfitting and underfitting, but only when paired with rigorous validation strategy and honest reporting of results.
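The discipline described above, a cross-validated search on development data plus a final holdout touched once, can be sketched with GridSearchCV. The dataset, the Ridge estimator, and the alpha grid are assumptions for illustration; the same structure applies to larger search spaces.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_regression(n_samples=400, n_features=20, noise=12.0, random_state=0)

# Freeze a final holdout before any tuning happens.
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Cross-validated search over regularization strength on development data only.
search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": np.logspace(-3, 3, 7)},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_dev, y_dev)

# With a scoring argument set, search.score returns that metric (negative MSE here).
cv_mse = -search.best_score_
holdout_mse = -search.score(X_test, y_test)

print(f"best alpha:  {search.best_params_['alpha']:g}")
print(f"CV MSE:      {cv_mse:.2f}")
print(f"holdout MSE: {holdout_mse:.2f}")
```

A holdout error far above the cross-validated error would hint that the search has overfit the validation folds, which is the failure the paragraph warns about.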
Cross-Validation as the Workhorse Diagnostic for Overfitting
Beyond simple train-test splits, cross-validation is the most reliable diagnostic for detecting overfitting and underfitting across small and medium datasets. K-fold cross-validation partitions the data into K equal folds and trains the model K times, holding out a different fold for validation in each round. The average validation error across folds gives a low-variance estimate of how the model will behave on new data. A standard choice is K equal to five or ten, balancing compute cost against estimate stability. Stratified K-fold preserves class proportions across folds, which matters for imbalanced classification.
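As a rough sketch of the K-fold procedure above, the snippet below runs stratified five-fold cross-validation on an imbalanced synthetic dataset. The 90/10 class weights, the logistic regression model, and the balanced-accuracy metric are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Imbalanced binary task, roughly 90 percent negatives (illustrative assumption).
X, y = make_classification(n_samples=600, n_features=10, weights=[0.9, 0.1],
                           random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# One validation score per fold; the mean is the cross-validated estimate.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="balanced_accuracy")

# Stratification keeps the positive rate nearly identical in every validation fold.
fold_positive_rates = [float(y[test_idx].mean()) for _, test_idx in cv.split(X, y)]

print("fold scores:        ", [round(float(s), 3) for s in scores])
print("mean score:         ", round(float(scores.mean()), 3))
print("fold positive rates:", [round(r, 3) for r in fold_positive_rates])
```

The spread of the fold scores is worth reading alongside the mean, since a wide spread means the estimate itself is unstable.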
Cross-validation is especially useful when you want to compare two models or two hyperparameter settings without spending a fresh test set. The average performance across folds is honest enough to drive decisions during development. Time-series tasks need a different scheme called walk-forward validation that respects temporal order. On temporal data, standard K-fold lets the model train on observations that come after the ones it is validated on, producing dangerously optimistic results on forecasting problems. The deep dive on using cross-validation to reduce overfitting covers nested cross-validation for joint model selection and hyperparameter tuning in scientific pipelines.
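The difference between the two schemes can be checked directly on fold indices. This sketch uses scikit-learn's TimeSeriesSplit as one implementation of walk-forward validation; the 20-row toy series, where the row index stands in for time, is an assumption.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # row index doubles as the time step

# Walk-forward: every training window ends before its validation window starts.
tscv = TimeSeriesSplit(n_splits=4)
walk_forward_ok = all(
    train_idx.max() < test_idx.min() for train_idx, test_idx in tscv.split(X)
)
for train_idx, test_idx in tscv.split(X):
    print(f"train 0..{train_idx.max():2d} -> validate {test_idx.min()}..{test_idx.max()}")

# Standard K-fold: at least one fold trains on rows later than its validation rows.
kfold = KFold(n_splits=4, shuffle=True, random_state=0)
kfold_leaks = any(
    train_idx.max() > test_idx.min() for train_idx, test_idx in kfold.split(X)
)

print("walk-forward respects time order:", walk_forward_ok)
print("standard K-fold trains on the future:", kfold_leaks)
```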
Detecting Overfitting in Scikit-Learn With Validation Curves
Moving on to hands-on tooling, the scikit-learn validation curve is the canonical way to detect overfitting and underfitting while sweeping a single hyperparameter. The helper validation_curve trains the model across a range of parameter values and records training and validation scores. Plotting both scores against the parameter exposes the bias-variance tradeoff: the validation score traces an inverted U, or the classic U-shape if you plot error instead, with its best value where capacity is balanced. Training score climbing while validation score falls is the unmistakable visual signature of overfitting. Both curves staying low and flat together signals underfitting and a need for more capacity.
The pattern works for any single-parameter sweep, such as maximum tree depth, regularization strength, polynomial degree, or hidden unit count. The official scikit-learn documentation includes a polynomial-regression example that walks through a degree-1 model that underfits, a degree-4 model that fits well, and a degree-15 model that overfits dramatically. The example pairs well with the broader overview of common machine learning algorithms because it generalizes across estimators. Running this diagnostic before any heavy tuning saves hours of expensive grid-search compute. Engineers should treat validation curves as the first line of defense before reaching for more elaborate experiments.
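The polynomial sweep can be reproduced in compact form. The sketch below follows the general setup of the scikit-learn documentation example (30 noisy samples of a cosine, ten-fold cross-validation), but it is a simplified reconstruction rather than the official script, and it skips the plotting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# 30 noisy samples of a cosine curve on [0, 1].
rng = np.random.RandomState(0)
X = np.sort(rng.rand(30))
y = np.cos(1.5 * np.pi * X) + rng.randn(30) * 0.1

cv_mse = {}
for degree in (1, 4, 15):
    # Polynomial feature expansion followed by ordinary least squares.
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    scores = cross_val_score(model, X[:, np.newaxis], y,
                             cv=10, scoring="neg_mean_squared_error")
    cv_mse[degree] = -scores.mean()
    print(f"degree={degree:2d} cross-validated MSE={cv_mse[degree]:.3e}")
```

The middle degree should show the lowest cross-validated error, with the linear model underfitting and the degree-15 model overfitting.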
Here is a minimal scikit-learn script that demonstrates the workflow end to end with a Ridge regression sweep over the alpha penalty. It generates a synthetic regression dataset with controlled noise so the curves stay easy to read. The validation_curve helper runs five-fold cross-validation across seven log-spaced alpha values automatically. Mean training and validation mean squared error print at each alpha, giving the reader an immediate sense of overfitting and underfitting behaviour. A real production script would also plot these scores and pick the alpha that minimizes validation error consistently across folds. The same template adapts to other estimators such as an SVC, RandomForest, or GradientBoosting model by swapping the estimator, the param_name, and, for classifiers, the scoring metric.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import validation_curve
import numpy as np

# Synthetic regression data with controlled noise
X, y = make_regression(n_samples=400, n_features=20, noise=12.0, random_state=0)

# Seven log-spaced alpha values from 1e-3 to 1e3
param_range = np.logspace(-3, 3, 7)

# Five-fold cross-validated sweep; scores come back as negative MSE
train_scores, valid_scores = validation_curve(
    Ridge(),
    X, y,
    param_name="alpha",
    param_range=param_range,
    cv=5,
    scoring="neg_mean_squared_error",
)

# Average across folds and negate to report positive MSE per alpha
for alpha, tr, va in zip(param_range, train_scores.mean(axis=1), valid_scores.mean(axis=1)):
    print(f"alpha={alpha:.4f} train_mse={-tr:.2f} valid_mse={-va:.2f}")
How to Detect Overfitting in Deep Learning With Early Stopping
Building on the scikit-learn workflow, deep learning practitioners detect overfitting in real time by watching the validation loss curve during training and stopping when it stops improving. The Keras and PyTorch ecosystems both ship early-stopping callbacks that monitor a chosen metric and halt training after a configurable patience window. A typical patience value is five to twenty epochs, depending on dataset size and learning-rate schedule. The validation loss usually drops steadily for an initial run of epochs, plateaus, and then starts to rise as the network begins to overfit. That inflection point marks the best checkpoint to keep and ship.
Pairing early stopping with model checkpointing protects you from accidentally keeping a worse-performing later epoch. Keras and PyTorch Lightning expose a ModelCheckpoint callback that writes the best weights to disk so they survive a long run. The primer on cross-entropy loss and its role in training explains why monitoring loss rather than accuracy gives a smoother signal for early stopping. On large foundation models, teams also monitor downstream evaluation metrics in addition to validation loss because the two can decouple. The PyTorch loss functions guide covers the API choices that make this monitoring straightforward.
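The patience logic behind these callbacks can be sketched without any framework. The helper below is a hypothetical illustration, not the Keras or PyTorch Lightning implementation, and the validation-loss sequence is made up to show the drop, plateau, and rise pattern described above.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # Improvement: a real training loop would checkpoint the weights here.
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs in a row: stop training.
            return best_epoch, epoch
    # Ran out of epochs before patience was exhausted.
    return best_epoch, len(val_losses) - 1

# Illustrative validation losses: steady drop, plateau, then a rise.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.56, 0.60]
best_epoch, stop_epoch = early_stopping_epoch(val_losses, patience=3)

print(f"best checkpoint at epoch {best_epoch}, training stopped at epoch {stop_epoch}")
```

Here the lowest loss occurs at epoch 3 (zero-indexed), and training halts at epoch 6 after three epochs without improvement. The weights to ship are the ones checkpointed at the best epoch, not the ones in memory when training stops.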
Data-Side Levers: Augmentation, Cleaning, and Class Balance
Beyond architecture and regularization, the data itself is the most powerful lever for fixing both overfitting and underfitting. Collecting more high-quality labeled data is almost always the single best investment for an overfitting problem. The 2017 Hestness paper on neural scaling showed test loss falling as a power law with dataset size across speech, image, and language tasks. Doubling the training set size often beats months of architecture tweaking on validation accuracy. Image augmentation libraries like Albumentations and TorchVision give you cheap synthetic variants when fresh labels are expensive.
Cleaning the data closes the other half of the loop and prevents underfitting by removing label noise that confuses the model. Mislabeled examples force the model toward an impossible target and waste training capacity. Confident-learning libraries can flag suspect labels automatically by analyzing model disagreement across folds. The 2021 Andrew Ng data-centric AI talk argued that label quality often beats model choice on real-world tabular and audio tasks. Teams that audit their labels before retraining usually see immediate gains on validation accuracy.
Class imbalance can mimic both overfitting and underfitting depending on how the metric is computed. Naive accuracy on a ninety-ten split flatters a model that always predicts the majority class. The primer on classification and regression trees explains how class weights and balanced metrics like F1 or AUC restore an honest evaluation. Resampling, focal loss, and synthetic minority oversampling techniques like SMOTE all help on imbalanced datasets. Teams should pick a metric that reflects the business cost of false positives and false negatives before tuning anything else.
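The ninety-ten case is easy to verify numerically. In this small sketch the labels and the always-majority predictor are constructed by hand, so the numbers illustrate the metric behavior rather than any real model.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# 90 negatives and 10 positives, scored against a model that always predicts 0.
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)               # flattered by the majority class
bal_acc = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recall
f1 = f1_score(y_true, y_pred, zero_division=0)     # positive class is never predicted

print(f"accuracy={acc:.2f} balanced_accuracy={bal_acc:.2f} f1={f1:.2f}")
```

Accuracy comes out at 0.90 while balanced accuracy is 0.50 and F1 is 0.00, which is why the choice of metric has to come before any tuning.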
Risks of Shipping an Overfit Model to Production
For teams that ignore the diagnostics, the real cost of overfitting and underfitting shows up only after a model reaches paying customers. An overfit fraud-detection model might flag legitimate transactions at high rates within the first week of deployment. An underfit medical screening model might miss the signs of disease the marketing brochure promised to catch. Both failures undermine trust and can trigger regulatory consequences in finance, healthcare, and hiring. The Zillow Offers shutdown in 2021 wiped out hundreds of millions of dollars and demonstrated what happens when a pricing model fails to generalize. Bad model decisions also produce hidden second-order costs in customer churn, support volume, and brand damage.
Calibration drift is a subtle production risk that turns even a well-tuned model into an overfit liability over time. The data distribution at inference time slowly diverges from the distribution the model saw during training. Without retraining, the model continues to emit confident predictions on inputs that no longer match its experience. Monitoring frameworks like Evidently and WhyLabs surface this drift through statistical distance metrics on feature and label distributions. Engineers should set alert thresholds tied to business KPIs rather than abstract distance numbers, so they catch drift before it damages outcomes.
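A statistical distance check of the kind these monitoring tools run can be sketched with a two-sample Kolmogorov-Smirnov test from SciPy. This is a stand-in for illustration, not the Evidently or WhyLabs API, and the simulated feature values and the 0.5 shift are assumptions.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# One numeric feature: the training distribution versus two live samples.
train_feature = rng.normal(0.0, 1.0, 5000)
live_same = rng.normal(0.0, 1.0, 5000)       # no drift
live_shifted = rng.normal(0.5, 1.0, 5000)    # mean has drifted by 0.5

# The KS statistic is the largest gap between the two empirical CDFs.
stat_same, p_same = ks_2samp(train_feature, live_same)
stat_shift, p_shift = ks_2samp(train_feature, live_shifted)

print(f"no drift: KS statistic={stat_same:.3f} p-value={p_same:.3g}")
print(f"drifted:  KS statistic={stat_shift:.3f} p-value={p_shift:.3g}")
```

In production the alert threshold on such a statistic would be calibrated against business KPIs, as the paragraph recommends, rather than fixed at an arbitrary p-value.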
Test-set contamination is another insidious risk that is created during development but shows itself only once a model is deployed and queried at scale. Reusing the same evaluation set across many experiments turns it into a second training set, leaking signal into model selection. The guide on how to defend against adversarial attacks on ML covers related test-time risks that exploit overfit decision boundaries. Engineers should rotate held-out splits, refresh test data periodically, and never tune against the final evaluation set. These disciplines are cheap to adopt and dramatically reduce production surprises.
Reputation and legal risk grow quickly once a production model produces biased or unfair predictions because of overfitting to spurious correlations. The 2019 Apple Card credit-limit controversy showed how an underspecified model can produce defensible-looking outputs that still trigger regulatory investigations. The New York State Department of Financial Services launched a public inquiry within days of the first complaints. Engineers and product owners share responsibility for surfacing these risks during development, not after deployment. Strong validation, fairness audits, and slow rollouts catch most of these failures before they reach customers.
Ethics of Overfitting to the Majority and Underfitting Minority Groups
Looking past pure engineering, the ethics of overfitting and underfitting surface most starkly when models behave unevenly across demographic subgroups. A model trained mostly on data from one majority group can overfit to that group while underfitting on minorities who appear in the long tail. Joy Buolamwini and Timnit Gebru documented this pattern in their 2018 Gender Shades audit, finding face classifier error gaps above thirty percentage points across skin tones. Their findings forced major vendors to retrain and rebalance their datasets. Subgroup audits are now a standard part of responsible deployment for any model that touches people.
Mitigating subgroup overfitting and underfitting requires both technical and process changes that no single algorithm can deliver alone. Engineering teams should evaluate accuracy, calibration, and false-positive rates separately for each protected group. The overview of the dangers of AI bias and discrimination documents the failure modes that have prompted regulatory action across Europe and the United States. Product teams should also document training-data provenance, intended use, and known limitations so reviewers can challenge the model. Fairness toolkits like Fairlearn and AIF360 provide reweighting and constraint algorithms that complement disciplined evaluation. The goal is not perfect equality but transparent measurement and continuous improvement.
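Per-group evaluation needs nothing beyond NumPy to get started. The helper, the toy labels, and the group names below are hypothetical; toolkits like Fairlearn offer richer versions of the same idea.

```python
import numpy as np

def per_group_rates(y_true, y_pred, groups):
    """Compute accuracy and false-positive rate separately for each group."""
    report = {}
    for g in np.unique(groups):
        mask = groups == g
        yt, yp = y_true[mask], y_pred[mask]
        negatives = yt == 0
        report[str(g)] = {
            "n": int(mask.sum()),
            "accuracy": float((yt == yp).mean()),
            # Share of true negatives that were predicted positive.
            "fpr": float(yp[negatives].mean()) if negatives.any() else float("nan"),
        }
    return report

# Toy predictions: group "a" is the majority, group "b" is the minority.
y_true = np.array([1, 0, 1, 0, 1, 0, 1, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 1, 1, 1])
groups = np.array(["a"] * 6 + ["b"] * 4)

report = per_group_rates(y_true, y_pred, groups)
for name, stats in report.items():
    print(name, stats)
```

On this toy data the overall accuracy of 0.70 hides a split of 1.00 for group "a" and 0.25 for group "b", which is the pattern an aggregate metric cannot reveal.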
The Future of Generalization: Double Descent, Foundation Models, and AutoML
Looking ahead, the conversation about overfitting vs underfitting in machine learning is shifting as foundation models, double descent, and AutoML challenge classical intuitions. Massively overparameterized neural networks sometimes show a second descent in test error past the point of perfect training fit. The 2019 Belkin paper introduced this phenomenon and demonstrated it across random forests, neural networks, and linear models in their PNAS publication. Engineers must now decide whether their current model sits in the classical regime or the modern overparameterized regime before reasoning about overfitting. The answer changes which tools and heuristics apply to a given problem.
Foundation models like GPT, Claude, and Llama add another layer because they are pretrained on internet-scale data and then fine-tuned on small task-specific sets. Parameter-efficient fine-tuning methods like LoRA can adapt a model with billions of weights by training only a small fraction of that count, often millions rather than billions of parameters, which reduces overfitting risk on small downstream datasets. The comparison of deep learning versus classical machine learning covers how this paradigm shift changes the regularization story. Underfitting concerns now show up most often in the form of insufficient adapter capacity rather than insufficient base capacity. The mental model has flipped from "architect everything" to "adapt a pretrained giant."
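The parameter savings follow from simple arithmetic. LoRA replaces the update to a d_out by d_in weight matrix with the product of two low-rank factors, so the trainable count per adapted matrix is rank times (d_in + d_out). The hidden size of 4096 and the rank of 8 below are hypothetical values chosen for illustration.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Trainable parameters for one low-rank update B @ A on a d_out x d_in weight.

    A has shape (rank, d_in) and B has shape (d_out, rank).
    """
    return rank * (d_in + d_out)

d_in = d_out = 4096   # hypothetical hidden size of one projection matrix
rank = 8              # hypothetical LoRA rank

full = d_in * d_out                               # full fine-tuning of this matrix
lora = lora_trainable_params(d_in, d_out, rank)   # LoRA adapter for the same matrix

print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (rank {rank}):   {lora:,} trainable parameters")
print(f"reduction factor: {full // lora}x")
```

The rank acts as the capacity dial for the adapter: raising it addresses adapter underfitting, while lowering it limits how much a small downstream dataset can be memorized.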
AutoML and neural architecture search now automate much of the capacity-tuning loop, with services like Google Vertex AI AutoML and open-source AutoGluon running the search end to end. These systems try hundreds or thousands of architectures and hyperparameter combinations against cross-validation, surfacing the configurations that generalize best. The primer on the ReLU activation function covers the building blocks AutoML systems compose into final models. The classical bias-variance tradeoff still drives the search, but humans now spend more time on data quality and evaluation than on manual architecture choices. The future of fighting overfitting and underfitting is more automated, more data-centric, and more focused on safe deployment.
Chart (AIplusInfo): Where Regularization Tips a Model From Underfit to Overfit. Validation error and train-validation gap by L2 regularization strength on a Ridge regression sweep. Lower validation error and a smaller gap signal a healthier fit. Source: derived from a Ridge regression sweep over alpha in 1e-3 to 1e3 on synthetic regression data, illustrating the bias-variance curve documented in the scikit-learn underfitting vs overfitting documentation.
Key Insights on Overfitting and Underfitting in Machine Learning
- Srivastava and colleagues reported that adding dropout regularization to deep networks cut MNIST test error from 1.60 percent to 1.25 percent on a standard fully connected baseline.
- Belkin and colleagues showed in their 2019 double descent paper that test error sometimes drops a second time past the interpolation threshold across many models.
- Roberts and colleagues audited 62 pandemic-era COVID detection models in their Nature Machine Intelligence review and found zero clinically usable, mostly because of overfitting and data leakage.
- The 2022 Kaggle State of Data Science survey of more than 23,000 working data professionals ranked scikit-learn the most widely used machine learning library at 81 percent adoption.
- Recht and colleagues rebuilt a fresh CIFAR-10 test set in their 2018 generalization study and found accuracy drops of 4 to 15 percentage points across every studied model.
- Buolamwini and Gebru reported in their Gender Shades audit of commercial face classifiers error rates of up to 34.7 percent on darker-skinned women versus at most 0.8 percent on lighter-skinned men.
- Akiba and colleagues showed in their 2019 Optuna paper that Bayesian search delivered between two and ten times speedups over random search on neural benchmarks.
Pulling these threads together, overfitting and underfitting in machine learning sit at the heart of every modern AI engineering decision from architecture to ethics. Dropout, double descent, and pretraining reshape the classical bias-variance story without erasing it. Field audits of medical and consumer systems show that even models with high benchmark accuracy routinely break in production. Disciplined cross-validation, honest test splits, and subgroup audits are the practices that close the gap between paper results and shipped systems. Hyperparameter optimization libraries cut search budgets but cannot replace the human judgment that picks the right metric and dataset split. Building reliable machine learning today still depends on respecting both failure modes and treating generalization as the central engineering problem.
Comparing Strategies for Overfitting vs Underfitting in Machine Learning at a Glance
The table below maps overfitting vs underfitting in machine learning side by side, showing how training error, validation error, root cause, and primary fix differ across the two failure modes. Use it as a fast diagnostic anchor before running expensive cross-validation sweeps. Read each row left to right, then match your current model behaviour to the column that fits. The good-fit column shows the target state your tuning loop should converge on. The scikit-learn diagnostic row tells you which helper to reach for first.
| Dimension | Overfitting | Underfitting | Good Fit Target |
|---|---|---|---|
| Training error | Very low, often near zero on the training split | High and stuck on both training and validation | Low enough to match the inherent data noise floor |
| Validation error | Much higher than training error, often double or more | High and close to training error | Close to training error and below baseline threshold |
| Train-validation gap | Large gap above ten percentage points on balanced data | Small gap because both errors stay high together | Small gap of one to three percentage points typically |
| Root cause | Too much capacity, too little data, no regularization | Too little capacity, missing features, too much regularization | Balanced capacity, sufficient data, calibrated regularization |
| Primary fix | Add regularization, dropout, augmentation, or more training data | Add features, increase model capacity, train longer | Cross-validated tuning of capacity and regularization jointly |
| Production risk | Drift, calibration failure, biased subgroup outcomes | Quiet underperformance, lost revenue, missed opportunities | Stable predictions matching the validation estimate over time |
| scikit-learn diagnostic | validation_curve shows wide score divergence at high capacity | learning_curve plateaus both scores at low values together | Both curves converge to a low joint error across folds |
| Visual signature | U-shaped validation curve with downturn then sharp upturn | Flat curves stuck at high error across all settings | Validation curve bottoms out near a clear minimum value |
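The first three rows of the table can be turned into a rough triage helper. This is a minimal sketch: the function name `diagnose_fit`, the ten-point gap threshold taken from the table, and the 0.9 ratio used to decide that training error is "stuck near baseline" are all assumptions for illustration, not calibrated rules.

```python
def diagnose_fit(train_error: float, val_error: float, baseline_error: float,
                 gap_threshold: float = 0.10, underfit_ratio: float = 0.9) -> str:
    """Classify a fit as 'overfit', 'underfit', or 'good fit'.

    Errors are fractions in [0, 1], e.g. 0.12 means 12 percent error.
    baseline_error is the error of a trivial model such as majority-class.
    """
    gap = val_error - train_error
    if gap > gap_threshold:
        return "overfit"     # low training error, much higher validation error
    if train_error >= underfit_ratio * baseline_error:
        return "underfit"    # training error barely beats the trivial baseline
    return "good fit"        # small gap and clearly better than baseline

print(diagnose_fit(0.01, 0.25, baseline_error=0.50))  # overfit
print(diagnose_fit(0.48, 0.50, baseline_error=0.50))  # underfit
print(diagnose_fit(0.08, 0.10, baseline_error=0.50))  # good fit
```

Gaps that fall between the table's three-point and ten-point marks are judgment calls, so treat the output as a prompt for plotting learning curves rather than a verdict.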
Overfitting and Underfitting Examples From Real Machine Learning Projects
Looking past textbook polynomial fits, the clearest examples of overfitting and underfitting in machine learning come from public research and well-documented industrial failures. These cases combine real datasets, named teams, and measurable consequences. They show that even sophisticated practitioners with strong tooling can ship models that fail to generalize. Each example illustrates a different setting, from academic benchmark reproduction to public-health prediction. Together they make a strong case for taking validation discipline seriously throughout the model lifecycle. The three examples below are drawn from published research and journalistic investigations, and each carries a specific lesson about diagnosing the failure mode early.
Recht Reproduction Study on CIFAR-10 and ImageNet Test Sets
Benjamin Recht and colleagues at UC Berkeley rebuilt fresh test sets for CIFAR-10 and ImageNet in 2018 to measure how well state-of-the-art classifiers generalized beyond the original benchmarks. They followed the original collection and labeling protocol as closely as possible and ran 30 published models against the new sets. According to their arXiv paper on classifier reproduction, accuracy dropped 4 to 15 percentage points on the new CIFAR-10 set. The new ImageNet set showed drops of 11 to 14 points across the studied models. The work suggested years of incremental benchmark gains had partially overfit to test-set quirks. A clear limitation was that the new sets could never be perfectly identical to the originals, so part of the drop reflected sampling differences. Even so, the consistent direction across all 30 models pointed to a real overfitting effect that the field had not measured before.
Google Flu Trends Underfitting and Overfitting Across Seasons
Google launched Flu Trends in 2008 to predict influenza-like illness rates in real time using search query patterns. The team screened roughly 50 million candidate search queries against CDC data, fit a linear model on the best-correlated terms, and reported strong early agreement with ground-truth flu reports. According to a 2014 Science paper titled The Parable of Google Flu, the system overestimated peak flu by more than 50 percent during the 2012 to 2013 season. The model had implicitly overfit to historical query-flu correlations that broke when search behavior shifted. A documented limitation was that the original feature selection ran one search across millions of candidates, almost guaranteeing spurious correlations. Google ultimately retired the public version in 2015 and shifted to providing raw data to public-health researchers instead of automated predictions.
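The spurious-correlation risk of searching many candidates can be reproduced with pure noise. The sketch below is a toy simulation under assumed sizes (100 weekly observations, 20,000 candidate features) and has no connection to the actual Flu Trends data or code.

```python
import numpy as np

rng = np.random.default_rng(42)
n_weeks, n_candidates = 100, 20_000

# Target and candidates are independent noise: there is no real signal to find.
target = rng.normal(size=n_weeks)
candidates = rng.normal(size=(n_candidates, n_weeks))
half = n_weeks // 2

def corr_with_target(features, y):
    """Pearson correlation of each row of `features` with the vector `y`."""
    f = features - features.mean(axis=1, keepdims=True)
    t = y - y.mean()
    return (f @ t) / (np.linalg.norm(f, axis=1) * np.linalg.norm(t))

# "Feature selection": keep the candidate that best matches the first half.
in_sample = corr_with_target(candidates[:, :half], target[:half])
best = int(np.argmax(np.abs(in_sample)))

# Score that same candidate on the held-out second half.
out_sample = corr_with_target(candidates[best:best + 1, half:], target[half:])[0]

print(f"best in-sample |r|       = {abs(in_sample[best]):.2f}")
print(f"same feature, held-out r = {out_sample:.2f}")
```

The winning feature looks strongly predictive on the data used to select it and loses that apparent strength on held-out weeks, which is the same pattern as a selection step that overfits historical correlations.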
Roberts COVID Radiology Review of 62 Models Lost to Overfitting
Michael Roberts and colleagues at the University of Cambridge audited the COVID radiology machine learning literature in 2021 to assess clinical readiness. They screened 2212 published studies and shortlisted 62 papers with sufficient methodological detail for a deeper review. According to the resulting Nature Machine Intelligence COVID model review, every one of the 62 models was rated unsuitable for clinical use. The flagged causes were data leakage, overfitting, and biased data sources. Many studies trained and tested on so-called Frankenstein datasets stitched together from multiple sources, mixing adult COVID scans with pediatric controls in ways that inflated reported accuracy by double-digit percentages. The limitation acknowledged in the paper was that public datasets at the time were small and heterogeneous, which constrained any group attempting reproducible work. The audit became a reference point for stricter reporting standards in medical AI publishing.
Case Studies Where Overfitting and Underfitting Changed Outcomes
Moving beyond research examples, three industrial case studies show how overfitting and underfitting in machine learning translate into multimillion-dollar consequences for real companies. Each case involves a public failure, an after-the-fact technical autopsy, and clear lessons for engineering teams. The stakes ranged from regulatory inquiries to total shutdowns of high-profile product lines. None of the teams involved were inexperienced, which makes the failures all the more instructive. Reading these cases side by side shows that good engineering practice cannot be optional for any production machine learning system. The three case studies below cover real estate, healthcare, and consumer credit, each at very large scale.
Case Study: Zillow Offers Pricing Model Overfit Causes 304 Million Loss
Zillow faced the problem of pricing homes accurately enough to buy them sight unseen at scale through its Zillow Offers iBuying program. The solution combined a deep machine learning pricing model with operations teams that handled inspections and resale after acquisition. According to the CNBC report on the Zillow Offers shutdown, the company recorded a 304 million dollar inventory writedown in Q3 2021. The board shut down the program shortly afterward as part of an enterprise restructuring announcement. Engineers told reporters that the model had failed to keep up with a fast-moving 2021 housing market and consistently overestimated future sale prices. The program had paid above-market prices for roughly 9800 homes that then could not be sold at projected margins.
The controversy was that the model had effectively overfit to historical price-trajectory patterns that did not survive the unusual COVID housing dynamics. Zillow ultimately cut about 25 percent of its workforce as part of the wind-down, affecting roughly 2000 employees. Analysts argued the team had not stress-tested the model against scenarios outside its training distribution before scaling to thousands of acquisitions per month. The limitation acknowledged by leadership was that no human override loop existed to catch systematic price drift in time. The case became a textbook example of the production cost of an overfit model that performs well on historical backtests. Zillow has since shifted its strategy back to listings and tools rather than direct home acquisition.
Case Study: IBM Watson for Oncology Underfit to Real Cancer Cases
IBM faced the problem of recommending personalized cancer treatments across more than 230 hospitals worldwide using its Watson for Oncology platform. The solution trained the model on a relatively small number of synthetic and real cases curated at Memorial Sloan Kettering Cancer Center. According to a STAT News investigation citing internal IBM documents, internal slides revealed that Watson had given multiple unsafe and incorrect treatment recommendations across the platform. The system frequently underfit by failing to capture the diversity of real cancer presentations that clinicians actually encountered. Hospitals in Florida and elsewhere ended pilot deployments after the recommendations failed to align with established clinical practice.
The measurable impact included contract cancellations and a sale of the entire Watson Health unit to private equity firm Francisco Partners in 2022. Industry analysts estimated IBM had invested several billion dollars in the Watson Health initiative before the divestiture. The limitation was that synthetic and curated training cases could not capture the long-tail diversity of real oncology, which left the model underfit on the hardest decisions. Clinicians argued that the marketing had oversold a system that could only handle relatively standard scenarios safely. The case shows that underfitting can be as commercially destructive as overfitting when the stakes are high and the cases are heterogeneous. The lesson for healthcare AI teams is that domain diversity in training data matters as much as model architecture choices.
Case Study: Apple Card Credit Limit Algorithm Triggers New York Investigation
Apple and Goldman Sachs faced the problem of issuing personalized credit lines on the Apple Card across millions of consumer applicants. The solution combined a proprietary machine learning credit decisioning model with manual review for edge cases. According to a Bloomberg report on the New York State investigation, multiple high-profile customers reported credit limit disparities of more than ten times between spouses with similar financial profiles. The state Department of Financial Services opened a formal inquiry into Goldman Sachs within days of the viral complaints reaching social media. Apple co-founder Steve Wozniak publicly reported a 10x disparity for his own account in November 2019, amplifying the regulatory scrutiny.
The controversy centered on the allegation that the model had effectively overfit to features correlated with gender even though gender was never an explicit input. The 2021 NYDFS report concluded that no intentional discrimination occurred, but it required Goldman Sachs to improve transparency and customer recourse processes. The measurable impact included a public apology, a free credit-limit recalculation program, and updated documentation requirements affecting all Goldman card products. The limitation flagged by regulators was that opaque ML decisions cannot easily be explained to applicants when they ask why they received a given limit or were denied. Goldman has since invested 10 million dollars in explainability tooling and broader fairness audits across the Apple Card pipeline. The case illustrates how overfitting to subtle proxy features can create legal and reputational risk even when the underlying model is technically accurate on average.
Frequently Asked Questions on Overfitting and Underfitting in Machine Learning
Overfitting in machine learning happens when a model memorizes training data including noise and stops generalizing to new inputs. Training accuracy climbs near perfect while validation accuracy stalls or collapses. The classic signature is a wide gap between training and validation error on the same dataset. Practitioners fix it with regularization, more data, or smaller capacity.
Underfitting in machine learning happens when a model is too simple to capture the true signal in the data. Both training and validation errors stay high and refuse to drop with extra training. The fix is usually adding capacity, adding richer features, training for more epochs, or lowering aggressive regularization. Underfit models often look acceptable until benchmarked against stronger baselines.
The difference between overfitting and underfitting is captured by the train-validation gap on a learning curve. Overfitting shows a large gap with low training error and high validation error together. Underfitting shows both errors high and close to each other because the model never learned the signal. Diagnosing the failure correctly determines whether you should add capacity or add regularization next.
Hyperparameter tuning addresses both overfitting and underfitting by jointly adjusting capacity, regularization strength, learning rate, and optimizer choice. A cross-validated search like GridSearchCV scores each candidate setting on held-out folds and selects the one that sits best between the two regimes. Bayesian methods such as Optuna automate the search with fewer trials than naive grid sweeps. The discipline that matters most is keeping a final test set untouched during tuning.
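A minimal sketch of that discipline with GridSearchCV follows. The synthetic dataset, the logistic regression model, and the grid of C values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification data (assumed settings).
X, y = make_classification(n_samples=600, n_features=30, n_informative=8,
                           random_state=0)

# Hold out the final test set BEFORE any tuning happens.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Small C means strong regularization (underfit side); large C means weak (overfit side).
c_values = [0.001, 0.01, 0.1, 1, 10, 100]
grid = GridSearchCV(pipe, param_grid={"logisticregression__C": c_values},
                    cv=5, scoring="accuracy")
grid.fit(X_dev, y_dev)   # cross-validation only ever sees the development split

print("best C:", grid.best_params_["logisticregression__C"])
print("cross-validated accuracy:", round(grid.best_score_, 3))
print("held-out test accuracy:", round(grid.score(X_test, y_test), 3))
```

The test split is scored exactly once, after the search has finished, so it stays an honest estimate of generalization.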
The standard scikit-learn workflow uses validation_curve to sweep one hyperparameter and learning_curve to measure performance against dataset size. Plotting training and validation error against the swept hyperparameter reveals the classic U-shaped validation curve, whose upturn marks overfitting. K-fold cross-validation through cross_val_score gives a stable estimate of generalization error across folds. Engineers usually run all three diagnostics together before tuning any single hyperparameter aggressively.
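The learning_curve and cross_val_score half of that workflow might look like the sketch below. The unpruned decision tree and the noisy synthetic dataset are assumptions picked because they make the train-validation gap easy to see.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, learning_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10 percent label noise (assumed settings).
X, y = make_classification(n_samples=800, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)

model = DecisionTreeClassifier(random_state=0)  # unpruned tree: high capacity

sizes, train_scores, val_scores = learning_curve(
    model, X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="accuracy")

for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={int(n):4d}  train_acc={tr:.3f}  val_acc={va:.3f}  gap={tr - va:.3f}")

cv_scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"5-fold accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```

An unpruned tree fits every training subset perfectly, so a persistent gap between the two columns is the overfitting signature. Limiting max_depth is one way to check whether the gap narrows.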
The bias-variance tradeoff decomposes expected test error into bias squared, variance, and an irreducible noise term. High-bias models underfit because they are too rigid for the data. High-variance models overfit because they memorize noise rather than signal. Tuning capacity, regularization, and dataset size together moves a model toward the sweet spot where the two errors balance.
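For squared-error loss the decomposition can be written out explicitly. The notation here is an assumption introduced for illustration: the data follow y = f(x) + ε with zero-mean noise of variance σ², and \hat{f}_D is the model trained on a randomly drawn training set D.

```latex
\mathbb{E}_{D,\varepsilon}\!\left[\big(y - \hat{f}_D(x)\big)^2\right]
= \underbrace{\big(\mathbb{E}_D[\hat{f}_D(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}_D\!\left[\big(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)]\big)^2\right]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible noise}}
```

The first term is large when the average fitted model sits far from the true function, the second is large when the fitted model swings widely from one training set to the next, and the third cannot be reduced by any choice of model.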
The most reliable fixes for overfitting are gathering more data, applying dropout in the 0.1 to 0.3 range, and using early stopping based on validation loss. L2 weight decay and data augmentation add complementary regularization. Combining these techniques rarely hurts accuracy and often improves generalization by several percentage points. Engineers usually start with early stopping because it is essentially free to enable.
A model can be both at once: globally underfit on average while overfit on specific subgroups or regions of the input space. A face classifier might generalize well on majority groups while overfitting noise on minority subgroups. Subgroup audits catch this pattern when overall accuracy looks fine. Disaggregated metrics across demographic slices are now a standard part of responsible deployment.
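A disaggregated metric takes only a few lines. The helper name `accuracy_by_group` is hypothetical, and the labels, predictions, and group assignments below are invented toy values.

```python
import numpy as np

def accuracy_by_group(y_true, y_pred, groups):
    """Return overall accuracy and a dict of accuracy per group label."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    correct = (y_true == y_pred)
    per_group = {str(g): float(correct[groups == g].mean())
                 for g in np.unique(groups)}
    return float(correct.mean()), per_group

# Toy data: group "a" is the majority, group "b" a small minority.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 1, 0, 0, 0, 1]
groups = ["a"] * 8 + ["b"] * 2

overall, per_group = accuracy_by_group(y_true, y_pred, groups)
print(overall)     # 0.8
print(per_group)   # {'a': 1.0, 'b': 0.0}
```

The overall figure of 80 percent hides a subgroup on which every prediction is wrong, which is exactly the pattern an aggregate metric cannot reveal.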
Cross-validation rotates the validation split across multiple folds so the model is judged on more than one held-out set. The averaged score reduces the variance of the estimate and reveals overfitting that a single split would miss. Stratified K-fold preserves class balance, and walk-forward validation respects temporal order for forecasting. Nested cross-validation supports joint model selection and hyperparameter tuning without leakage.
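The two splitters named above behave as follows on toy data. The 90/10 class balance and the 100-row series are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit

# Imbalanced toy labels: 90 negatives, 10 positives.
y = np.array([0] * 90 + [1] * 10)
X = np.arange(len(y)).reshape(-1, 1)

positives_per_fold = []
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    # Stratification keeps the 10 percent positive rate in every validation fold.
    positives_per_fold.append(int(y[val_idx].sum()))
    print(f"stratified fold {fold}: positives in validation = {positives_per_fold[-1]}")

ordered = []
tscv = TimeSeriesSplit(n_splits=4)
for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
    # Walk-forward: every training index precedes every validation index.
    ordered.append(bool(train_idx.max() < val_idx.min()))
    print(f"walk-forward fold {fold}: train ends at {train_idx.max()}, "
          f"validation covers {val_idx.min()}..{val_idx.max()}")
```

Choosing the splitter is part of the evaluation design: a shuffled split on time-ordered data would let the model train on the future and report an optimistic score.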
There is no universal threshold for how much data prevents overfitting, because the answer depends on model capacity, task difficulty, and label noise. Neural scaling laws show test error falling as a power law with dataset size across many domains. A rough rule for tabular models is at least ten labeled rows per parameter or per feature. Doubling the dataset often beats months of architecture tweaking on validation accuracy.
Data leakage happens when training inputs carry information about the target that would not be available at prediction time. Common patterns include using post-event features, mishandling time-series splits, or letting target-correlated identifiers slip into inputs. The model fits the leaked signal and looks excellent in evaluation while failing in production. Catching leakage requires explicit feature-by-feature review and conservative cross-validation strategies.
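One closely related form of leakage, fitting a preprocessing step on all rows before cross-validation, is easy to demonstrate. The sketch below uses pure noise features and random labels (an assumed setup), so any accuracy well above 50 percent is an artifact of the evaluation rather than real signal.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5000))   # pure noise features
y = rng.integers(0, 2, size=60)   # random labels: true accuracy is about 0.5

# Leaky: choose the 20 "best" features using ALL labels, then cross-validate.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=5).mean()

# Honest: the selector is refit inside each training fold via a pipeline.
pipe = make_pipeline(SelectKBest(f_classif, k=20),
                     LogisticRegression(max_iter=1000))
honest = cross_val_score(pipe, X, y, cv=5).mean()

print(f"leaky CV accuracy:  {leaky:.2f}")
print(f"honest CV accuracy: {honest:.2f}")
```

Wrapping every data-dependent step in a pipeline keeps validation folds out of the fitting process, which is why the honest estimate stays near chance while the leaky one looks impressive.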
Adding more layers does not necessarily cause overfitting, because modern overparameterized networks sometimes show double descent where test error drops again past the interpolation threshold. Adding layers without adding data or regularization often does cause overfitting on small datasets. On large datasets with strong regularization, deeper networks frequently improve generalization. The right answer depends on dataset size and regularization budget together.
Early stopping monitors validation loss during training and halts when it stops improving for a configurable patience window. The technique prevents the network from continuing to fit noise after the useful learning phase. Combined with model checkpointing, it preserves the best-performing weights even if later epochs drift. Early stopping is almost free to enable and is supported by every major framework.
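The patience logic is framework-independent and can be sketched in plain Python. The function name `early_stopping_epoch` and the loss values are invented for illustration. Real frameworks expose the same idea through their own callbacks or estimator options.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once `patience` consecutive epochs fail to improve on the
    best loss by more than `min_delta`. Epochs are 0-indexed.
    """
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # New best: this is where a checkpoint of the weights would be saved.
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch   # ran out of epochs without triggering

# Invented curve: improves, then drifts upward as the model starts to overfit.
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.60, 0.70]
stop, best = early_stopping_epoch(losses, patience=3)
print(stop, best)  # 6 3
```

Training halts at epoch 6, and the checkpoint from epoch 3 is the one to restore, which is why early stopping is usually paired with model checkpointing.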