Introduction

Overfitting is a major problem for machine learning models, and newer data scientists often fall victim to it. So, what is overfitting? Well, we can start with an example. Let’s say we want to predict whether a specific product will sell during the summertime based on previous sales data for a similar product. We start by training the model on a dataset of 10,000 sales records and their outcomes. When we run this model on the original dataset, it predicts the outcomes with 99% accuracy, which makes it seem like a good model. However, when we run it on a new, unseen dataset of sales data, the model is now only 50% accurate. So, what happened? The model does not generalize well from the training data to unseen data. This problem is known as overfitting, and it can be tough for new data scientists to overcome.
Signal vs Noise

When building a predictive model, the “signal” is the true underlying pattern you are trying to learn from the data, while the “noise” is the randomness and other irrelevant variation in the dataset. Let’s say you decide to model height versus age in children. With a large enough sample size, you would easily find a relationship between the two. That relationship is the signal.
On the other hand, if we tried to build the same model with a sample drawn from only one local school, the relationship might not be as evident because of outliers, such as a kid with unusually tall parents, and other randomness. Noise interferes with the signal. A machine learning algorithm that functions well has to separate the signal from the noise. If an algorithm has too many input features or is not regularized properly, it may end up memorizing the noise instead of finding the signal. The model then makes its predictions based on the noise rather than the signal, so it performs very well on the training data but fails on new, unseen data. That is an overfit model.
Goodness of Fit and Underfitting

Before looking at overfitting and its counterpart, underfitting, we need to look at what is meant by goodness of fit. Goodness of fit is a statistical term describing how closely a model’s predicted values match the true observed values. A model is considered overfit if it fits the training data well but fits new, unseen datasets poorly.
To better understand overfitting, we can look at the opposite problem, which is known as underfitting. Unlike overfitting, underfitting occurs when the model is too simple: it is informed by too few features or regularized too heavily, which makes it too inflexible to learn the patterns in the dataset.
There is a famous concept called the bias-variance tradeoff, which states that simple learners tend to have less variance in their predictions but are more biased toward wrong outcomes, while complex learners tend to have more variance in their predictions. Keep in mind that variance here measures how much a model’s predictions would change if it were trained on a different sample of the data. Low variance is ideal because the model behaves consistently from sample to sample; high variance means its predictions are less consistent, so they are harder to trust on new data.
Both bias and variance are forms of prediction error in machine learning, and they affect each other: trying to reduce the error from bias can increase the error from variance, and vice versa. This inverse relationship is a key concept in machine learning algorithms.
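To make the tradeoff concrete, below is a minimal sketch, assuming Python with NumPy and scikit-learn and using synthetic data invented purely for illustration. A degree-1 polynomial underfits (high bias), while a degree-15 polynomial chases the noise and overfits (high variance):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic data: a smooth signal plus random noise
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 60))[:, None]
y = np.sin(2 * np.pi * X.ravel()) + rng.normal(scale=0.3, size=60)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    # The high-degree fit drives training error down while test error climbs
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")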
Detecting Overfitting in Machine Learning
Now that we know about overfitting, how do we detect it? Well, the short answer is that we can’t, at least not directly: we don’t know how well our model will perform on new data until we test it. To address this, data scientists split the initial dataset into a training set and a test set. The training set is used to train the model, while the test set serves as new, unseen data on which to evaluate it.
With this method we can get a general approximation of how well our model will do on new data. If our model performs much better on the training set than on the test set, then there is a high probability that it is overfitting.
For example, if the model we created was 99 percent accurate on the training data but only 50 percent accurate on the test data, we can assume that our model is overfit. Another way to detect overfitting is to start with a very simple model. Keep in mind that a simple model will have high bias and low variance. This simple model can then serve as a reference point as more complex algorithms are tried. Python’s scikit-learn library makes this easy: it ships baseline estimators such as DummyClassifier for exactly this purpose.
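As a rough sketch of this workflow, assuming scikit-learn with its bundled breast-cancer toy dataset (chosen here only for illustration), we can compare a dummy baseline against an unconstrained decision tree:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.dummy import DummyClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Baseline: always predicts the most frequent class (high bias, low variance)
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# An unpruned decision tree can memorize the training data
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

print("baseline test accuracy:", baseline.score(X_test, y_test))
print("tree train accuracy:   ", tree.score(X_train, y_train))  # typically 1.0
print("tree test accuracy:    ", tree.score(X_test, y_test))    # lower than train

A large gap between the tree’s training and test accuracy is the telltale sign of overfitting described above.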
Preventing Overfitting in Machine Learning
Now that we have some ideas on how to detect when our model is overfit, we still need a way to prevent it from happening in the first place. Fortunately, there are a few options for preventing overfitting.
The most popular of these is known as cross-validation. The basic idea of cross-validation is to use the initial training data to generate multiple mini train-test splits, which are then used to tune the model being built. In standard k-fold cross-validation we partition the data into folds, which can be thought of as subsets of the data; a k-fold cross-validation uses k such subsets. The algorithm is then trained on k-1 folds while the remaining fold is used as the test set, which is why it is sometimes also referred to as the holdout fold. This is repeated until every fold has served as the holdout once.
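Here is a minimal sketch of that loop, again assuming scikit-learn; the dataset and estimator are illustrative placeholders:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

scores = []
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, holdout_idx in kf.split(X):
    # Train on k-1 folds, then score on the remaining holdout fold
    model = make_pipeline(StandardScaler(), LogisticRegression())
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[holdout_idx], y[holdout_idx]))

print("per-fold accuracy:", np.round(scores, 3))
print("mean accuracy:", np.mean(scores))

In practice, scikit-learn’s cross_val_score wraps this whole loop in a single call.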
Cross-validation also allows you to tune hyperparameters using only the original training set. Keep in mind that a hyperparameter, in the context of machine learning, is a parameter whose value is set before training and controls the learning process. Tuning hyperparameters this way lets you keep the test set truly unseen until you choose your final model. Since this is important, let’s break choosing hyperparameters down into a step-by-step method for easier understanding.
Choosing Hyperparameters with Cross-Validation
In the following example we will use 10-fold cross-validation. The process has seven steps:
1. Split the training data into 10 folds.
2. Choose a set of hyperparameters from all the sets of hyperparameters you want to consider.
3. Train your model with the chosen set of hyperparameters, but only on the first 9 folds.
4. Evaluate the model on the 10th fold, the holdout fold.
5. Repeat steps 3 and 4, holding out a different fold each time.
6. Aggregate your performance across all 10 folds. This aggregate is the performance metric for the set of hyperparameters you chose.
7. Repeat steps 2 through 6 for all the hyperparameter sets you want to consider.
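In scikit-learn, grid search automates this entire procedure. Below is a minimal sketch; the estimator, parameter grid, and dataset are illustrative assumptions rather than fixed choices:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Every combination of C and gamma is scored by 10-fold cross-validation
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(), param_grid, cv=10)
search.fit(X_train, y_train)  # the search never touches X_test

print("best hyperparameters:", search.best_params_)
print("best mean CV accuracy:", search.best_score_)
print("final test accuracy:", search.score(X_test, y_test))

Note that the search runs on the training set only, so the test set stays unseen for the final evaluation, exactly as described above.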
Conclusion

Overfitting is a major problem in the machine learning world. However, cross-validation is a very clever way to get around it: it reuses the training data by dividing it into parts and cycling through them. The most popular form is k-fold cross-validation. To truly master cross-validation, I recommend getting hands-on practice with it and seeing how it is used to solve real-world problems. Thank you for reading this article.