Introduction: How to Use Linear Regression in Machine Learning?
In statistics and machine learning, linear regression is probably one of the most well-known and well-understood algorithms. We will examine the linear regression algorithm, how it works, and how to use it as efficiently as possible in your machine learning projects. This article covers:
- Why linear regression belongs to both statistics and machine learning.
- The many names by which linear regression is known.
- The representation and learning algorithms used to create a linear regression model.
- How to prepare your data when modeling using linear regression.
What is linear regression?
We need to get accustomed to regression before we can understand linear regression. A regression is a way to model a target value based on independent predictors. The method is primarily used for forecasting and determining cause and effect relationships between variables. Generally, regression techniques differ in terms of the number of independent variables and the type of relationship between the independent and dependent variables.
The linear regression algorithm is designed to find the best values for the coefficients b_0 (the intercept) and b_1 (the slope) of the line y = b_0 + b_1*x. For a better understanding of linear regression, let’s look at what it is used for and what its core concepts are.
What are business applications of linear regression?
Linear regression can be used to solve a variety of business prediction problems including:
Predict future prices/costs.
You can use linear regression to predict future prices and costs. For example, how much will steel cost in six months?
Predict future revenue.
Linear regression can be used to model your data, understand customer acquisition cost and long-term value, and predict revenue.
Compare and understand how your new product line is doing.
There are a number of reasons why linear regression is one of the most widely used algorithms for machine learning, including its effectiveness in answering hard business questions.
Isn’t Linear Regression from Statistics?
As we dive into linear regression details, you may ask yourself why we are investigating this algorithm.
Isn’t it a technique from statistics?
Predictive modeling, more specifically machine learning, is primarily concerned with minimizing the error of a model or making the most accurate predictions possible at the expense of explainability. Applied machine learning uses algorithms from many fields, including statistics, to accomplish its goals.
In statistics, linear regression was developed as a model for understanding the relationship between input and output numerical variables, but it has been borrowed by machine learning. In other words, it is a statistical and machine learning algorithm.
We will now review some of the common names for linear regression models.
Linear regression has many names.
Things can get a little confusing when you begin to learn about linear regression.
This is due to the fact that linear regression has been around for so long (more than 200 years). Every possible angle has been used to study it, and often, each angle has its own new and unique name.
A linear regression model assumes a linear relationship between the input variables (x) and a single output variable (y). More specifically, y can be calculated from a linear combination of the input variables (x).
The simple linear regression method is used whenever there is only one input variable (x) in the equation. It is not uncommon for the literature from statistics to refer to the method as multiple linear regression when there are multiple input variables.
The linear regression equation can be prepared, or trained, from data using different techniques, the most common of which is called Ordinary Least Squares. It is therefore common to refer to a model prepared this way as Ordinary Least Squares Linear Regression or simply as Least Squares Regression.
After we know some terms used to describe linear regression, we will take a closer look at the representation of linear regression.
When and why do you use Regression?
A regression analysis is performed when the dependent variable is continuous and the predictor or independent variables are of any type of data, such as continuous, nominal or categorical. When using regression analysis, you are trying to find the best fit line that shows how the dependent variable is related to the predictors with the least amount of error.
A regression equation combines the independent variables, their coefficients, and an error term to determine the output (dependent) variable. Regression models can be fitted using machine learning methods.
Linear Regression Model Representation
A simple representation makes linear regression an attractive model.
The representation is a linear equation that combines a specific set of input values (x) to produce the predicted output (y) for that set of input values. As such, both the input values (x) and the output value (y) are numeric.
The linear equation assigns one scale factor to each input value or column, called a coefficient and represented by the capital Greek letter Beta (B). One additional coefficient gives the line an extra degree of freedom (e.g. moving up and down on a two-dimensional plot), and it is often known as the intercept or the bias coefficient.
The form of a regression model, for example, in a simple regression problem (a single x and a single y), would be:
y = B0 + B1*x
When there is more than one input (x), the line becomes a plane or a hyperplane in higher dimensions. The representation is therefore the form of the equation and the specific coefficient values (e.g. B0 and B1 in the above example).
It is common to talk about the complexity of a regression model such as linear regression. This refers to the number of coefficients used in the model.
If a coefficient becomes zero, it effectively removes the influence of the input variable on the model and, therefore, from the prediction made by the model (0 * x = 0). Regularization methods change the learning algorithm in order to reduce regression model complexity by reducing the sizes of the coefficients, trying to drive some to zero.
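The representation described above can be sketched in a few lines of Python. This is only an illustration: the function names `predict` and `predict_multi` and all coefficient values are hypothetical, chosen to show the form of the equation and the effect of a zero coefficient.

```python
def predict(x, b0, b1):
    """Simple linear regression: y = B0 + B1 * x."""
    return b0 + b1 * x


def predict_multi(xs, intercept, coefficients):
    """Several inputs: the intercept plus a weighted sum of the inputs."""
    return intercept + sum(b * x for b, x in zip(coefficients, xs))


# Hypothetical coefficients, used only for illustration.
y_single = predict(3.0, b0=1.0, b1=2.0)

# The second coefficient is zero, so the second input cannot
# influence the prediction, whatever its value is.
y_a = predict_multi([3.0, 4.0], 1.0, [2.0, 0.0])
y_b = predict_multi([3.0, 100.0], 1.0, [2.0, 0.0])
print(y_single, y_a, y_b)
```

Both multi-input calls return the same prediction, which is the sense in which a zero coefficient removes an input from the model.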
Let’s review some ways we can learn this representation from data now that we understand the representation used for a linear regression model.
Linear Regression Learning the Model
The process of learning a linear regression model involves estimating the coefficients from the available data to be used in the representation.
The purpose of this section is to give a brief introduction to four techniques for preparing a linear regression model. Despite the fact that this information is not enough to implement them from scratch, it will help to get an understanding of the computations and trade-offs involved.
The model has been studied thoroughly, so there are many more techniques to explore. The Ordinary Least Squares (OLS) method should be noted because it is the most common one used in general. Please take note that Gradient Descent is also one of the techniques that are commonly taught in machine learning classes.
Simple Linear Regression
In simple linear regression, we can estimate the coefficients using statistics when we have a single input.
From the data, you will need to calculate statistical properties such as means, standard deviations, correlations and covariance. To traverse and calculate the statistics, you must have all of the data available.
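As a rough sketch of the statistics-based approach, the slope can be estimated as the covariance of x and y divided by the variance of x, and the intercept from the two means. The helper name `fit_simple` and the data are made up for illustration.

```python
def fit_simple(xs, ys):
    """Estimate (b0, b1) for y = b0 + b1 * x from summary statistics."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Unnormalized covariance and variance; the 1/n factors cancel in the ratio.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    b1 = cov_xy / var_x
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Made-up data that follows y = 1 + 2x exactly.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]
b0, b1 = fit_simple(xs, ys)
print(b0, b1)
```

Note that every data point is visited to compute the means and sums, which is why all of the data must be available.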
Ordinary Least Squares
Ordinary Least Squares (OLS), also known as Ordinary least squares regression or least squared errors regression is a type of linear least squares method for estimating the unknown parameters in a linear regression model.
To estimate the values of the coefficients when we have a number of inputs, we can use Ordinary Least Squares (OLS).
The Ordinary Least Squares procedure seeks to minimize the sum of squared residuals. Given a regression line through the data, we calculate the distance from each data point to the regression line, square it, and add all of the squared errors together. This sum is the quantity that ordinary least squares seeks to minimize.
In this technique, the data is treated as a matrix, and linear algebra operations are used to estimate the optimal values for the coefficients. There is a requirement that all the data be available and there is a requirement that you have enough memory to store the data and perform matrix operations.
If you wish to implement the Ordinary Least Squares algorithm on your own, you usually do so as an exercise in linear algebra. It is more likely that you will be calling a procedure from a library of linear algebra functions. It is very fast to calculate the result with this procedure.
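Calling a linear algebra library for this might look like the sketch below, which assumes NumPy is available and uses its `numpy.linalg.lstsq` routine. The data is made up, and the column of ones is added so that the first coefficient plays the role of the intercept.

```python
import numpy as np

# Made-up data: two inputs per row, generated from y = 1 + 2*x1 + 3*x2.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Prepend a column of ones so the first coefficient acts as the intercept.
X_design = np.column_stack([np.ones(len(X)), X])

# Solve for the coefficients that minimize the sum of squared residuals.
coeffs, residuals, rank, singular_values = np.linalg.lstsq(X_design, y, rcond=None)
print(coeffs)
```

The whole data matrix is held in memory here, which matches the memory requirement described above.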
Gradient Descent
When there are one or more inputs, you can optimize the values of the coefficients by iteratively minimizing the error of the model on your training data.
This operation is called Gradient Descent, and it starts with random values for each coefficient. The sum of the squared errors is calculated for each pair of input and output values. A learning rate is used as a scale factor, and the coefficients are updated in the direction that minimizes the error. The process is repeated until a minimum sum of squared errors is reached or no further improvement is possible.
When using this method, you must select a learning rate (alpha) parameter that determines the size of the improvement step taken on each iteration of the procedure.
The concept of gradient descent is often taught using a linear regression model because it is relatively straight-forward to understand. Typically, it is useful when you have a very large dataset either in terms of its number of rows or its number of columns that may not fit in your memory.
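The procedure can be sketched for a single input as follows. This is a minimal illustration, not a production implementation: the function name, the learning rate, and the number of iterations are assumptions, and the coefficients start at zero rather than at random values so that the result is reproducible.

```python
def gradient_descent(xs, ys, alpha=0.05, epochs=2000):
    """Fit y = b0 + b1 * x by repeatedly stepping against the error gradient."""
    b0, b1 = 0.0, 0.0  # assumption: start at zero instead of random values
    n = len(xs)
    for _ in range(epochs):
        # Prediction error for every (input, output) pair.
        errors = [(b0 + b1 * x) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to b0 and b1.
        grad_b0 = 2.0 / n * sum(errors)
        grad_b1 = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        # Move each coefficient a small step in the direction that lowers the error.
        b0 -= alpha * grad_b0
        b1 -= alpha * grad_b1
    return b0, b1


# Made-up data that follows y = 1 + 2x exactly.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]
b0, b1 = gradient_descent(xs, ys)
print(round(b0, 4), round(b1, 4))
```

This version uses the full dataset on every iteration. For data that does not fit in memory, the same update can be applied to one example or one small batch at a time.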
Regularization
There are extensions of the training of the linear model called regularization methods. They are designed both to minimize the sum of the squared errors of the model on the training data (as in ordinary least squares) and to reduce the complexity of the model (for instance by reducing the number of coefficients or their total absolute size).
The following are two popular examples of regularization procedures for linear regression:
- Lasso Regression: where Ordinary Least Squares is modified to also minimize the sum of the absolute values of the coefficients (called L1 regularization).
- Ridge Regression: where Ordinary Least Squares is modified to also minimize the sum of the squared coefficients (called L2 regularization).
These methods are effective when your input values are collinear and ordinary least squares would overfit your training data.
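To make the effect of the L2 penalty concrete, here is a small sketch of Ridge Regression only, assuming NumPy and using the closed-form ridge solution without an intercept. The helper `ridge_fit`, the penalty strength `lam`, and the data are hypothetical. Lasso has no comparable closed form and is usually fitted with an iterative solver, so it is not shown.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge coefficients (no intercept): (X^T X + lam * I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


# Made-up data with two nearly collinear input columns.
X = np.array([[1.0, 1.01], [2.0, 1.98], [3.0, 3.02], [4.0, 3.99]])
y = np.array([2.0, 4.1, 5.9, 8.0])

ols = ridge_fit(X, y, lam=0.0)    # lam = 0 reduces to ordinary least squares
ridge = ridge_fit(X, y, lam=1.0)  # lam > 0 shrinks the coefficients

print("OLS coefficients:  ", ols)
print("Ridge coefficients:", ridge)
```

With collinear inputs the unpenalized coefficients tend to be large and unstable, while the penalized ones are pulled toward smaller values.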
Having learned some techniques for estimating a linear regression model’s coefficients, let’s examine how to prepare data so that the model works as well as possible.
Preparing Data For Linear Regression
There has been much study on the topic of linear regression, and there is a lot of literature on how your data must be structured in order to make the best use of the model.
As a result, discussions of these expectations and requirements can become quite sophisticated, which can be intimidating. The following can be treated as rules of thumb when using Ordinary Least Squares Regression, which is the most common implementation of linear regression in practice.
Using these heuristics, you can try different preparations of the data and see what works best for your problem.
- Linear Assumption. The linear regression method assumes a linear relationship between input and output. It does not support anything else. Although this seems obvious, it is important to keep in mind when you have a lot of attributes. To make an exponential relationship linear, data may need to be transformed (e.g. log transformation).
- Remove Noise. Linear regression assumes that your input and output variables are not noisy. Consider using data cleaning operations to better expose and clarify the signal in your data. This is most important for the output variable (Y), so remove outliers from it if at all possible.
- Remove Collinearity. If the input variables for your regression are highly correlated, the linear model will overfit your data. Consider calculating pairwise correlations between your input variables and removing the most correlated ones.
- Gaussian Distributions. When your input and output variables follow Gaussian distributions, linear regression makes more reliable predictions. The use of transforms (e.g. log or BoxCox) on your variables might help you make their distributions more Gaussian-looking.
- Rescale Inputs. Standardization or normalization will often make linear regression predictions more reliable.
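A few of these preparation steps can be tried with very little code. The sketch below uses only the standard library; the helpers `standardize` and `pearson` and the data are made up for illustration. It shows a log transform turning an exponential relationship into a linear one, a pairwise correlation check, and standardization.

```python
import math


def standardize(values):
    """Rescale values to zero mean and unit standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


def pearson(a, b):
    """Pearson correlation between two equally long lists of numbers."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Made-up data with an exponential relationship between x and y.
x = [1.0, 2.0, 3.0, 4.0]
y_exp = [math.exp(v) for v in x]

# Linear assumption: a log transform makes the relationship linear.
y_log = [math.log(v) for v in y_exp]
print("correlation before log:", pearson(x, y_exp))
print("correlation after log: ", pearson(x, y_log))

# Rescale inputs: standardized values are centred on zero.
x_scaled = standardize(x)
print("standardized x:", x_scaled)
```

The same `pearson` helper can be applied to every pair of input columns to spot collinear inputs.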
Your model as defined above uses the default values of all parameters.
The cost function allows us to determine the values of b_0 and b_1 that provide the best fit line for the data points. Because we want the best values for b_0 and b_1, we convert this search problem into a minimization problem in which we minimize the difference between the predicted value and the actual value.
This cost function measures the difference that we want to minimize. The difference between the predicted values and the ground truth is the error. To calculate it, we square the error for each data point, sum these squares over all data points, and divide by the total number of data points. In other words, we take the average squared error over all the data points, which is why this cost function is known as the Mean Squared Error (MSE). Using the MSE, we then adjust the values of b_0 and b_1 until the MSE settles at its minimum. Keep in mind that this error is the quantity the linear function is fitted to minimize.
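As a small illustration of the cost function described here, the following sketch computes the mean squared error of a candidate line. The function name and the data are assumptions made for the example.

```python
def mean_squared_error(xs, ys, b0, b1):
    """Average of the squared differences between predictions and actual values."""
    n = len(xs)
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / n


# Made-up data that follows y = 1 + 2x exactly.
xs = [1.0, 2.0, 3.0]
ys = [3.0, 5.0, 7.0]

perfect = mean_squared_error(xs, ys, b0=1.0, b1=2.0)  # the true line: no error
shifted = mean_squared_error(xs, ys, b0=0.0, b1=2.0)  # every prediction is off by 1
print(perfect, shifted)
```

The line that matches the data has a cost of zero, and any other choice of b_0 and b_1 has a higher cost.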
Gradient descent is one of the concepts you need to understand in order to grasp linear regression. It is a method of updating the parameters b_0 and b_1 in order to reduce the cost function (MSE). We start with some values for b_0 and b_1 and then change these values iteratively to reduce the cost.
As an analogy, imagine a U-shaped pit. You stand at the top of the pit and want to reach the bottom, and you can only take discrete steps to get there. Taking small steps will eventually get you to the bottom, but it takes longer. Taking longer steps gets you there sooner, but you might overshoot the bottom and not land exactly on it. In the gradient descent algorithm, the size of the steps you take is the learning rate, and it determines how quickly the algorithm converges to the minimum.
Some cost functions are non-convex, in which case you could settle at a local minimum, but for linear regression the cost function is always convex.
You may be wondering how gradient descent is used to update b_0 and b_1. We update them using gradients calculated from the cost function, and we obtain these gradients by taking partial derivatives of the cost function with respect to b_0 and b_1. A basic knowledge of calculus helps in understanding how the partial derivatives are calculated, but if you do not have it, that is fine; you can take the result as given.
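For reference, the partial derivatives can be written out as follows. This assumes the cost function is the mean squared error exactly as described in the text (an average over n data points, with no extra factor of 1/2), and the symbols J, n, x_i, y_i and \hat{y}_i are notation introduced here.

```latex
% Prediction for data point i and the MSE cost function
\hat{y}_i = b_0 + b_1 x_i
\qquad
J(b_0, b_1) = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2

% Partial derivatives of the cost with respect to each parameter
\frac{\partial J}{\partial b_0} = \frac{2}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)
\qquad
\frac{\partial J}{\partial b_1} = \frac{2}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right) x_i
```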
The partial derivatives are the gradients, and they are used to update the values of b_0 and b_1. The learning rate, called alpha, is a hyperparameter that you must specify. A smaller learning rate is likely to get you closer to the minimum but takes much longer to reach it, while a larger learning rate converges sooner but may overshoot the minimum.
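The trade-off between small and large learning rates can be seen in a short experiment. The sketch below runs the same number of update steps with three hypothetical values of alpha on made-up data and compares the remaining error. The specific values are assumptions chosen so that one is too small, one works well, and one overshoots.

```python
def mse_after_training(alpha, steps, xs, ys):
    """Run gradient descent from (0, 0) and return the final mean squared error."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [(b0 + b1 * x) - y for x, y in zip(xs, ys)]
        # Update both parameters at once: parameter minus alpha times its gradient.
        b0, b1 = (
            b0 - alpha * 2.0 / n * sum(errors),
            b1 - alpha * 2.0 / n * sum(e * x for e, x in zip(errors, xs)),
        )
    return sum(((b0 + b1 * x) - y) ** 2 for x, y in zip(xs, ys)) / n


# Made-up data that follows y = 1 + 2x exactly.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

small = mse_after_training(0.001, 50, xs, ys)     # slow: still far from the minimum
moderate = mse_after_training(0.05, 50, xs, ys)   # close to the minimum after 50 steps
too_large = mse_after_training(0.2, 50, xs, ys)   # overshoots on every step and diverges
print(small, moderate, too_large)
```

In practice the learning rate is usually chosen by trying several values and watching how the cost changes.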
Positive Linear Relationship
A positive linear relationship is one in which the dependent variable increases on the Y-axis as the independent variable increases on the X-axis.
Negative Linear Relationship
A negative linear relationship is one in which the dependent variable decreases on the Y-axis as the independent variable increases on the X-axis.
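The direction of the relationship shows up as the sign of the fitted slope. A minimal sketch with made-up data and a hypothetical `slope` helper:

```python
def slope(xs, ys):
    """Least-squares slope of y against x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


xs = [1.0, 2.0, 3.0, 4.0]
rising = [2.0, 4.1, 5.9, 8.0]   # y goes up as x goes up
falling = [8.0, 5.9, 4.1, 2.0]  # y goes down as x goes up

print("positive relationship, slope > 0:", slope(xs, rising) > 0)
print("negative relationship, slope < 0:", slope(xs, falling) < 0)
```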
By using linear regression, the goal of the algorithm is to get the best values for b_0 and b_1 in order to find the best fitting line. The best fitting line should have the lowest error, which means that the difference between predicted and actual values should be minimized.
Beyond Linear Regression
There are times when linear regression is not appropriate, especially in the case of nonlinear models of high complexity.
Fortunately, there are other regression techniques suitable for the cases where linear regression does not work well. Support vector machines, decision trees, random forests, and neural networks are some of these techniques.
The Python programming language offers many libraries for regression that make use of these techniques. They are primarily open-source libraries and are available for free. In fact, that’s one of the reasons why Python is one of the most popular programming languages for machine learning.
With the package scikit-learn, other regression techniques can be used in much the same way as linear regression. The package includes classes for support vector machines, decision trees, random forests, and many more, each with the methods .fit(), .predict(), .score(), and so on.
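A sketch of that shared interface, assuming scikit-learn is installed. The data is made up, the model settings are arbitrary, and scoring on the training data is done here only to show the method calls, not as a sound evaluation.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Made-up data that follows y = 1 + 2x exactly.
X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
y = [3.0, 5.0, 7.0, 9.0, 11.0, 13.0]

models = {
    "linear": LinearRegression(),
    "svr": SVR(),
    "tree": DecisionTreeRegressor(random_state=0),
    "forest": RandomForestRegressor(n_estimators=10, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X, y)                        # same training call for every model
    scores[name] = model.score(X, y)       # R^2 on the training data
    prediction = model.predict([[7.0]])    # same prediction call for every model
    print(name, scores[name], prediction[0])
```

Because the interface is shared, swapping one regressor for another usually means changing a single line.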
Regression coefficients are extremely important.
What are the benefits of linear regression for deployment in production?
In production data science settings, linear regression is a popular choice because of the benefits it provides:
Ease of use.
The model is easy to implement from a computational perspective. As far as engineering overhead goes, it does not require a lot of technical knowledge either before launch or during maintenance.
Linear regression is simpler to interpret than deep learning models (neural networks). This gives it an advantage over black-box models, which do not explain why the output variable changes in response to an input variable.
Linear regression does not require much computational power, which makes it a good fit for use cases where scaling is expected. It continues to scale well as the volume and velocity of data (big data) increase.
Since linear regression has the advantage of being easy to compute, it can be applied to online settings, as the model can be retrained by the addition of each new example and predictions are generated in near real-time. In contrast, computationally heavy approaches like neural networks or support vector machines are not suitable for real-time applications (or at least are very expensive) since they require a lot of computing resources or lengthy waiting times in order to retrain their models on new data.
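One way to picture this online setting is a model whose coefficients are nudged each time a new example arrives. The sketch below is an assumption about how such an update could look, using a single gradient step per example. The function name, the learning rate, and the simulated data stream are all hypothetical.

```python
def online_update(b0, b1, x, y, alpha=0.01):
    """Adjust the coefficients using a single new (x, y) example."""
    error = (b0 + b1 * x) - y
    return b0 - alpha * 2.0 * error, b1 - alpha * 2.0 * error * x


b0, b1 = 0.0, 0.0

# Simulated stream of examples that follow y = 1 + 2x, arriving one at a time.
stream = [(float(x), 1.0 + 2.0 * x) for x in (1, 2, 3, 4, 5)] * 2000

for x, y in stream:
    b0, b1 = online_update(b0, b1, x, y)  # cheap update, no full retraining

print(round(b0, 4), round(b1, 4))
print("prediction for x = 6:", b0 + b1 * 6.0)
```

Each update touches only two numbers, so a prediction can be served immediately after a new example is absorbed.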
These specific features explain why linear regression is one of the most reliable models for making predictions with machine learning.
Linear regression, alongside logistic regression, is one of the most widely used machine learning algorithms in real production settings.
Linear regression is an algorithm that every machine learning enthusiast should learn, and it is a good starting point for people who are new to machine learning. Despite its simplicity, it is a very useful algorithm.
I would say that the main benefit of linear regression over any other method is that it is simple and intuitive. It is often used because it is easier to understand than nonlinear regression algorithms.
Linear regression is usually the first step when developing a predictive model because it is easy to implement and understand. Plotting the data as a scatter plot makes the relationship between variables easy to see. More complex linear regressions can be simplified by using dummy variables and an appropriate loss function, but their overuse should be avoided.