Linear regression
Linear regression is a common entry point for people studying machine learning. The algorithm is easy to understand and straightforward to implement. In this article we will learn when and how to use linear regression on our data, and also how to optimize it.
When to use linear regression
As the name suggests, linear regression is meant for data in which the output depends linearly on the input. You can still apply it to nonlinear data but, as you will see below, the fit may be poor because a straight line cannot capture nonlinearity.
The problem
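Suppose we have two vectors of data points, $x$ and $y$, where $x \in \mathbb{R}^n$ and $y \in \mathbb{R}^n$. Let's assume here that $x$ and $y$ are linearly correlated and we want to predict $y$ given $x$. We know that the most general form of linear dependence of one variable is given by

$$y = wx + b \tag{1}$$

where $y$ is the output variable, $w$ is the weight (also known as angular coefficient), $x$ is the input variable, and $b$ is the bias (also known as linear coefficient). Remembering that $x$ and $y$ are vectors of size $n$, we can represent equation (1) in the following vectorial form:

$$\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = w \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} + \begin{pmatrix} b \\ \vdots \\ b \end{pmatrix}$$

Representing vectors with bold letters we have:

$$\mathbf{y} = w\mathbf{x} + \mathbf{b}$$

where $\mathbf{b} = (b, \dots, b)$ is the vector whose $n$ components all equal the bias $b$.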
For simplicity, suppose that $\mathbf{x}$ and $\mathbf{y}$ are already the training data, so we don't need to use any subscripts. Denoting by $\hat{\mathbf{y}}$ the vector of predictions and by $\mathbf{y}$ the vector of true values, the objective here is to find $w$ such that the error between $\hat{\mathbf{y}}$ and $\mathbf{y}$ is minimal. The idea is to use a function that depends on both $\hat{\mathbf{y}}$ and $\mathbf{y}$, and then minimize it. There are several possibilities, but we will choose the mean squared error, which in this case is given by

$$E = \frac{1}{n} \lVert \mathbf{y} - \hat{\mathbf{y}} \rVert^2$$

in vectorial form. Note that we are using the Euclidean norm to measure the distance between $\hat{\mathbf{y}}$ and $\mathbf{y}$. Sometimes it is convenient to represent the function $E$ using the components of the vectors $\hat{\mathbf{y}}$ and $\mathbf{y}$:

$$E = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
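As a small illustration of the two equivalent ways of writing the mean squared error, here is a minimal sketch in plain Python. The helper names (`predict`, `mse_components`, `mse_norm`) and the toy data are assumptions made up for this example, not part of the derivation.

```python
import math

def predict(w, b, x):
    """Return the predictions w * x_i + b for every input x_i."""
    return [w * xi + b for xi in x]

def mse_components(y, y_hat):
    """Mean squared error written as an average over components."""
    n = len(y)
    return sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat)) / n

def mse_norm(y, y_hat):
    """Mean squared error written as the squared Euclidean norm over n."""
    diff = [yi - yhi for yi, yhi in zip(y, y_hat)]
    norm = math.sqrt(sum(d * d for d in diff))
    return norm ** 2 / len(y)

# Hypothetical toy data, chosen only for illustration (exactly y = 2x + 1).
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]

# A deliberately imperfect guess for the weight, with the correct bias.
y_hat = predict(w=1.5, b=1.0, x=x)

print(mse_components(y, y_hat))
print(mse_norm(y, y_hat))
```

Both functions should agree up to floating-point rounding, since they evaluate the same quantity.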
Minimizing the function E
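To minimize $E$ we first write $\hat{\mathbf{y}}$ as

$$\hat{\mathbf{y}} = w\mathbf{x} + \mathbf{b} \tag{5}$$

where $\mathbf{x}$ is the training input and $\mathbf{b}$ is the bias, written as a vector whose $n$ components all equal $b$. Using equation (5) in the error function we have

$$E = \frac{1}{n} \lVert \mathbf{y} - w\mathbf{x} - \mathbf{b} \rVert^2$$

We know that for a vector $\mathbf{v}$, we can write its squared norm as

$$\lVert \mathbf{v} \rVert^2 = \mathbf{v} \cdot \mathbf{v}$$

Using this result we can write the error function in the following way:

$$E = \frac{1}{n} (\mathbf{y} - w\mathbf{x} - \mathbf{b}) \cdot (\mathbf{y} - w\mathbf{x} - \mathbf{b})$$

Expanding the product of the terms in parentheses we have

$$E = \frac{1}{n} \left( \mathbf{y}\cdot\mathbf{y} - w\,\mathbf{y}\cdot\mathbf{x} - \mathbf{y}\cdot\mathbf{b} - w\,\mathbf{x}\cdot\mathbf{y} + w^2\,\mathbf{x}\cdot\mathbf{x} + w\,\mathbf{x}\cdot\mathbf{b} - \mathbf{b}\cdot\mathbf{y} + w\,\mathbf{b}\cdot\mathbf{x} + \mathbf{b}\cdot\mathbf{b} \right)$$

After making simplifications we arrive at the following form for the error:

$$E = \frac{1}{n} \left( \lVert\mathbf{y}\rVert^2 + w^2 \lVert\mathbf{x}\rVert^2 + \lVert\mathbf{b}\rVert^2 - 2w\,\mathbf{x}\cdot\mathbf{y} - 2\,\mathbf{y}\cdot\mathbf{b} + 2w\,\mathbf{x}\cdot\mathbf{b} \right) \tag{8}$$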
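A quick numerical sanity check can make the expansion easier to trust. The sketch below compares the error computed directly from the residuals with the expanded form (sum of squared norms minus the cross terms). The function names and the sample values of `w`, `b`, `x`, and `y` are hypothetical and serve only this check.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))

def error_direct(w, b, x, y):
    """E = (1/n) * ||y - w*x - b||^2, computed from the residuals."""
    n = len(x)
    return sum((yi - w * xi - b) ** 2 for xi, yi in zip(x, y)) / n

def error_expanded(w, b, x, y):
    """E = (1/n) * (||y||^2 + w^2 ||x||^2 + ||b||^2
                    - 2w x.y - 2 y.b + 2w x.b),
    where the bias vector has all n components equal to b."""
    n = len(x)
    b_vec = [b] * n
    return (
        dot(y, y)
        + w ** 2 * dot(x, x)
        + dot(b_vec, b_vec)
        - 2 * w * dot(x, y)
        - 2 * dot(y, b_vec)
        + 2 * w * dot(x, b_vec)
    ) / n

# Hypothetical values, chosen only for illustration.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.1, 2.9, 5.2, 6.8]
w, b = 1.7, 0.5

print(error_direct(w, b, x, y))
print(error_expanded(w, b, x, y))
```

The two printed values should match up to floating-point rounding for any choice of `w` and `b`.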
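To find the $w$ that minimizes $E$ we first have to pay attention to some aspects of the function $E$. First, we see that $E$ is a quadratic function of $w$, and the coefficient of $w^2$, namely $\lVert\mathbf{x}\rVert^2 / n$, is greater than or equal to zero because the magnitude of a vector is never negative (it is strictly positive whenever $\mathbf{x} \neq \mathbf{0}$). Because of this, if we take the derivative of $E$ with respect to $w$ and set it equal to zero, the value of $w$ that we find will be precisely the value that minimizes $E$. To carry out this calculation, first we have:

$$\frac{\partial E}{\partial w} = 0$$

Observing equation (8) and ignoring the terms that do not depend on $w$, we have:

$$\frac{\partial}{\partial w} \left[ \frac{1}{n} \left( w^2 \lVert\mathbf{x}\rVert^2 - 2w\,\mathbf{x}\cdot\mathbf{y} + 2w\,\mathbf{x}\cdot\mathbf{b} \right) \right] = 0$$

After computing the derivatives we have:

$$\frac{2}{n} \left( w \lVert\mathbf{x}\rVert^2 - \mathbf{x}\cdot\mathbf{y} + \mathbf{x}\cdot\mathbf{b} \right) = 0$$

Solving for $w$ we find:

$$w = \frac{\mathbf{x}\cdot\mathbf{y} - \mathbf{x}\cdot\mathbf{b}}{\lVert\mathbf{x}\rVert^2}$$

We can write this result in the form:

$$w = \frac{\mathbf{x}\cdot(\mathbf{y} - \mathbf{b})}{\lVert\mathbf{x}\rVert^2}$$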
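The closed-form weight can be sketched in a few lines of plain Python. This example assumes, as the derivation does, that the bias $b$ is a fixed, known scalar and only $w$ is optimized. The names `fit_weight` and `mse` and the toy data are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))

def fit_weight(x, y, b=0.0):
    """Weight w = x.(y - b) / ||x||^2 that minimizes the MSE for a fixed bias b."""
    xx = dot(x, x)
    if xx == 0:
        # The coefficient of w^2 vanishes, so the minimizer is not unique.
        raise ValueError("x must contain at least one nonzero entry")
    shifted = [yi - b for yi in y]  # components of y - b
    return dot(x, shifted) / xx

def mse(w, b, x, y):
    """Mean squared error of the predictions w * x_i + b."""
    return sum((yi - w * xi - b) ** 2 for xi, yi in zip(x, y)) / len(x)

# Hypothetical noisy data scattered around y = 2x + 1.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.1, 4.9, 7.2, 8.8]
b = 1.0  # assumed known in this sketch

w_star = fit_weight(x, y, b)
print(w_star)

# Moving w away from w_star in either direction should increase the error.
print(mse(w_star, b, x, y))
print(mse(w_star + 0.1, b, x, y))
print(mse(w_star - 0.1, b, x, y))
```

Because $E$ is a parabola in $w$ that opens upward, any perturbation of `w_star` gives a larger error, which is what the last three lines illustrate.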
Particular case: bias equal to zero
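If we consider $\mathbf{b} = \mathbf{0}$, we get the particular case:

$$w = \frac{\mathbf{x}\cdot\mathbf{y}}{\lVert\mathbf{x}\rVert^2} \tag{11}$$

One interesting thing about equation (11) is that we can write $w$ in terms of the angle $\theta$ between $\mathbf{x}$ and $\mathbf{y}$. First we notice that

$$\mathbf{x}\cdot\mathbf{y} = \lVert\mathbf{x}\rVert \, \lVert\mathbf{y}\rVert \cos\theta$$

Using this relation we can rewrite $w$ in the following form:

$$w = \frac{\lVert\mathbf{y}\rVert}{\lVert\mathbf{x}\rVert} \cos\theta$$

This is a very interesting result! This form of $w$ tells us that its absolute size depends on how big the vector $\mathbf{y}$ is compared to the vector $\mathbf{x}$. The factor $\cos\theta$ determines the sign of $w$. Although this form of $w$ is less convenient for practical purposes, it tells us more about the structure of $w$. Finally, if we plug the result for $w$ (without bias) into the original equation, we have:

$$\hat{\mathbf{y}} = \frac{\mathbf{x}\cdot\mathbf{y}}{\lVert\mathbf{x}\rVert^2} \, \mathbf{x}$$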
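The zero-bias case can be checked numerically as well. The sketch below computes $w$ both from the dot-product formula and from the angle form, and shows that flipping the sign of $\mathbf{y}$ flips the sign of $w$. It also checks that the residual $\mathbf{y} - \hat{\mathbf{y}}$ is orthogonal to $\mathbf{x}$, which follows from the zero-bias prediction being the projection of $\mathbf{y}$ onto $\mathbf{x}$. All function names and data values are assumptions made for this illustration.

```python
import math

def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))

def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(dot(v, v))

def zero_bias_weight(x, y):
    """w = x.y / ||x||^2, the zero-bias case."""
    return dot(x, y) / dot(x, x)

def zero_bias_weight_from_angle(x, y):
    """w = (||y|| / ||x||) * cos(theta), with theta the angle between x and y."""
    cos_theta = dot(x, y) / (norm(x) * norm(y))
    return norm(y) / norm(x) * cos_theta

# Hypothetical data roughly following y = 2x.
x = [1.0, 2.0, 3.0]
y = [2.0, 3.9, 6.2]

w = zero_bias_weight(x, y)
w_angle = zero_bias_weight_from_angle(x, y)
print(w, w_angle)

# Prediction without bias, and the residual left over.
y_hat = [w * xi for xi in x]
residual = [yi - yhi for yi, yhi in zip(y, y_hat)]
print(dot(residual, x))  # expected to be zero up to rounding

# Reversing y makes cos(theta) negative, so w changes sign.
y_flipped = [-yi for yi in y]
w_flipped = zero_bias_weight(x, y_flipped)
print(w_flipped)
```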