
Linear Regression

When and how to use it, and how to optimize it

Jul 22, 2022·10 min read


Linear regression is the entry point for many people who are studying machine learning. The algorithm is easy to understand and straightforward to implement. In this article we will learn when and how to use linear regression on our data, and also how to optimize it.

When to use linear regression

As the name suggests, linear regression is an algorithm meant for data with a linear relationship. Of course you can use it on nonlinear data, but, as you will see below, the fit is likely to be poor because a straight line can't capture nonlinearity.
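To make this concrete, here is a minimal sketch in plain Python that fits a straight line to an exactly linear data set and to a quadratic one, then compares the mean squared errors. The helper names `fit_line` and `mse` are hypothetical, and `fit_line` uses the standard least-squares formulas for slope and intercept, which is an assumption made for illustration and not the derivation this article follows.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for 1-D data (illustrative helper)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    w = sxy / sxx
    b = mean_y - w * mean_x
    return w, b


def mse(x, y, w, b):
    """Mean squared error of the line w * x + b on the data (x, y)."""
    return sum((w * xi + b - yi) ** 2 for xi, yi in zip(x, y)) / len(x)


x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y_linear = [3.0 * xi + 1.0 for xi in x]  # exactly linear data
y_quadratic = [xi ** 2 for xi in x]      # nonlinear data

w_lin, b_lin = fit_line(x, y_linear)
w_quad, b_quad = fit_line(x, y_quadratic)

mse_linear = mse(x, y_linear, w_lin, b_lin)
mse_quadratic = mse(x, y_quadratic, w_quad, b_quad)

print(f"linear data:    w={w_lin:.2f}, b={b_lin:.2f}, MSE={mse_linear:.2f}")
print(f"quadratic data: w={w_quad:.2f}, b={b_quad:.2f}, MSE={mse_quadratic:.2f}")
```

On the linear data the line reproduces every point, while on the symmetric quadratic data the best line is flat and leaves a large error.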

The problem

Suppose we have two vectors of data points, $\bold{x}$ and $\bold{y}$, where $\bold{x} \in \mathbb{R}^{N}$ and $\bold{y} \in \mathbb{R}^{N}$. Let's assume here that $\bold{x}$ and $\bold{y}$ are linearly correlated and we want to predict $\bold{y}$ given $\bold{x}$. We know that the most general form of linear dependence on one variable is given by

$$y = wx + b \tag{1}$$

where $y$ is the output variable, $w$ is the weight (also known as the slope), $x$ is the input variable, and $b$ is the bias (also known as the intercept). Remembering that $\bold{x}$ and $\bold{y}$ are vectors of size $N$, we can write equation (1) in the following vector form:

$$\begin{bmatrix}y_{1}\\y_{2}\\y_{3}\\\vdots\\y_{N}\end{bmatrix}=w\begin{bmatrix}x_{1}\\x_{2}\\x_{3}\\\vdots\\x_{N}\end{bmatrix}+b\begin{bmatrix}1\\1\\1\\\vdots\\1\end{bmatrix}$$

Representing vectors with bold letters, with $\bold{b}$ denoting the vector whose $N$ entries all equal $b$, we have:

$$\bold{y} = w\bold{x} + \bold{b}$$
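As a small illustration of the vector form, the sketch below builds the output vector from a weight, an input vector, and a bias multiplied by a vector of ones. The function name `linear_model` and the sample values are assumptions chosen for the example.

```python
def linear_model(w, x, b):
    """Return w * x + b * ones, computed entry by entry."""
    ones = [1.0] * len(x)  # the vector of ones that multiplies the bias
    return [w * xi + b * oi for xi, oi in zip(x, ones)]


x = [0.0, 1.0, 2.0, 3.0]
y = linear_model(2.0, x, 1.0)
print(y)
```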

For simplicity, suppose that $\bold{x}$ and $\bold{y}$ are already the training data so we don't need to use any subscripts. Denoting by $\hat{\bold{y}}$ the vector of predictions and by $\bold{y}$ the vector of true values, the objective here is to find $w$ such that the error between $\bold{y}$ and $\hat{\bold{y}}$ is minimal. The idea is to choose a function that depends on both $\hat{\bold{y}}$ and $\bold{y}$, and then minimize it. There are several possibilities, but we will choose the mean squared error, which in this case is given by

$$\mathsf{E}(\hat{\bold{y}}, \bold{y}) = \frac{1}{N} \|\hat{\bold{y}} - \bold{y}\|_{2}^{2}$$

in vector form. Note that we are using the $L^{2}$ vector norm to measure the distance between $\hat{\bold{y}}$ and $\bold{y}$. Sometimes it is convenient to write $\mathsf{E}$ in terms of the components of the vectors $\hat{\bold{y}}$ and $\bold{y}$:

$$\mathsf{E}(\hat{\bold{y}}, \bold{y}) = \frac{1}{N} \sum_{i = 1}^{N}(\hat{y}_{i} - y_{i})^{2}$$
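The two ways of writing the mean squared error can be checked against each other with a short sketch. The function names and the sample vectors below are assumptions made for illustration.

```python
import math


def l2_norm(u):
    """Euclidean (L2) norm of a vector given as a list of floats."""
    return math.sqrt(sum(ui * ui for ui in u))


def mse_vector_form(y_hat, y):
    """Squared L2 norm of the difference, divided by N."""
    diff = [a - b for a, b in zip(y_hat, y)]
    return l2_norm(diff) ** 2 / len(y)


def mse_component_form(y_hat, y):
    """Average of the squared component-wise differences."""
    return sum((a - b) ** 2 for a, b in zip(y_hat, y)) / len(y)


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 1.5, 3.0, 5.0]

e_vec = mse_vector_form(y_pred, y_true)
e_comp = mse_component_form(y_pred, y_true)
print(e_vec, e_comp)  # equal up to floating-point rounding
```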

Minimizing the function E

To minimize $\mathsf{E}$ we first write $\hat{\bold{y}}$ as

$$\hat{\bold{y}} = w \bold{x} + \bold{b}, \tag{5}$$

where $\bold{x}$ is the training input and $\bold{b}$ is the bias vector. Substituting equation (5) into the error function $\mathsf{E}$ we have

$$\mathsf{E} = \frac{1}{N} \|w\bold{x} + \bold{b} - \bold{y}\|_{2}^{2}.$$

We know that for a vector $\bold{u}$, we can write its $L^{2}$ norm as

$$\|\bold{u}\|_{2} = \sqrt{\bold{u}^{T}\bold{u}}.$$

Using this result we can write the error function in the following way:

$$\mathsf{E} = \frac{1}{N} (w \bold{x} + \bold{b} - \bold{y})^{T}(w \bold{x} + \bold{b} - \bold{y}).$$

Expanding the product of the terms in parentheses we have

$$\mathsf{E}=\frac{1}{N}\left[(w\bold{x})^{T}w\bold{x}+(w\bold{x})^{T}\bold{b}-(w\bold{x})^{T}\bold{y}+\bold{b}^{T}w\bold{x}+\bold{b}^{T}\bold{b}-\bold{b}^{T}\bold{y}-\bold{y}^{T}w\bold{x}-\bold{y}^{T}\bold{b}+\bold{y}^{T}\bold{y}\right].$$

After making simplifications we arrive at the following form for the error:

$$\mathsf{E}=\frac{1}{N}\left[\bold{x}^{T}\bold{x}w^{2}+2\bold{x}^{T}\bold{b}w-2\bold{x}^{T}\bold{y}w+\bold{b}^{T}\bold{b}-2\bold{y}^{T}\bold{b}+\bold{y}^{T}\bold{y}\right]. \tag{8}$$
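A quick numerical check can confirm that the expanded form agrees with the squared norm of the residual. This is only a sketch: the helper `dot` and the sample values of `x`, `y`, `w`, and `b` are assumptions chosen for illustration.

```python
def dot(u, v):
    """Inner product u^T v of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


x = [1.0, 2.0, 3.0]
y = [2.0, 4.5, 5.5]
w, b = 1.5, 0.5
N = len(x)
b_vec = [b] * N  # bias vector with every entry equal to b

# Direct form: (1/N) * ||w x + b - y||^2
residual = [w * xi + bi - yi for xi, bi, yi in zip(x, b_vec, y)]
e_direct = dot(residual, residual) / N

# Expanded form: the six terms of the simplified expression
e_expanded = (
    dot(x, x) * w ** 2
    + 2 * dot(x, b_vec) * w
    - 2 * dot(x, y) * w
    + dot(b_vec, b_vec)
    - 2 * dot(y, b_vec)
    + dot(y, y)
) / N

print(e_direct, e_expanded)  # equal up to floating-point rounding
```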

To find the $w$ that minimizes $\mathsf{E}$ for a fixed bias, we first have to pay attention to some aspects of the function $\mathsf{E}$. First, we see that $\mathsf{E}$ is a quadratic function of $w$. Second, the coefficient of $w^{2}$ is $\bold{x}^{T}\bold{x} = \|\bold{x}\|_{2}^{2}$ (up to the positive factor $1/N$), which is never negative and is strictly positive whenever $\bold{x}$ is not the zero vector. In that case $\mathsf{E}$ is a parabola in $w$ that opens upward, so if we take the derivative of $\mathsf{E}$ with respect to $w$ and set it equal to zero, the value of $w$ that we find is precisely the value that minimizes $\mathsf{E}$. To carry out this calculation, we first set:

$$\frac{\mathrm{d} \mathsf{E}}{\mathrm{d} w}=0.$$

Observing equation (8) and ignoring the terms that do not depend on $w$, we have:

$$\frac{1}{N}\left[\frac{\mathrm{d}}{\mathrm{d} w}(\bold{x}^{T}\bold{x}w^{2})+\frac{\mathrm{d}}{\mathrm{d} w}(2\bold{x}^{T}\bold{b}w)-\frac{\mathrm{d}}{\mathrm{d} w}(2\bold{x}^{T}\bold{y}w)\right]=0.$$

After computing the derivatives and multiplying both sides by $N$ we have:

$$2\bold{x}^{T}\bold{x}w+2\bold{x}^{T}\bold{b}-2\bold{x}^{T}\bold{y}=0.$$
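The derivative on the left-hand side can be sanity-checked against a central finite difference of the error. In this sketch the function names, the step size `h`, and the sample data are assumptions, and the analytic derivative keeps the factor 1/N, which does not change where the derivative is zero.

```python
def dot(u, v):
    """Inner product u^T v of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


def error(w, x, y, b):
    """Mean squared error of the model w * x + b."""
    return sum((w * xi + b - yi) ** 2 for xi, yi in zip(x, y)) / len(x)


def d_error_dw(w, x, y, b):
    """Analytic derivative: (2/N) * (x^T x w + x^T b - x^T y)."""
    n = len(x)
    b_vec = [b] * n
    return 2.0 / n * (dot(x, x) * w + dot(x, b_vec) - dot(x, y))


x = [1.0, 2.0, 3.0]
y = [2.0, 4.5, 5.5]
b = 0.5
w = 1.5
h = 1e-5  # finite-difference step

numeric = (error(w + h, x, y, b) - error(w - h, x, y, b)) / (2 * h)
analytic = d_error_dw(w, x, y, b)
print(numeric, analytic)  # should agree to several decimal places
```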

Solving for $w$ we find:

$$w = \frac{\bold{x}^{T}\bold{y}-\bold{x}^{T}\bold{b}}{\bold{x}^{T}\bold{x}}.$$

We can write this result in the form:

$$w=\frac{\bold{x}^{T}}{\|\bold{x}\|_{2}^{2}}(\bold{y}-\bold{b}).$$
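The closed-form weight can be sketched in a few lines of plain Python. The name `optimal_weight`, the zero-vector guard, and the sample data are assumptions added for illustration. The check at the end compares the error at the computed weight with the error slightly to either side of it.

```python
def dot(u, v):
    """Inner product u^T v of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


def error(w, x, y, b):
    """Mean squared error of the model w * x + b."""
    return sum((w * xi + b - yi) ** 2 for xi, yi in zip(x, y)) / len(x)


def optimal_weight(x, y, b):
    """Closed-form w = x^T (y - b) / ||x||^2 for a fixed bias b."""
    xx = dot(x, x)
    if xx == 0.0:
        raise ValueError("x must not be the zero vector")
    return sum(xi * (yi - b) for xi, yi in zip(x, y)) / xx


x = [1.0, 2.0, 3.0]
y = [2.0, 4.5, 5.5]
b = 0.5

w_star = optimal_weight(x, y, b)
e_star = error(w_star, x, y, b)
e_left = error(w_star - 0.1, x, y, b)
e_right = error(w_star + 0.1, x, y, b)

print(w_star, e_star, e_left, e_right)
```

Because the error is a parabola in the weight, moving away from `w_star` in either direction increases it by the same amount.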

Particular case: bias equal to zero

If we consider $b = 0$, we get the particular case:

$$w = \frac{\bold{x}^{T}\bold{y}}{\|\bold{x}\|_{2}^{2}}. \tag{11}$$

One interesting thing about equation (11) is that we can write $w$ in terms of the angle $\theta$ between $\bold{x}$ and $\bold{y}$. First we notice that

$$\bold{x}^{T}\bold{y}=\|\bold{x}\|_{2}\|\bold{y}\|_{2}\cos(\theta).$$

Using this relation we can rewrite $w$ in the following form:

$$w=\frac{\|\bold{y}\|_{2}}{\|\bold{x}\|_{2}}\cos(\theta).$$

This is a very interesting result! This form of $w$ tells us that its absolute size scales with how big the vector $\bold{y}$ is compared to the vector $\bold{x}$, and shrinks as the two vectors become less aligned, since $|\cos(\theta)| \le 1$. The sign of $\cos(\theta)$ determines the sign of $w$. Although this form of $w$ is less convenient for practical purposes, it tells us more about the structure of $w$. Finally, if we plug the result for $w$ (without bias) into the original equation with $b = 0$, we have:

$$\hat{\bold{y}}=\frac{\bold{x}^{T}\bold{y}}{\|\bold{x}\|_{2}^{2}}\bold{x}.$$
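The zero-bias case can be illustrated with a short sketch that computes the weight both from the inner product and from the angle between the two vectors, and then forms the predictions with the bias set to zero. The helper names and the sample data are assumptions chosen for the example.

```python
import math


def dot(u, v):
    """Inner product u^T v of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


def norm(u):
    """Euclidean (L2) norm of a vector."""
    return math.sqrt(dot(u, u))


x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

# Weight from the inner-product form: x^T y / ||x||^2
w_dot = dot(x, y) / dot(x, x)

# Weight from the angle form: (||y|| / ||x||) * cos(theta)
cos_theta = dot(x, y) / (norm(x) * norm(y))
w_angle = norm(y) / norm(x) * cos_theta

# Flipping the sign of y flips cos(theta), and therefore the sign of w
y_flipped = [-yi for yi in y]
w_flipped = dot(x, y_flipped) / dot(x, x)

# Predictions with zero bias: y_hat = w * x
y_hat = [w_dot * xi for xi in x]

print(w_dot, w_angle, w_flipped)
print(y_hat)
```

Both forms give the same weight, and the sign change under `y_flipped` shows how the cosine alone controls the sign.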