# Linear Regression

Author: Divij Handa

Regression is a supervised machine learning algorithm, where we want to develop a mathematical formula to best fit our problem. Given a data set, we want to know the exact function that describes the relationship between the input features and desired output. One such form of regression is Linear Regression. Linear regression as the name suggests is when we try to fit a linear line (of the form y= θ₀+θ₁x) to our data. For example, for a classroom, the marks of a student are as follows

We can create multiple linear models by changing the value of parameters, i.e. θ₀ and θ₁, to get a different line. Two such lines are represented in the following figure (in red and green).

To figure out which line is the best fit for our data, we need to define an error term (a.k.a. cost function). Cost function is the measurement of error defined by our model. By performing linear regression, we aim to minimize the value of cost function (we want our error to be minimum). One popluar cost function is Residual Sum of Squares (RSS). RSS is defined be taking the L2 distance (Eucledian distance) of the actual points and the given model. Following is the mathematical formula to

calculate RSS $L = \frac{1}{2n_{train}} \sum_{i = 1}^{n_{train}}\left ( y^{(i)} - h_\theta(x^{(i)}) \right )^{2}$ ,

where

$n_{train}$ is the number of training examples,

$y^{(i)}$ is the actual value (given data) and

$hθ(x^{(i)})$ is the predicted value by our model.

To find the best fit line, we can calculate the ideal values of the parameters or can use the concept of Gradient Descent. Let's take a look at both methods.

Calculating Parameters

We know that $L = \frac{1}{2n_{train}} \sum_{i = 1}^{n_{train}}\left ( y^{(i)} - h_\theta(x^{(i)}) \right )^{2}$

$\Rightarrow L = \frac{1}{2n_{train}} \sum_{i = 1}^{n_{train}}\left ( y^{(i)} - (\theta_0 + \theta_1x^{(i)}) \right )^{2}$

$\Rightarrow L = \frac{1}{2n_{train}} \left [ ( y^{(1)} - (\theta_0 + \theta_1x^{(1)}) )^{2} + \cdots + ( y^{(n_{train})} - (\theta_0 + \theta_1x^{(n_{train}) }) )^{2} \right ]$

To simplify the equation, we can use matrix notation, where $x = \begin{bmatrix} 1\\ x \end{bmatrix}$ and $\theta = \begin{bmatrix} \theta_0\\ \theta_1 \end{bmatrix}$

$\Rightarrow L = \frac{1}{2n_{train}} \left [ ( y^{(1)} - (x^{(1)T}\theta) )^{2} + ( y^{(2)} - (x^{(2)T}\theta) )^{2} + \cdots + ( y^{(n_{train})} - (x^{(n_{train})T }\theta) )^{2} \right ]$

To simplify further, we take

$X = \begin{bmatrix} x^{(1)T} \\ x^{(2)T} \\\vdots \\x^{(n_{train})T} \end{bmatrix} = \begin{bmatrix} 1 & x^{(1)} \\ 1 & x^{(1)} \\ \vdots & \vdots \\ 1 & x^{(n_{train})} \end{bmatrix}$   and $Y = \begin{bmatrix} y^{(1)}\\ y^{(2)}\\ \vdots \\ y^{(n_{train})} \end{bmatrix}$

$\Rightarrow L = \frac{1}{2n_{train}}(y - X\theta)^T(y - X\theta)$

$\Rightarrow L = \frac{1}{2n_{train}}(y^T -( X\theta))^T(y - X\theta)$

$\Rightarrow L = \frac{1}{2n_{train}}(y^T - \theta^T X^T)(y - X\theta)$

$\Rightarrow L = \frac{1}{2n_{train}}(y^Ty - y^TX\theta -\theta^TX^Ty + \theta^TX^TX\theta)$

$\Rightarrow L = \frac{1}{2n_{train}}(y^Ty - y^TX\theta -(X\theta)^Ty + \theta^TX^TX\theta)$

$\Rightarrow L = \frac{1}{2n_{train}}(y^Ty - 2y^TX\theta + \theta^TX^TX\theta)$

To minimize the loss function, we need to take the derivative and equate it to zero.

$\triangledown _\theta L = \frac{1}{2n_{train}}(X^T(y - X\theta)) = 0$

$\Rightarrow X^T y - X^T X \theta = 0$

$\Rightarrow X^T X\theta = X^T y$

$\Rightarrow \theta = (X^T X)^{-1}X^T y$

This final result is the optimal value of $\theta = \begin{bmatrix} \theta_0\\ \theta_1 \end{bmatrix}$ . Thus, the required model will be hθ(x)= θ₀+θ₁x

The model calculated by this method for the above example is as follows:

Gradient Descent algorithm is used for calculating the parameters for the model. Although by applying gradient descent we are not guaranteed the optimal result, it is used for many models (not just linear regression). Mathematically calculating the parameters for higher-order models is tedious. Gradient Descent is a general approach for solving these higher-order models.

As discussed earlier, the cost function (RSS) is as follows:

$L = \frac{1}{2n_{train}}\sum_{i = 1}^{n_{train}}(y^{(i)} - h_\theta (x^{(i)}))^2$

We calculate the next value of parameters using the following equation

$\theta^{(k+1)} \leftarrow \theta^{(k)} - \eta \triangledown _\theta L(\theta^{(k)})$

Here $\theta^{(k+1)}$θ is the next value of parameter and

$\eta$ defines the size of the step that we take for each iteration of parameters. Learning parameter defines the size of the step that we take for each iteration of parameters. A large value converges to the optimal solution faster but a smaller value gives a more accurate result.

For linear regression, the above formula is simplified for θ₀ and θ₁ as follows:

Using the above gradient descent formula, the line for the example is as follows :

The above explanation was for simple linear regression. We can add higher-order terms for more accurate representation. We can also add other functions (e.g. trigonometric functions) to add to the complexity of the model. To find the best-fit parameters, we use a gradient descent algorithm. Example of a complex model:  h₀(x) = θ₀ + θ₁ x + θ₂ x⁴ + θ₃ tanx + θ₄ lnx