Understand linear regression through many facets

July 14, 2023



"Another tutorial on linear regression?!" Someone exclaims. Before I start, I feel the need to clarify what this blog is and is not about. After all, there are probably hundreds if not thousands of tutorials, articles, and online courses on this topic.

But first, let me motivate this topic a little using a quote from text Mathematics for Machine Learning.

"Regression is a fundamental problem in machine learning, and regression problems appear in a diverse range of research areas and applications, including time-series analysis (e.g., system identification), control and robotics (e.g., reinforcement learning, forward/inverse model learning), optimization (e.g., line searches, global optimization), and deep learning applications (e.g., computer games, speech-to-text translation, image recognition, automatic video annotation). Regression is also a key ingredient of classification algorithms." -- Deisenroth, M. P., Faisal, A. A.,, Ong, C. S. (2020). Mathematics for Machine Learning. Cambridge University Press.

Interestingly, because of its ubiquitous influence in many fields (ML, stats, optimization, control theory, etc.), I found that the notations and solutions to this problem can be inconsistent and sometimes even convoluted1 2 3.

This is where this post comes in. It’s meant to provide a holistic view on different ways to formulate linear regression problem from the perspectives of machine learning, statistics & probability, and optimization. In this post, we'll introduce linear regression as the most fundamental regression problem. And on top of that, I'll summarize all the different notations and problem formulations I’ve seen so far.

What it is not about: This post is not a step-by-step tutorial to introduce any specific linear regression concept from stretch.

In this post, I assume that readers don’t have specialized knowledge in ML. If you do, feel free to skip any section that is less relevant for you.

Notation Convention

Notations: There are simply too many notations floating around in machine learning space so in this blog we will stick with using inputs (aka features) denoted by xx or XX (data matrix), output (aka prediction, hypothesis) denoted by y^\hat{y}, weights (aka parameters) denoted by ww or WW, bias (aka error) denoted by bb. For more details about different notations please refer to "Miscellaneous: notations" section at the end.

Other notation:

  • m=m= number of training examples
  • n=n= number of inputs not including bias
  • x(i)=x^{(i)}= inputs (features) of ithi^{th} training example (i.e. a vector)
  • xj(i)=x_j^{(i)}= value of input (feature) jj in ithi^{th} training example (i.e. a scalar)

What's linear regression?

My first contact with linear regression dates back to 2020, when I took Andrew Ng’s Machine Learning course on Coursera. Simply put, linear regression is an approach to finding a linear equation (model) that takes in one or more inputs and outputs a continuous output variable (target variable).

An example could be predicting sales (output variable) based on TV’s advertising budget (input variable). From the graph below, red dots represent the historic data; the blue line is that linear equation (our model) that we magically found, which outputs the best predictions for both training & testing data (aka "generalize well").

How do we use the model? Now if I want to predict sales with a TV advertising budget of 200 units using this model, I can trace my finder up from the 200 tick on the x-axis until it meets the blue line (i.e., our model), then trace it to the left until it touches the y-axis. The exercise would give you a prediction about 16–17 units of sales.

lr example
Simple linear regression modeling TV advertising budget against sales


We can write out this linear equation (blue line) in a mathematic form:

y^=b+w1x1,\hat{y} = b + w_1 x_1,

where y^\hat{y} is the output variable, our predicted height, and xx is our input, weight.

A more general format to write it in case we're dealing with multiple inputs (multi-variate linear regression) is:

y^=bx0+w1x1++wnxn=XW=[x0(1)x1(1)...xn(1)x0(m)x1(m)...xn(m)][w0w1wn],\hat{y} = b \pmb{x_0} + w_1 \pmb{x_1} + \dots + w_n \pmb{x_n} = XW = \begin{bmatrix} x_0^{(1)} & x_1^{(1)} & ... & x_n^{(1)} \\ \vdots & \vdots & \ddots & \vdots \\ x_0^{(m)} & x_1^{(m)} & ... & x_n^{(m)} \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_n \end{bmatrix},

where xiRm×1\pmb{x_i} \in \mathbb{R}^{m \times 1}, XRm×(n+1)X \in \mathbb{R}^{m \times (n+1)} (data matrix), WR(n+1)×1W \in \mathbb{R}^{(n+1) \times 1}, y^Rm×1\hat{y} \in \mathbb{R}^{m \times 1}. Note that here we let all x0=1x_0=1 for every row in XX to complete the data matrix so that we can include bias as a weight (w0=bw_0=b).

How to fit a LR model?

Each formulation will be discussed in two parts:

  1. high-level intuition
  2. some math behind it

I won't go too deep into the math here as you can find lots of other articles and textbook reference on math. If you'd like, you can totally skip the math part and still understand each formulation pretty well.

Formulation 1: Gradient descent ("the ML way")


Gradient descent is probably the most popular opitimization algorithm right now in the machine learning space, as it’s fundamental to training not only classic ML models but also all deep neural networks that require backpropagation.

The goal here is to minimize the cost function in high-dimensional space by iteratively taking "small steps" in the direction with a lower cost. One analogy in the 3D world could be descending a mountain from its peak; here, the altitude will be your cost function’s output ff. How many weights are you optimizing then? You probably would have guessed—it’s your latitude and longitude (w0w_0 and w1w_1 below!). Here’s a visual to help you imagine.

What if we have more than two inputs (features) in our dataset? The math is the same: simply take the partial derivative w.r.t. the weight of each input variable from w0,,wnw_0, \dotsb, w_n where n is the number of input variables.

gradient-descent example
Note: the notation here uses J for cost function and theta for weights

A high-level summary of the steps takes in gradient descent is described below:

  1. Define a cost function f(w)f(\vec{w}) (e.g. mean squared error (MSE), mean absolute error (MAE), etc.)
  2. Minimize f(w)f(\vec{w}) using gradient descent technique until the change of ff between iteration ii and i+1i+1 converge within a bound
    1. Take a step towards the opposite direction of the gradient of cost function
    2. Adjust ww
    3. Take another step
    4. Repeat until convergence


Step 1: Define a MSE cost function f(w)f(\vec{w})

f(w)=f(w0,,wn)=12mi(y^(i)y(i))2=12mXWy2f(w) = f(w_0, \dotsb, w_n) = \frac{1}{2m} \sum_i (\hat{y}^{(i)} - y^{(i)})^2 = \frac{1}{2m} || XW - y ||_2

Step 2: Take the gradient w.r.t each wiw_i and update wiw_i simultaneously through vectorization

wj:=wjαwjf(w)=wjα1mi=1m(y^(i)y(i))xj(i)w_j := w_j - \alpha \frac{\partial}{\partial w_j} f(w) = w_j - \alpha \frac{1}{m} \sum_{i=1}^m (\hat{y}^{(i)} - y^{(i)})x_j^{(i)}

Step 3: Compute loss and if Δf>predefined threshold\Delta f > \text{predefined threshold} then repeat step 2 until converge.

Formulation 2: The Normal equation


Whenever we approach a mathematical problem, in this case, minimizing our chosen cost function, there are usually two types of solutions: analytical solutions and numerical solutions.

Analytical solutions have a strict logic and return an exact answer; for example, factorizing a matrix using singular value decomposition or other linear algebra methods based on the properties of the matrix.

Unfortunately, there are many problems that do not have an exact solution, such as the travelling salesman problem. A numerical solution makes reasonable guesses and eventually returns a "good enough" solution that meets certain constraints. Gradient descent gives us a numerical solution of weights because it’s nearly impossible to reach the global minimum once you find yourself optimizing a non-convex cost function in a high-dimensional space like neural networks. Machine Learning Mastery gives a more detailed illustration of the differences.

Normal equation is a analytical solution to find exact WR(n+1)W \in \mathbb{R}^{(n+1)} using data matrix XRm×(n+1)X \in \mathbb{R}^{m \times (n+1)} and ground truth yRmy \in \mathbb{R}^{m}.


Recall previously we're able to write y^\hat{y} in matrix notation:

y^=XW\hat{y} = XW

Then we can substitute the above expression into the cost function f(w)f(w) to rewrite the sum as matrix notation as well:

f(w)=12mi(y^(i)y(i))2=12mXWy2f(w) = \frac{1}{2m} \sum_i (\hat{y}^{(i)} - y^{(i)})^2 = \frac{1}{2m} || XW - y ||_2

2\| \cdot \|_2 denotes the Euclidean norm of a vector defined as x2:=x12++xn2=xTx\|x\|_2 := \sqrt{x_1^2 + \cdots + x_n^2} = x^Tx, which is equivalent to the \sum expression! Now expand the Euclidean norm using matrix transpose identities you will get:

f(w)=12m(XWy)T(XWy)=((XW)TyT)(XWy)f(w) = \frac{1}{2m} (XW - y)^T(XW - y) = ((XW)^T - y^T)(XW - y) f(w)=(XW)TXW(XW)TyyT(XW)+yTyf(w) = (XW)^T XW - (XW)^Ty - y^T(XW) + y^Ty

Note that XWXW is a vector, and so is yy. So when we multiply one by another, the order does not matter (i.e. xTy=yTxx^Ty = y^Tx). Further simplify the expression into:

f(w)=WXTXW2(XW)Ty+yTyf(w) = WX^TXW - 2(XW)^Ty + y^Ty

To find our unknown WR(n+1)W \in \mathbb{R}^{(n+1)}, we will derive by WW and compare to zero. Deriving by a vector involves matrix caculus but in the end we just derive by each component of the vector, and then combine the resulting derivatives into a vector again. Eli Bendersky has a wonderfun blog explaining how to derive matrix notation if you're interested. The result is:

fW=2XTXW2XTy=0XTXW=XTy\frac{\partial f}{\partial W} = 2X^TXW - 2X^Ty = \vec{0} \Longrightarrow X^TXW = X^Ty

Now, assuming that the matrix XTXX^TX is invertible, we can multiply both sides by (XTX)1(X^TX)^{-1} and get:

W=(XTX)1XTyW^{*} = (X^TX)^{-1}X^Ty

This is the normal equation.

Formulation 3: Minimum likelihood estimation (MLE)

In additional to the first two common ways to look at linear regression, we can take a probablistic approach to formulate the problem and by the end you will see they have very similar steps.


Let's say we get some training examples (i..e our training dataset) and fit a linear regression model (curve fitting) that models these training examples. We assume there's a function ff that maps inputs xRnx \in \mathbb{R}^n to corresponding funciton values f(x)Rf(x) \in \mathbb{R}. The ground truths can then be expressed as y=f(x)+ϵy = f(x) + \epsilon, where ϵ\epsilon is an i.i.d random variable that describes observation noise, which we have no control of.

The goal is to infer the function ff that generated the training data and generalizes well to new input data. Since we're dealing parametric models, a parametric estimation method like MLE or MAP can be applied to estimate the best parameters WW^{*} given certain data.

Note: ϵ\epsilon is just the bias bb in formulation 1 and 2 but we choose different notation here to stick with probabilistic convention.


The mathematical expression of the model is given by

y=xTW+ϵ,ϵN(0,σ2)y = x^TW + \epsilon, \epsilon \sim \mathcal{N}(0, \sigma^2)

Assuming the noise has a Gaussian underlying distribution and we know its variance for the time being. We obtain the maximum likelihood parameters as

WMLE=argmaxWp(yX,W)=argminWlogp(yX,W)W_{MLE} = arg\max_W p(y|X, W) = arg\min_W -\log p(y| X, W)

To find the optimal parameters, we then minimize the negative log-likelihood

logp(yX,W)=logΠi=1mp(y(i)x(i),W)=i=1mlogp(y(i)x(i),W)-\log p(y | X, W) = -\log \Pi_{i=1}^m p(y^{(i)} | x^{(i)}, W) = - \sum_{i=1}^m \log p(y^{(i)} | x^{(i)}, W) L(W):=12σ2i=1m(x(i)Wy(i))2=12σ2i=1m(y^(i)y(i))2=12mXWy2\Longrightarrow \mathcal{L}(W) := \frac{1}{2\sigma^2} \sum_{i=1}^m (x^{(i)}W - y^{(i)})^2 = \frac{1}{2\sigma^2} \sum_{i=1}^m (\hat{y}^{(i)} - y^{(i)})^2 = \frac{1}{2m} || XW - y ||_2

There you go! So the likelihood function L\mathcal{L} is just as same as the cost function in method 2 with different names. The rest is the same - take derivative of L\mathcal{L} w.r.t the parameters and equalize it to zero and solve for the optimal parameters. And you probably guessed it, the result is just normal equation. However, the amazing part is we started from a very different formulation.

Formulation 4: Maximum a-priori estimation (MAP)


MLE and the normal equation are both prone to over-fitting due to its closed-form nature, which means they will fit a curve as perfect as possible given training data. This approach usually won't generalize to testing data and in reality could be computationally costly to compute the inverse of a matrix.

This is where MAP estimation comes in. We can place a prior distribution p(W)p(W) on the weights. Once a dataset XX, yy is available, instead of maximizing the likelihood we seek parameters that maximize the posterior distribution p(WX,y)p(W \mid X, y) to find the optimal parameters.


The posterior over the parameters WW, given the training data XX, yy is obtained by applying Bayes' theorem and then followed by a log-transform as

logp(WX,y)=logp(yX,W)+logp(W)+const\log p(W | X, y) = \log p(y | X, W) + \log p(W) + const

where the constant comprises the terms that are independent of WW. The MAP estimate will be a “compromise” between the prior (our suggestion for plausible parameter values before observing data) and the data-dependent likelihood.

To find the MAP estimate WMAPW_{MAP}, we minimize the negative log-posterior distribution with respect to WW, i.e., we solve

WMAPargminW{logp(yX,W)logp(W)}W_{MAP} \in arg\min_W \{-\log p(y|X, W) - \log p(W)\}

Formulation 5: Least squares problem


Another way to look at a regression problem is to realize that it's a subset of least squares problem. Imagine that we have a matrix AA of size 3×23 \times 2 (i.e. m>nm > n more training examples than number of features), then Ax=bAx = b usually does not have a solution. This is called an overdetermined case in least squares problem where we can't find a perfect solution so instead we seek the least squares solution.


We then try to find set of parameters xx that minimizes the sum of squares of the error

minxRnAxb2\min_{x \in \mathbb{R}^n} ||Ax - b||_2

the solution of optimal parameters xx^* is

x=Ab=(ATA)1ATbx^* = A^{\dagger}b = (A^TA)^{-1}A^Tb

Again, you will find we arrive at the same answer solved in formulation 2 and 3. This tutorial created by Alex Townsend from MIT does an amazing job at explaining how we can get from least squares problem to normal equation.

Miscellaneous: notations


Thanks to Ian Webster for reviewing and making this post better.

Convention used in this blogOther Notations
output variable =y^=y^(x)= \hat{y} = \hat{y}(x)prediction, hypothesis (function) =h(x)=hθ(x)= h(x) = h_{\theta}(x)
input variable =x/X= x/ Xpredictor, regressor, independent variable, observation / data matrix, feature matrix
ground truth =y= yresponse, label, dependent variable, target variable
cost function =f(w)= f(w)loss function, error function, J(θ)J(\theta), objective function
bias =b= berror ϵ\epsilon, observation noise
weight =w/W= w/Wparameter θ\theta, coefficient
learning rate =α= \alpha-


  1. https://online.stat.psu.edu/stat462/node/272/

  2. https://towardsdatascience.com/linear-regression-detailed-view-ea73175f6e86

  3. https://madewithml.com/courses/foundations/linear-regression/