Undocumented Machine Learning (I): Linear Regression

In this post, I will discuss a seldom-documented aspect of, or trick for, one of the simplest models: linear regression. I will show how to make the solution of ridge regression translation invariant, and what the bias term means. Some people might find this obvious; however, it matters for practical purposes, and when we later deal with non-linear models the trick will be far less obvious.

In general, given data ${\{\mathbf{X},\mathbf{y}\}}$, we are interested in fitting the following linear model

$\displaystyle y = \mathbf{w}^T\mathbf{x}+w_0$

Most textbooks will tell you that we don’t have to worry about the bias term ${w_0}$, since we can always augment the variables with an extra dimension: ${\tilde{\mathbf{x}}=[1,\mathbf{x}^T]^T}$ and ${\tilde{\mathbf{w}}=[w_0,\mathbf{w}^T]^T}$. With the augmented variables (dropping the tildes from here on), we only need to fit

$\displaystyle y=\mathbf{w}^T\mathbf{x}$

Your textbook will then tell you that to fit this linear model we minimize

$\displaystyle Q=\|\mathbf{y}-\mathbf{w}^T\mathbf{X}\|^2+\lambda\|\mathbf{w}\|^2 \ \ \ \ \ (1)$

which is called ridge regression. The solution is simply

$\displaystyle \mathbf{w}=(\mathbf{X}\mathbf{X}^T+\lambda \mathbf{I})^{-1}\mathbf{X}\mathbf{y}^T \ \ \ \ \ (2)$
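Equation (2) can be sketched in a few lines of NumPy. This is a minimal illustration, not a library routine; the function name `ridge_augmented` is my own, and I follow the post's convention that ${\mathbf{X}}$ is ${d\times n}$ with one sample per column and ${\mathbf{y}}$ is a row vector.

```python
import numpy as np

def ridge_augmented(X, y, lam):
    """Ridge regression with the bias folded in via an all-ones row.

    X : (d, n) data matrix, one sample per column.
    y : (n,) target vector.
    Returns the augmented weight vector [w0, w1, ..., wd], i.e. eq. (2)
    applied to the augmented variables.
    """
    n = X.shape[1]
    Xt = np.vstack([np.ones(n), X])   # augment: x~ = [1, x^T]^T
    d1 = Xt.shape[0]
    # w~ = (X~ X~^T + lam I)^{-1} X~ y^T
    return np.linalg.solve(Xt @ Xt.T + lam * np.eye(d1), Xt @ y)
```

Note that `lam` here penalizes every component of the augmented weight vector, including the bias; this is exactly the point the rest of the post takes issue with.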

Nothing is new up to now. However, what the textbooks gloss over, the bias term ${w_0}$, is actually important. Here we pursue another derivation to take a better look at what ${w_0}$ means. We can equivalently rewrite (1) as

$\displaystyle Q=\|\mathbf{y}-(\mathbf{w}^T\mathbf{X}+w_0\mathbf{1}^T)\|^2+\lambda\|\mathbf{w}\|^2+\lambda_0w_0^2 \ \ \ \ \ (3)$

What (1) really does is set ${\lambda_0=\lambda}$. However, it actually makes little sense to use a non-zero ${\lambda_0}$, which penalizes the bias ${w_0}$: for example, if we translate our data, the solution for ${\mathbf{w}}$ will change.
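We can check this numerically. The snippet below is an illustrative sketch (the data, the helper name `fit_penalized_bias`, and the choice ${\lambda=\lambda_0=1}$ are my own): shifting both inputs and targets by a constant changes the fitted slope when ${w_0}$ is penalized.

```python
import numpy as np

def fit_penalized_bias(X, y, lam):
    # Augmented ridge as in (2): the penalty lam applies to w0 as well,
    # i.e. lambda0 = lambda in the notation of (3).
    n = X.shape[1]
    Xt = np.vstack([np.ones(n), X])
    return np.linalg.solve(Xt @ Xt.T + lam * np.eye(Xt.shape[0]), Xt @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(1, 50))
y = 3.0 * X[0] + 0.5 + 0.1 * rng.normal(size=50)

w_orig = fit_penalized_bias(X, y, 1.0)
w_shift = fit_penalized_bias(X + 10.0, y + 10.0, 1.0)  # translate the data
# w_orig[1] != w_shift[1]: the slope is not translation invariant
```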

If we want our solution to be invariant w.r.t. translation, we should set ${\lambda_0=0}$, which means we should minimize

$\displaystyle Q=\|\mathbf{y}-(\mathbf{w}^T\mathbf{X}+w_0\mathbf{1}^T)\|^2+\lambda\|\mathbf{w}\|^2 \ \ \ \ \ (4)$

To derive the solution, we first solve ${\frac{\partial Q}{\partial w_0}=0}$ (details omitted here)

$\displaystyle \frac{\partial Q}{\partial w_0}=2nw_0-2(\mathbf{y}-\mathbf{w}^T\mathbf{X})\mathbf{1}=0$

to get

$\displaystyle w_0=\frac{1}{n}(\mathbf{y}-\mathbf{w}^T\mathbf{X})\mathbf{1}=\bar{y}-\mathbf{w}^T\bar{\mathbf{x}} \ \ \ \ \ (5)$

Substituting (5) back to the target (4), we have

$\displaystyle \begin{array}{rcl} Q&=&\|\mathbf{y}-(\mathbf{w}^T\mathbf{X}+(\bar{y}-\mathbf{w}^T\bar{\mathbf{x}})\mathbf{1}^T)\|^2+\lambda\|\mathbf{w}\|^2\\ &=&\|(\mathbf{y}-\bar{y}\mathbf{1}^T)-\mathbf{w}^T(\mathbf{X}-\bar{\mathbf{x}}\mathbf{1}^T)\|^2+\lambda\|\mathbf{w}\|^2 \end{array}$

The solution for ${\mathbf{w}}$ then is

$\displaystyle \mathbf{w}=((\mathbf{X}-\bar{\mathbf{x}}\mathbf{1}^T)(\mathbf{X}-\bar{\mathbf{x}}\mathbf{1}^T)^T+\lambda \mathbf{I})^{-1}(\mathbf{X}-\bar{\mathbf{x}}\mathbf{1}^T)(\mathbf{y}-\bar{y}\mathbf{1}^T)^T \ \ \ \ \ (6)$

This solution means that we first center the data ${\mathbf{X}}$ and ${\mathbf{y}}$, then regress the centered data in the usual way as in (2) to get ${\mathbf{w}}$, and finally substitute ${\mathbf{w}}$ back into (5) to get ${w_0}$. The bias term is just the average difference between your targets ${y}$ and your regression values ${\mathbf{w}^T\mathbf{X}}$. In this way, the solution for ${\mathbf{w}}$ is translation invariant, since we always center the data first.
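The center-then-regress recipe of (6) and (5) can be sketched as follows (again a minimal illustration; `ridge_centered` is my own name, with the post's ${d\times n}$ convention for ${\mathbf{X}}$). Translating the data changes only ${w_0}$, while ${\mathbf{w}}$ stays fixed.

```python
import numpy as np

def ridge_centered(X, y, lam):
    """Translation-invariant ridge: center the data, solve (6), then
    recover the bias via (5).

    X : (d, n) data matrix, one sample per column.
    y : (n,) target vector.
    Returns (w, w0).
    """
    xbar = X.mean(axis=1, keepdims=True)   # per-feature mean, shape (d, 1)
    ybar = y.mean()
    Xc = X - xbar                          # X - xbar 1^T
    yc = y - ybar                          # y - ybar 1^T
    d = X.shape[0]
    w = np.linalg.solve(Xc @ Xc.T + lam * np.eye(d), Xc @ yc)   # eq. (6)
    w0 = ybar - w @ xbar.ravel()                                # eq. (5)
    return w, w0
```

Shifting ${\mathbf{X}}$ and ${\mathbf{y}}$ by constants leaves `w` unchanged (the centered matrices are identical), and the shift is absorbed entirely by `w0`.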

It is also interesting to see that the general solution with a non-zero ${\lambda_0}$ is to solve

$\displaystyle \frac{\partial Q}{\partial w_0}=2nw_0-2(\mathbf{y}-\mathbf{w}^T\mathbf{X})\mathbf{1}+2\lambda_0w_0=0$

to get

$\displaystyle \begin{array}{rcl} w_0&=&\frac{1}{n+\lambda_0}(\mathbf{y}-\mathbf{w}^T\mathbf{X})\mathbf{1}\\ &=&\bar{y}-\mathbf{w}^T\bar{\mathbf{x}} \end{array}$

where ${\bar{y}}$ and ${\bar{\mathbf{x}}}$ now denote pseudo versions of the averages

$\displaystyle \begin{array}{rcl} \bar{y}&=&\frac{1}{n+\lambda_0}\mathbf{y}\mathbf{1}\\ \bar{\mathbf{x}}&=&\frac{1}{n+\lambda_0}\mathbf{X}\mathbf{1} \end{array}$

This result means that, before seeing the data, we assume there are some pseudo samples ${\{\mathbf{x},y\}}$ sitting at the origin, and that the number of these pseudo samples equals ${\lambda_0}$. Again, why would one make such an assumption? In my opinion, you should not, unless you really know what you are doing.
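The pseudo-sample reading is easy to check numerically when ${\lambda_0}$ is an integer (the toy numbers below are my own): appending ${\lambda_0}$ samples at the origin and taking an ordinary average reproduces the pseudo average above.

```python
import numpy as np

lam0 = 3                                      # an integer lambda0
y = np.array([1.0, 2.0, 3.0, 6.0])
n = y.size

pseudo_mean = y.sum() / (n + lam0)            # (1/(n+lambda0)) y 1
padded = np.concatenate([y, np.zeros(lam0)])  # lam0 pseudo samples at 0
assert np.isclose(pseudo_mean, padded.mean())
```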