In this post, I will discuss a seldom-documented trick for one of the simplest models: linear regression. I will show how to make the solution of ridge regression translation invariant, and what the bias term means. Some people might find it obvious. However, it is important for practical purposes, and later, when we deal with non-linear models, this trick will not be obvious at all.
In general, given data $\{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$, we are interested in fitting the following linear model:

$$y = w^\top x + b.$$
Most textbooks will tell you that we don't have to worry about the bias term $b$, since we can always augment the variables by adding an extra dimension, as $\tilde{x} = [x; 1]$ and $\tilde{w} = [w; b]$. Then we are only interested in the ridge regression problem

$$\min_{\tilde{w}} \sum_{i=1}^n \big(y_i - \tilde{w}^\top \tilde{x}_i\big)^2 + \lambda \|\tilde{w}\|^2, \tag{1}$$

whose closed-form solution is

$$\tilde{w} = \big(\tilde{X}^\top \tilde{X} + \lambda I\big)^{-1} \tilde{X}^\top y, \tag{2}$$

where the rows of $\tilde{X}$ are the augmented samples $\tilde{x}_i^\top$.
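As a concrete sketch of the augmentation trick in NumPy (the data and variable names here are my own illustration, not from any particular library):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # n = 100 samples, d = 3 features
y = X @ np.array([1.0, -2.0, 0.5]) + 3.0 + 0.1 * rng.normal(size=100)

# Augment each sample with a constant 1, so the bias becomes one more weight.
X_aug = np.hstack([X, np.ones((len(X), 1))])

lam = 1.0
d = X_aug.shape[1]
# Ridge solution on the augmented data: note that lam also penalizes the bias.
w_aug = np.linalg.solve(X_aug.T @ X_aug + lam * np.eye(d), X_aug.T @ y)
w, b = w_aug[:-1], w_aug[-1]
```

Note that the last component of `w_aug` plays the role of $b$, and the same $\lambda$ is applied to it.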
Nothing is new up to now. However, what the textbooks miss is that the bias term $b$ is actually important. Here we pursue another derivation to take a better look at what $b$ means. We can equivalently rewrite (1) as

$$\min_{w,\, b} \sum_{i=1}^n \big(y_i - w^\top x_i - b\big)^2 + \lambda \|w\|^2 + \lambda_b\, b^2. \tag{3}$$
What (1) really does is to let $\lambda_b = \lambda$. However, it actually makes little sense to use a non-zero $\lambda_b$, which puts a penalty on the bias $b$: for example, if we translate our data, say $x_i \to x_i + c$, the solution for $w$ will be different.
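To make the translation issue concrete, here is a small NumPy check (my own illustration): shifting every $x_i$ by a constant changes the ridge weights when the bias is penalized.

```python
import numpy as np

def ridge_aug(X, y, lam):
    """Ridge on [X, 1]-augmented data: the penalty also hits the bias b."""
    Xa = np.hstack([X, np.ones((len(X), 1))])
    wa = np.linalg.solve(Xa.T @ Xa + lam * np.eye(Xa.shape[1]), Xa.T @ y)
    return wa[:-1], wa[-1]

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = X @ np.array([2.0, -1.0]) + 5.0 + 0.1 * rng.normal(size=50)

w0, _ = ridge_aug(X, y, lam=10.0)
w1, _ = ridge_aug(X + 100.0, y, lam=10.0)   # translate every x_i by 100

print(np.allclose(w0, w1))   # False: the weights depend on where the data sits
```

The shifted problem would like the bias to absorb $-100\, w^\top \mathbf{1}$, but the penalty on $b$ forbids that, so $w$ itself gets distorted.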
To derive the solution, we first solve (3) with $\lambda_b = 0$ (details omitted here):

$$w = \Big(\sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^\top + \lambda I\Big)^{-1} \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}), \tag{4}$$

$$b = \bar{y} - w^\top \bar{x}, \tag{5}$$

where $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$ and $\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i$.
This solution means that we first center the data, $x_i - \bar{x}$ and $y_i - \bar{y}$, then regress the centered data in the usual way as in (2) to get $w$. Then we substitute $w$ back into (5) to get $b$. The bias term is just the average of the differences between your targets $y_i$ and your regression values $w^\top x_i$, since $b = \frac{1}{n}\sum_{i=1}^n (y_i - w^\top x_i)$. In this way, we have a translation-invariant solution for $w$, since we always center our data first.
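The centering recipe can be sketched as a few lines of NumPy (a minimal implementation of (4)–(5), with names of my choosing). Shifting the inputs now leaves $w$ unchanged; only $b$ moves to absorb the shift.

```python
import numpy as np

def ridge_centered(X, y, lam):
    """Ridge with an unpenalized bias: center, regress, recover b."""
    x_bar, y_bar = X.mean(axis=0), y.mean()
    Xc, yc = X - x_bar, y - y_bar                      # center the data
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    b = y_bar - w @ x_bar                              # average residual
    return w, b

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2))
y = X @ np.array([2.0, -1.0]) + 5.0 + 0.1 * rng.normal(size=50)

w0, b0 = ridge_centered(X, y, lam=10.0)
w1, b1 = ridge_centered(X + 100.0, y, lam=10.0)        # same translation as before

print(np.allclose(w0, w1))                        # True: w is translation invariant
print(np.isclose(b1, b0 - 100.0 * w0.sum()))      # True: b absorbs the shift
```

The centered matrix `Xc` is unchanged by any translation of `X`, which is exactly why $w$ cannot depend on where the data sits.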
It is also interesting to see that the general solution for keeping $\lambda_b$ non-zero is to solve (4) and (5) with the ordinary averages replaced by pseudo averages; in particular,

$$b = \frac{1}{n + \lambda_b} \sum_{i=1}^n \big(y_i - w^\top x_i\big). \tag{6}$$
Here we use a pseudo version of the average notation:

$$\hat{\bar{z}} = \frac{1}{n + \lambda_b} \sum_{i=1}^n z_i.$$
This result means that, before we see the data, we assume there are some pseudo samples sitting at the origin $0$; the number of these samples equals $\lambda_b$. Again, why would one make such an assumption? In my opinion, you should not, unless you really know what you are doing.
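One way to convince yourself of this reading (a sketch under my own naming, not code from the derivation above): solving the penalized-bias problem directly gives the same answer as running the *unpenalized*-bias solver on the data set literally augmented with $\lambda_b$ samples at the origin, whenever $\lambda_b$ is a whole number.

```python
import numpy as np

def ridge_sep(X, y, lam, lam_b):
    """Ridge with separate penalties: lam on the weights, lam_b on the bias."""
    Xa = np.hstack([X, np.ones((len(X), 1))])
    P = np.diag(np.r_[np.full(X.shape[1], lam), lam_b])
    wa = np.linalg.solve(Xa.T @ Xa + P, Xa.T @ y)
    return wa[:-1], wa[-1]

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 2))
y = X @ np.array([1.5, -0.5]) + 2.0 + 0.1 * rng.normal(size=40)

lam, lam_b = 1.0, 7          # an integer lam_b, so it can play "7 pseudo samples"
w1, b1 = ridge_sep(X, y, lam, lam_b)

# Append lam_b pseudo samples (x=0, y=0) and drop the bias penalty entirely.
X2 = np.vstack([X, np.zeros((lam_b, 2))])
y2 = np.r_[y, np.zeros(lam_b)]
w2, b2 = ridge_sep(X2, y2, lam, 0.0)

print(np.allclose(w1, w2) and np.isclose(b1, b2))   # True: identical solutions
```

The equivalence is exact because each pseudo sample at the origin contributes $(0 - w^\top 0 - b)^2 = b^2$ to the loss, so $\lambda_b$ of them reproduce the $\lambda_b b^2$ penalty.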