In this post, I will discuss a seldom-documented trick for one of the simplest models: linear regression. I will show how to make the solution of ridge regression translation invariant, and what the bias term really means. Some people might find it obvious. However, it is important for practical purposes, and later, when we deal with non-linear models, this trick will not be obvious at all.

In general, given data $(x_i, y_i)$, $i = 1, \dots, n$, with $x_i \in \mathbb{R}^d$, we are interested in fitting the following linear model

$$y = w^\top x + b.$$
Most textbooks will tell you that we don't have to worry about the bias term $b$, since we can always augment the variables by adding an extra dimension: $\tilde{x} = [x; 1]$ and $\tilde{w} = [w; b]$. Then we are only interested in

$$y = \tilde{w}^\top \tilde{x}$$
with the augmented variables. Your textbook will tell you that to fit the linear model we minimize

$$\min_{\tilde{w}} \sum_{i=1}^{n} \left( y_i - \tilde{w}^\top \tilde{x}_i \right)^2 + \lambda \|\tilde{w}\|^2, \tag{1}$$
which is called ridge regression. The solution is simply

$$\tilde{w} = \left( \tilde{X}^\top \tilde{X} + \lambda I \right)^{-1} \tilde{X}^\top y, \tag{2}$$

where $\tilde{X}$ stacks the $\tilde{x}_i^\top$ as rows and $y = (y_1, \dots, y_n)^\top$.
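The closed-form solution (2) can be sketched in a few lines of NumPy. This is a minimal sketch; `ridge_augmented` and all variable names are my own choices, not from any library:

```python
import numpy as np

def ridge_augmented(X, y, lam):
    """Ridge solution with the bias folded into the weights.

    X: (n, d) data matrix, y: (n,) targets, lam: penalty strength.
    Returns the augmented weight vector [w; b] of length d + 1.
    """
    n = X.shape[0]
    X_aug = np.hstack([X, np.ones((n, 1))])  # append the constant-1 feature
    d1 = X_aug.shape[1]
    # Closed-form solution (2); note the penalty also hits the bias entry.
    return np.linalg.solve(X_aug.T @ X_aug + lam * np.eye(d1), X_aug.T @ y)
```

With `lam = 0` this reduces to ordinary least squares on the augmented variables.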
Nothing is new up to now. However, what the textbooks miss, the bias term $b$, is actually important. Here we pursue another derivation to take a better look at what $b$ means. We can equivalently rewrite (1) as

$$\min_{w, b} \sum_{i=1}^{n} \left( y_i - w^\top x_i - b \right)^2 + \lambda \|w\|^2 + \lambda_b b^2. \tag{3}$$
What (1) really does is to set $\lambda_b = \lambda$. However, it actually makes little sense to use a non-zero $\lambda_b$, which puts a penalty on the bias $b$. For example, if we translate our data, say $y_i \to y_i + c$, the solution of $w$ will be different.
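This sensitivity is easy to check numerically. Below is a small experiment on toy data of my own making: fitting the penalized-bias form (2) on $y$ and on $y + 10$ gives two different slopes.

```python
import numpy as np

def ridge_augmented(X, y, lam):
    # Closed-form ridge with the bias folded into the weights,
    # so the penalty also shrinks the bias term.
    n = X.shape[0]
    X_aug = np.hstack([X, np.ones((n, 1))])
    d1 = X_aug.shape[1]
    return np.linalg.solve(X_aug.T @ X_aug + lam * np.eye(d1), X_aug.T @ y)

X = np.array([[0.], [1.], [2.], [3.]])
y = np.array([0., 1., 2., 3.])

w1 = ridge_augmented(X, y, lam=1.0)[0]         # slope fitted on y
w2 = ridge_augmented(X, y + 10.0, lam=1.0)[0]  # slope fitted on shifted y
print(w1, w2)  # the two slopes disagree: the fit is not translation invariant
```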

If we want our solution to be invariant w.r.t. translation, we should set $\lambda_b = 0$, which means we should minimize

$$\min_{w, b} \sum_{i=1}^{n} \left( y_i - w^\top x_i - b \right)^2 + \lambda \|w\|^2. \tag{4}$$
To derive the solution for this, we first solve (details omitted here)

$$\frac{\partial}{\partial b} \sum_{i=1}^{n} \left( y_i - w^\top x_i - b \right)^2 = 0$$
to get

$$b = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - w^\top x_i \right) = \bar{y} - w^\top \bar{x}, \tag{5}$$

where $\bar{x} = \frac{1}{n} \sum_i x_i$ and $\bar{y} = \frac{1}{n} \sum_i y_i$.
Substituting (5) back into the target (4), we have

$$\min_{w} \sum_{i=1}^{n} \left( (y_i - \bar{y}) - w^\top (x_i - \bar{x}) \right)^2 + \lambda \|w\|^2. \tag{6}$$
The solution for $w$ is then

$$w = \left( X_c^\top X_c + \lambda I \right)^{-1} X_c^\top y_c, \tag{7}$$

where $X_c$ stacks the centered rows $(x_i - \bar{x})^\top$ and $y_c = (y_1 - \bar{y}, \dots, y_n - \bar{y})^\top$.
This solution means that we first center the data, $x_i - \bar{x}$ and $y_i - \bar{y}$, then regress the centered data in the usual way as in (2) to get $w$. Then we substitute $w$ back into (5) to get $b$. The bias term is just the average of the differences between your targets $y_i$ and your regression values $w^\top x_i$. In this way, we have a translation-invariant solution for $w$, since we always center our data first.
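The centering recipe can be sketched directly. Again, `ridge_centered` and the variable names are my own; the toy data in the usage below is made up to illustrate the invariance:

```python
import numpy as np

def ridge_centered(X, y, lam):
    """Ridge with an unpenalized bias: center, regress, recover b."""
    x_bar, y_bar = X.mean(axis=0), y.mean()
    Xc, yc = X - x_bar, y - y_bar            # center the data first
    d = X.shape[1]
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(d), Xc.T @ yc)
    b = y_bar - w @ x_bar                    # bias = average residual of the linear part
    return w, b

# Translating the data changes b but leaves w untouched:
X = np.array([[0.], [1.], [2.], [3.]])
y = np.array([0., 1., 2., 3.])
w1, b1 = ridge_centered(X, y, 1.0)
w2, b2 = ridge_centered(X + 5.0, y + 10.0, 1.0)
```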

It is also interesting to see that the general solution for keeping $\lambda_b$ non-zero is to solve

$$\frac{\partial}{\partial b} \left( \sum_{i=1}^{n} \left( y_i - w^\top x_i - b \right)^2 + \lambda_b b^2 \right) = 0$$
to get

$$b = \frac{1}{n + \lambda_b} \sum_{i=1}^{n} \left( y_i - w^\top x_i \right) = \left\langle y - w^\top x \right\rangle_{\lambda_b}. \tag{8}$$
Here we use a pseudo version of the average notation:

$$\left\langle z \right\rangle_{\lambda} = \frac{1}{n + \lambda} \sum_{i=1}^{n} z_i.$$
This result means that, before we see the data, we assume there are some pseudo samples sitting at the origin $0$, and the number of these pseudo samples equals $\lambda_b$. Again, why would one make such an assumption? In my opinion, you should not, unless you really know what you are doing.
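The pseudo-average reading of the bias is a one-liner. A minimal sketch (`bias_pseudo` is my own name): the residuals are averaged as if $\lambda_b$ extra samples with residual zero were mixed in.

```python
import numpy as np

def bias_pseudo(X, y, w, lam_b):
    """Bias under a penalized intercept: residuals averaged over n + lam_b."""
    resid = y - X @ w                  # residuals of the linear part
    n = len(y)
    # lam_b "pseudo samples" at the origin contribute zero residual
    # but still count in the denominator, shrinking b toward 0.
    return resid.sum() / (n + lam_b)
```

With `lam_b = 0` this recovers the plain average residual, i.e. equation (5).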
