Last time I talked about linear ridge regression. You might think the trick in the last post is trivial. However, applying that trick to nonlinear regression is not so obvious anymore.

It is natural to extend linear ridge regression to a nonlinear one using the kernel trick. To use kernels, we assume a (nonlinear) feature map $\phi$ that maps from the input space to a feature space, and define the kernel function as the inner product in that feature space: $k(x, x') = \langle \phi(x), \phi(x') \rangle$.

To derive kernel regression, let me repeat the textbook solution of ridge regression here:

$$ w = (X^\top X + \lambda I)^{-1} X^\top y. $$

**0.1. Kernel Regression**

The usual way a textbook derives kernel regression is to apply the Woodbury identity. With $\Phi$ denoting the matrix whose rows are the mapped samples $\phi(x_i)^\top$, the solution of ridge regression can be rewritten as

$$ w = (\Phi^\top \Phi + \lambda I)^{-1} \Phi^\top y = \Phi^\top (\Phi \Phi^\top + \lambda I)^{-1} y = \Phi^\top (K + \lambda I)^{-1} y. $$

The prediction for a future input $x_*$ is

$$ f(x_*) = \phi(x_*)^\top w = k_*^\top (K + \lambda I)^{-1} y, $$

where the elements of the vector $k_*$ and the matrix $K$ are $(k_*)_i = k(x_*, x_i)$ and $K_{ij} = k(x_i, x_j)$. We can also rewrite the solution as

$$ f(x_*) = k_*^\top \alpha, $$

where

$$ \alpha = (K + \lambda I)^{-1} y. $$
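To make this concrete, here is a minimal NumPy sketch of the plain (bias-free) kernel ridge solution above, using an RBF kernel. The function names and hyperparameter values are just for illustration, not from any particular library:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # k(a, b) = exp(-gamma * ||a - b||^2), evaluated for all pairs of rows
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def fit_kernel_ridge(X, y, lam, gamma):
    # alpha = (K + lam * I)^{-1} y
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict_kernel_ridge(X_train, alpha, X_new, gamma):
    # f(x_*) = k_*^T alpha
    return rbf_kernel(X_new, X_train, gamma) @ alpha

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1))
y = np.sin(3 * X[:, 0])

alpha = fit_kernel_ridge(X, y, lam=1e-3, gamma=10.0)
pred = predict_kernel_ridge(X, alpha, X, gamma=10.0)
```

With a small $\lambda$, the fitted function nearly interpolates the training targets, as expected from $f = K(K + \lambda I)^{-1} y$.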

Again, up to now, nothing is new. And again, this derivation does not take the bias term into account. One thing textbooks usually forget to mention is how to model the bias in the kernel version. The variable-augmentation trick from the linear case does not work here!

You might want to try augmenting the input directly as $\hat{x} = [x^\top, 1]^\top$ and then computing the kernel with $k(\hat{x}, \hat{x}')$. By doing this, you first augment the variable in the input space and then map it to the feature space. This is not right! Besides, for kernels that are not applied to vectors (for example, graph kernels, string kernels, etc.), you cannot even augment the input. In principle, what you really should do is augment the variable $\phi(x)$ (instead of $x$) in the feature space, which leads to a nonlinear model that is equivalent to the linear model in the feature space with a bias term:

$$ f(x) = w^\top \phi(x) + b. $$

These two methods are not equivalent for most kernels. The question, then, is how to incorporate the bias into a kernel formulation.

We can start from the solution in the previous post for ridge regression that models the bias explicitly, and then apply the Woodbury identity:

$$ w = (\Phi^\top H \Phi + \lambda I)^{-1} \Phi^\top H y = \Phi^\top H (H K H + \lambda I)^{-1} H y, \qquad b = \bar{y} - \bar{\phi}^\top w, $$

where $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$, $\bar{y} = \frac{1}{n}\mathbf{1}^\top y$, and $\bar{\phi} = \frac{1}{n}\Phi^\top \mathbf{1}$.

With this, the prediction becomes

$$ f(x_*) = \phi(x_*)^\top w + b = \Big(k_* - \frac{1}{n} K \mathbf{1}\Big)^\top H \big(H K H + \lambda I\big)^{-1} H y + \bar{y}, $$

which is pretty intense and still not in the form of a kernel formulation. The trick that reduces this to a simpler form is kernel centering (that is, centering the data in feature space using the kernel trick).

**0.2. Data centering in feature space**

In general, we can get a new kernel that computes the inner product in feature space for data that are centered in the feature space as

$$ \tilde{K} = H K H, $$

where

$$ H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top $$

is called the centering matrix. Or you can expand the new kernel as

$$ \tilde{K} = K - \frac{1}{n}\mathbf{1}\mathbf{1}^\top K - \frac{1}{n} K \mathbf{1}\mathbf{1}^\top + \frac{1}{n^2} \mathbf{1}\mathbf{1}^\top K \mathbf{1}\mathbf{1}^\top, $$

which means

$$ \tilde{K}_{ij} = k(x_i, x_j) - \frac{1}{n}\sum_l k(x_i, x_l) - \frac{1}{n}\sum_l k(x_l, x_j) + \frac{1}{n^2}\sum_{l,m} k(x_l, x_m). $$

To compute $\tilde{k}_*$, whose elements are $(\tilde{k}_*)_i = \big\langle \phi(x_*) - \bar{\phi},\, \phi(x_i) - \bar{\phi} \big\rangle$, where $x_*$ is a future sample that has not been seen in the training data, we can compute

$$ \tilde{k}_* = H \Big( k_* - \frac{1}{n} K \mathbf{1} \Big). $$

This expands to

$$ (\tilde{k}_*)_i = k(x_*, x_i) - \frac{1}{n}\sum_l k(x_*, x_l) - \frac{1}{n}\sum_l k(x_l, x_i) + \frac{1}{n^2}\sum_{l,m} k(x_l, x_m). $$
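A quick way to sanity-check these centering formulas is the linear kernel, where the feature map is the identity, so centering in feature space can also be done explicitly. A small NumPy sketch (variable names are my own):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))    # training samples as rows
x_new = rng.normal(size=(3,))   # an unseen test sample

n = len(X)
H = np.eye(n) - np.ones((n, n)) / n   # centering matrix H = I - (1/n) 1 1^T

# Linear kernel: the feature map is the identity, so we can center explicitly.
K = X @ X.T
K_tilde = H @ K @ H                   # centered kernel matrix HKH
Xc = X - X.mean(axis=0)               # explicit centering in "feature" space
K_explicit = Xc @ Xc.T

# Centered kernel vector for the unseen sample: H (k_* - K 1 / n)
k_star = X @ x_new
k_star_tilde = H @ (k_star - K.mean(axis=1))
k_star_explicit = Xc @ (x_new - X.mean(axis=0))
```

Both `K_tilde` and `k_star_tilde` match the inner products of the explicitly centered features.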

*Update:*

In general, the inner product of two vectors $x$ and $y$ in a feature space whose origin is the center of the sample set $\{x_1, \dots, x_n\}$ is

$$ \tilde{k}(x, y) = k(x, y) - \frac{1}{n}\sum_l k(x, x_l) - \frac{1}{n}\sum_l k(y, x_l) + \frac{1}{n^2}\sum_{l,m} k(x_l, x_m). $$

With these results, we can simplify the solution as

$$ f(x_*) = \tilde{k}_*^\top (\tilde{K} + \lambda I)^{-1} H y + \bar{y}, $$

which can also be rewritten as

$$ f(x_*) = \tilde{k}_*^\top \alpha + \bar{y}, $$

where $\alpha = (\tilde{K} + \lambda I)^{-1} \tilde{y}$ and $\tilde{y} = H y$.
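Putting the pieces together, here is a NumPy sketch of centered kernel ridge regression with the bias handled as above. Using the linear kernel lets us cross-check against the primal ridge solution with an explicit (unregularized) bias; all function names are illustrative:

```python
import numpy as np

def fit_centered_kernel_ridge(K, y, lam):
    # alpha = (HKH + lam I)^{-1} H y; the bias contribution is the mean of y
    n = len(y)
    H = np.eye(n) - np.ones((n, n)) / n
    K_tilde = H @ K @ H
    alpha = np.linalg.solve(K_tilde + lam * np.eye(n), H @ y)
    return alpha, y.mean()

def predict_centered_kernel_ridge(K, k_star, alpha, y_mean):
    # f(x_*) = k~_*^T alpha + mean(y), with k~_* = H (k_* - K 1 / n)
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    k_tilde = H @ (k_star - K.mean(axis=1))
    return k_tilde @ alpha + y_mean

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 2))
y = X @ np.array([1.0, -2.0]) + 3.0 + 0.01 * rng.normal(size=30)
lam = 0.5

# Kernelized solution with a linear kernel ...
K = X @ X.T
alpha, b = fit_centered_kernel_ridge(K, y, lam)
pred_kernel = np.array(
    [predict_centered_kernel_ridge(K, X @ x, alpha, b) for x in X]
)

# ... must match the primal ridge solution with an unregularized explicit bias.
Xc = X - X.mean(axis=0)
w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(2), Xc.T @ (y - y.mean()))
pred_primal = X @ w + (y.mean() - X.mean(axis=0) @ w)
```

The two prediction vectors agree, confirming that the centered-kernel solution is the dual of ridge regression with an explicit, unregularized bias.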

The centering trick is actually mentioned in the kernel PCA paper. However, nobody seems to use it in kernel regression, as far as I have seen. Another trick, which I have never seen elsewhere and found out by myself, is that you can actually simply augment the kernel as

$$ K' = K + \mathbf{1}\mathbf{1}^\top, $$

and then directly apply

$$ f(x_*) = k_*'^\top (K' + \lambda I)^{-1} y, $$

where

$$ K'_{ij} = k(x_i, x_j) + 1, \qquad (k_*')_i = k(x_*, x_i) + 1. $$

The reason is that if you directly augment the variable in the feature space as $[\phi(x)^\top, 1]^\top$, and then compute the inner product of the augmented variables, you get

$$ \big\langle [\phi(x)^\top, 1]^\top,\ [\phi(x')^\top, 1]^\top \big\rangle = k(x, x') + 1. $$

Using this trick you don't have to bother with the centering problem. Although this method is simpler to use, it does not solve exactly the same problem as the centered-kernel version shown above. The augmented-kernel solution solves

$$ \min_{w, b} \; \| y - \Phi w - b\mathbf{1} \|^2 + \lambda \big( \|w\|^2 + b^2 \big), $$

where the bias $b$ (the last component of the augmented weight vector $[w^\top, b]^\top$) is regularized together with $w$. The centered-kernel solution instead solves

$$ \min_{w, b} \; \| y - \Phi w - b\mathbf{1} \|^2 + \lambda \|w\|^2, $$

which is more reasonable, as indicated in the previous post.
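The augmented-kernel version is even shorter to implement. The sketch below (again with the linear kernel, so the primal problem is available for comparison) checks that it matches ridge regression on the augmented input $[x^\top, 1]^\top$, where the bias is regularized together with $w$:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(25, 2))
y = X @ np.array([0.5, 1.5]) - 2.0 + 0.01 * rng.normal(size=25)
lam = 1.0

# Augmented-kernel solution: K' = K + 1 1^T, predict with k'_* = k_* + 1.
K = X @ X.T
K_aug = K + 1.0   # adds 1 to every entry, i.e. K + 1 1^T
alpha = np.linalg.solve(K_aug + lam * np.eye(len(y)), y)
pred_kernel = K_aug @ alpha   # rows of K_aug are k'_* for the training points

# Primal equivalent: ridge on [x; 1], so the bias is regularized too.
X_aug = np.hstack([X, np.ones((len(y), 1))])
w_aug = np.linalg.solve(X_aug.T @ X_aug + lam * np.eye(3), X_aug.T @ y)
pred_primal = X_aug @ w_aug
```

Because $\lambda$ here also shrinks $b$, the fitted bias differs slightly from the centered-kernel version whenever the targets are far from zero mean.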

**0.3. Regularization Effect**

One might wonder what optimization problems these solutions correspond to. Let's take the centered-kernel solution as an example. By substituting all those quantities into the ridge regression objective, we have

$$ \min_{\alpha} \; \| \tilde{y} - \tilde{K} \alpha \|^2 + \lambda\, \alpha^\top \tilde{K} \alpha. $$

From this we can see that, in order to control the model complexity, we have to control the dual form of the norm of $w$ in the feature space, $\|w\|^2 = \alpha^\top \tilde{K} \alpha$. That means the regularizer involves the kernel matrix itself. This is expected, since the kernel can make our function arbitrarily complex if we do not control it.

Some literature might tell you that kernel regression solves

$$ \min_{\alpha} \; \| y - K \alpha \|^2 + \lambda \|\alpha\|^2. $$

This is wrong! You cannot build the duality relationship with linear ridge regression out of this formulation, even without considering the centering effect. And with this formulation, you will not be able to control the model complexity that the nonlinear kernel brings in.
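You can verify numerically that the two regularizers give different solutions: $\min_\alpha \|y - K\alpha\|^2 + \lambda\,\alpha^\top K \alpha$ is solved by $\alpha = (K + \lambda I)^{-1} y$, while $\min_\alpha \|y - K\alpha\|^2 + \lambda\|\alpha\|^2$ is solved by $\alpha = (K^2 + \lambda I)^{-1} K y$ (for symmetric $K$). A small sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(15, 1))
y = np.sin(X[:, 0])
lam = 0.1

# RBF kernel matrix on the training inputs
sq_dists = (X[:, None, 0] - X[None, :, 0]) ** 2
K = np.exp(-sq_dists)

# Correct dual objective:  ||y - K a||^2 + lam * a^T K a  ->  a = (K + lam I)^{-1} y
a_dual = np.linalg.solve(K + lam * np.eye(15), y)

# Naive objective:  ||y - K a||^2 + lam * ||a||^2  ->  a = (K^2 + lam I)^{-1} K y
a_naive = np.linalg.solve(K @ K + lam * np.eye(15), K @ y)
```

The two coefficient vectors coincide only in the degenerate case $K = I$; for any interesting kernel they are genuinely different estimators.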

BTW, does anyone know how to use bold Greek symbols in WordPress? I usually use the bm package in LaTeX, which does not seem to work here.