## Undocumented Machine Learning (II): Kernel Regression

Last time I talked about linear ridge regression. You might think the trick in last post is trivial. However, applying the trick for nonlinear regression won’t be so obvious any more.

It is natural to extend the linear ridge regression to a nonlinear one using the kernel trick. To use kernels, we assume a (nonlinear) feature map ${\phi_i=\phi(\mathbf{x}_i)}$ that map ${\mathbf{x}_i}$ from the input space to ${\phi_i}$ in feature space, and have the kernel function ${\kappa(\mathbf{x}_i,\mathbf{x}_j)=\left<\phi_i,\phi_j\right>}$ as the inner product in feature space.

To derive the kernel regression, let me repeat the textbook solution of ridge regression here

$\displaystyle \mathbf{w}=(\Phi\Phi^T+\lambda \mathbf{I})^{-1}\Phi\mathbf{y}^T$

0.1. Kernel Regression

An usual way that a textbook deriving the kernel regression is to apply the Woodbury identity. The solution of ridge regression can be rewritten as

$\displaystyle \mathbf{w}=( \Phi \Phi^T+\lambda \mathbf{I})^{-1} \Phi\mathbf{y}^T= \Phi( \Phi^T \Phi+\lambda \mathbf{I})^{-1}\mathbf{y}^T$

The prediction ${y}$ of future input ${\mathbf{x}}$ is

$\displaystyle \begin{array}{rcl} y&=&\mathbf{w}^T \phi\\ &=& \phi^T \Phi( \Phi^T \Phi+\lambda \mathbf{I})^{-1}\mathbf{y}^T\\ &=&\mathbf{k}(\mathbf{K}+\lambda \mathbf{I})^{-1}\mathbf{y}^T \end{array}$

where the elements of the vector ${\mathbf{k}}$ and matrix ${\mathbf{K}}$ are ${k_i=\kappa(\mathbf{x},\mathbf{x}_i)}$ and ${K_{ij}=\kappa(\mathbf{x}_i,\mathbf{x}_j)}$. We also can rewrite the solution as

$\displaystyle y=\sum_i\alpha_i\kappa(\mathbf{x},\mathbf{x}_i)$

where

$\displaystyle \alpha=(\mathbf{K}+\lambda \mathbf{I})^{-1}\mathbf{y}^T$

Again, up to now, nothing is new. And again this derivation does not take the bias term into account. One thing the textbook usually forgot to mention is how to model the bias in the kernel version. The variable augment trick in the linear case is not working here!

You might want to try to augment the input directly as ${\tilde{\mathbf{x}}=[1,\mathbf{x}^T]^T}$ and then compute kernel with ${\kappa(\tilde{\mathbf{x}}_i,\tilde{\mathbf{x}}_j)}$. By doing this, you are first augmenting the variable in the input space then map it to feature space. This is not right! Besides for some kernels that are not applied to vectors (for example, graph kernel, string kernel, etc), you cannot even augment the input. In principle, what you really should do is to augment the variable ${\phi}$ (instead of ${\mathbf{x}}$) in the feature space, which leads to a nonlinear model that is equivalent to the linear model in the feature space with bias term

$\displaystyle y=\mathbf{w}^T \phi+w_0$

These two methods are not equivalent for most of the kernels. The question then is how to incorporate the bias in a kernel formulation.

We can start with solution in the previous post for ridge regression that models the bias explicitly, then apply Woodbury identity as

$\displaystyle \begin{array}{rcl} \mathbf{w}=&(( \Phi-\bar{ \phi}\mathbf{1}^T)( \Phi-\bar{ \phi}\mathbf{1}^T)^T+\lambda \mathbf{I})^{-1}( \Phi-\bar{ \phi}\mathbf{1}^T)(\mathbf{y}-\bar{y}\mathbf{1}^T)^T\\ =&( \Phi-\bar{ \phi}\mathbf{1}^T)(( \Phi-\bar{ \phi}\mathbf{1}^T)^T( \Phi-\bar{ \phi}\mathbf{1}^T)+\lambda \mathbf{I})^{-1}(\mathbf{y}-\bar{y}\mathbf{1}^T)^T \end{array}$

With this, the prediction becomes

$\displaystyle \begin{array}{rcl} y&=&\mathbf{w}^T \phi+w_0=\mathbf{w}^T \phi+\bar{y}-\mathbf{w}^T\bar{ \phi}=\mathbf{w}^T( \phi-\bar{ \phi})+\bar{y}\\ &=&( \phi-\bar{ \phi})^T ( \Phi-\bar{ \phi}\mathbf{1}^T)(( \Phi-\bar{ \phi}\mathbf{1}^T)^T( \Phi-\bar{ \phi}\mathbf{1}^T)+\lambda I)^{-1}(\mathbf{y}-\bar{y}\mathbf{1}^T)^T+\bar{y} \end{array}$

which is pretty intense and still not in a form of kernel formulation. The trick to reduce this to simpler form is kernel centering (meaning, center the data in feature space using kernel trick).

0.2. Data centering in feature space

In general, we can get a new kernel ${\bar\kappa(\mathbf{x}_i,\mathbf{x}_j)}$ that computes the inner product in feature space for the data that are centered in kernel space as

$\displaystyle \begin{array}{rcl} \bar{\mathbf{K}}&=&( \Phi-\bar{ \phi}\mathbf{1}^T)^T( \Phi-\bar{ \phi}\mathbf{1}^T)\\ &=&( \Phi-\frac{1}{n} \Phi\mathbf{1}\mathbf{1}^T)^T( \Phi-\frac{1}{n} \Phi\mathbf{1}\mathbf{1}^T)\\ &=&(\mathbf{I}-\frac{1}{n}\mathbf{1}\mathbf{1}^T) \Phi^T \Phi(\mathbf{I}-\frac{1}{n}\mathbf{1}\mathbf{1}^T)\\ &=&\mathbf{H}\mathbf{K}\mathbf{H} \end{array}$

where

$\displaystyle \mathbf{H}=\mathbf{I}-\frac{1}{n}\mathbf{1}\mathbf{1}^T$

is called center matrix. Or you can expand the new kernel as

$\displaystyle \bar{\mathbf{K}}=\mathbf{K}-\frac{1}{n}\mathbf{1}\mathbf{1}^T\mathbf{K}-\frac{1}{n}\mathbf{K}\mathbf{1}\mathbf{1}^T+\frac{1}{n}\mathbf{1}\mathbf{1}^T\mathbf{K}\mathbf{1}\mathbf{1}^T\frac{1}{n}$

which means

$\displaystyle \begin{array}{rcl} \bar\kappa(\mathbf{x}_i,\mathbf{x}_j)&=&\kappa(\mathbf{x}_i,\mathbf{x}_j)-\frac{1}{n}\sum_k\kappa(\mathbf{x}_i,\mathbf{x}_k)\\ &&-\frac{1}{n}\sum_k\kappa(\mathbf{x}_j,\mathbf{x}_k)+\frac{1}{n}\sum_i\frac{1}{n}\sum_j\kappa(\mathbf{x}_i,\mathbf{x}_j) \end{array}$

To compute ${\bar{\mathbf{k}}}$, whose elements are ${\bar{k}_i=\bar\kappa(\mathbf{x},\mathbf{x}_i)}$ where ${\mathbf{x}}$ is a future sample that have not been seen in the training data, we can compute

$\displaystyle \begin{array}{rcl} \bar{\mathbf{k}}&=& (\phi-\bar{ \phi})^T( \Phi-\bar{ \phi}\mathbf{1}^T)\\ &=&( \phi-\frac{1}{n} \Phi\mathbf{1})^T \Phi\mathbf{H}\\ &=&( \phi^T \Phi-\frac{1}{n}\mathbf{1}^T \Phi^T \Phi)\mathbf{H}\\ &=&(\mathbf{k}^T-\frac{1}{n}\mathbf{1}^T\mathbf{K})\mathbf{H} \end{array}$

This expands to

$\displaystyle \begin{array}{rcl} \bar\kappa(\mathbf{x},\mathbf{x}_i)&=&\kappa(\mathbf{x},\mathbf{x}_i)-\frac{1}{n}\sum_k\kappa(\mathbf{x},\mathbf{x}_k)\\ &&-\frac{1}{n}\sum_k\kappa(\mathbf{x}_i,\mathbf{x}_k)+\frac{1}{n}\sum_i\frac{1}{n}\sum_j\kappa(\mathbf{x}_i,\mathbf{x}_j) \end{array}$

Update:

In general, computing the inner product of two vectors ${\phi}$ and ${\phi'}$ in a feature space, of which the origin is the center of sample set ${\Phi}$, is

$\displaystyle \begin{array}{rcl} \bar{\kappa}(\mathbf{x},\mathbf{x}')&=&\left<\phi-\bar{\phi},\phi'-\bar\phi\right>=\left<\phi-\frac{1}{n}\sum_i\phi_i,\phi'-\frac{1}{n}\sum_i\phi_i\right>\\ &=&\kappa(\mathbf{x},\mathbf{x}')-\frac{1}{n}\sum_i\kappa(\mathbf{x},\mathbf{x}_i)\\ &&-\frac{1}{n}\sum_i\kappa(\mathbf{x}',\mathbf{x}_i)+\frac{1}{n}\sum_i\frac{1}{n}\sum_j\kappa(\mathbf{x}_i,\mathbf{x}_j) \end{array}$

With these results, we can simplify the solution as

$\displaystyle y=\bar{\mathbf{k}}(\bar{\mathbf{K}}+\lambda\mathbf{I})^{-1}(\mathbf{y}-\bar{y}\mathbf{1}^T)^T+\bar{y}$

which can also be rewrite as

$\displaystyle y=\sum_i\alpha_i\bar{\kappa}(\mathbf{x},\mathbf{x}_i)+\alpha_0$

where ${\alpha_0=\bar{y}}$ and

$\displaystyle \alpha=(\bar{\mathbf{K}}+\lambda \mathbf{I})^{-1}(\mathbf{y}-\bar{y}\mathbf{1}^T)^T$

The centering trick actually is mentioned by the Kernel PCA paper. However, nobody seems to use it in kernel regression as far as I have seen. Another trick I have never seen and found out by myself is that you actually can simply augment the kernel as

$\displaystyle \tilde{\kappa}(\mathbf{x}_i,\mathbf{x}_j)=\kappa(\mathbf{x}_i,\mathbf{x}_j)+1$

and then directly apply

$\displaystyle y=\sum_i\alpha_i\tilde\kappa(\mathbf{x},\mathbf{x}_i)$

where

$\displaystyle \alpha=(\tilde{\mathbf{K}}+\lambda \mathbf{I})^{-1}\mathbf{y}^T$

The reason is that if you directly augment the variable in feature space as ${\tilde\phi=[1,\phi^T]^T}$, and then compute the inner product with augmented variable, you get

$\displaystyle \tilde\kappa(\mathbf{x}_i,\mathbf{x}_j)=\left<\tilde\phi_i,\tilde\phi_j\right>=\left<\phi_i,\phi_j\right>+1=\kappa(\mathbf{x}_i,\mathbf{x}_j)+1$

Using this trick you don’t have to bother with the centering problem. Although this method is simpler to use, it is actually not solving the exactly same problem as center kernel version shown above. The augmented kernel solution solves

$\displaystyle \|\mathbf{y}-(\mathbf{w}^T\Phi+w_0\mathbf{1}^T)\|^2+\lambda\|\mathbf{w}\|^2+\lambda_0w_0^2$

where ${\lambda_0=\lambda}$. Whereas the center kernel solution solves ${\lambda_0=0}$ which is more reasonable as indicated in previous post.

0.3. Regularization Effect

One might wonder what optimization problem the solutions corresponding to. Let’s take the center kernel solution as example. By substituting all those quantities into the ridge regression target we have

$\displaystyle \min_{\alpha}~\sum_i\|y_i-\sum_{j}\alpha_j\bar{\kappa}(\mathbf{x}_i,\mathbf{x}_j)-\alpha_0\|^2+\lambda\sum_i\sum_j\alpha_i\alpha_j\bar{\kappa}(\mathbf{x}_i,\mathbf{x}_j)$

From which we can see that in order to control the model complexity, we have to control dual form of the norm of ${\mathbf{w}}$ in the feature space. That means we have to control the norm of the kernel. This is expected, since the kernel itself will make our function arbitrarily complex if we do not control it.

Some literatures might tell you that kernel regression is to solve

$\displaystyle \min_{\alpha}~\sum_i\|y_i-\sum_{j}\alpha_j\kappa(\mathbf{x}_i,\mathbf{x}_j)\|^2+\lambda\|\alpha\|$

This is wrong! You cannot build the duality relationship with linear ridge regression out of this formulation even without considering the centering effect. And with this formulation, you won’t be able to control the model complexity that the nonlinear kernel brings in.

BTW, who knows how to use bold Greek symbol in wordpress？ I usually use the bm package in Latex, which seems not working here.