## Undocumented Machine Learning (III): From Probabilistic Regression to Gaussian Process

In this post, I will show how directly derive the Gaussian process formulation of nonlinear regression from probabilistic linear regression model. Doing this, some common misunderstanding of GP will be clear.

First, for reference purpose, let’s write down the probabilistic formulation for linear regression.

0.1. Probabilistic Linear Regression

For a linear model

$\displaystyle y=\mathbf{w}^T\mathbf{x}+\epsilon$

where the noise ${\epsilon}$ follows ${\mathcal{N}(\epsilon|0,\beta^{-1})}$, the likelihood is

$\displaystyle p(y|\mathbf{w},\mathbf{x})=\mathcal{N}(y|\mathbf{w}^T\mathbf{x},\beta^{-1})$

Assuming the priors is

$\displaystyle p(\mathbf{w})=\mathcal{N}(\mathbf{w}|0,\alpha^{-1}\mathbf{I})$

The posterior then is

$\displaystyle p(\mathbf{w}|\mathbf{y},\mathbf{X})=\mathcal{N}(\mathbf{w}|\mathbf{m}_n,\mathbf{S}_n)$

where

$\displaystyle \begin{array}{rcl} \mathbf{m}_n&=&\beta\mathbf{S}_n\mathbf{X}\mathbf{y}^T\\ \mathbf{S}_n&=&(\alpha \mathbf{I}+\beta\mathbf{X}\mathbf{X}^T)^{-1} \end{array}$

Given new samle ${\mathbf{x}}$, the predictive distribution of a value ${y}$ is

$\displaystyle \begin{array}{rcl} p(y|\mathbf{y},\mathbf{x},\mathbf{X})&=&\int p(y|\mathbf{x},\mathbf{w})p(\mathbf{w}|\mathbf{y},\mathbf{X})d\mathbf{w}\\ &=&\mathcal{N}(y|m(\mathbf{x}),\sigma^2(\mathbf{x})) \end{array}$

where

$\displaystyle \begin{array}{rcl} m(\mathbf{x})&=&\mathbf{m}_n^T\mathbf{x}=(\beta^{-1}\alpha\mathbf{I}+\mathbf{X}\mathbf{X})^{-1}\mathbf{X}\mathbf{y}^T\\ \sigma^2(x)&=&\beta^{-1}+\mathbf{x}^T\mathbf{S}_n\mathbf{x} \end{array}$

These are textbook illustration which can be found in PRML.

To deal with the bias term ${w_o}$ for a linear model

$\displaystyle p(\mathbf{y}|\mathbf{X},\mathbf{w},w_0)=\mathcal{N}(\mathbf{w}^T\mathbf{X}+w_0\mathbf{I},\beta^{-1} \mathbf{I})$

one way is to use the augment variable trick ${\tilde{\mathbf{x}}=[1,\mathbf{x}^T]^T}$. Then applying above formulation is equivalent to use a prior ${p(w_0|0,\alpha_0)}$ with ${\alpha_0=\alpha}$ for ${w_0}$. It does not make much sense to do so. As indicated before, By incorporating this prior, you lose the translation invariant property. To fix this, we can use non-informative prior for ${w_0}$, where we simply make ${\alpha_0=0}$. Then we can derive the posterior

$\displaystyle$

0.2. Gaussian Process

Textbooks often directly present you the Gaussian process, then show the predictive distribution of GP is actually equivalent to linear regression when linear kernel used. However, since there is equivalence, we should be able to derive from one to another. Usually your textbook does not tell you how.

Here is the PRML way to present the GP (where we switch the symbols for consistency sake)

$\displaystyle p(\mathbf{y}|\mathbf{t})=\mathcal{N}(\mathbf{y}|\mathbf{t},\beta^{-1}\mathbf{I})$

where

$\displaystyle p(\mathbf{t})=\mathcal{N}(\mathbf{t}|0,\mathbf{K}) \ \ \ \ \ (1)$

Then marginalizing out ${\mathbf{t}}$ you get

$\displaystyle p(\mathbf{y})=\int p(\mathbf{y}|\mathbf{t})p(\mathbf{t})d\mathbf{t}=\mathcal{N}(\mathbf{y}|0,\mathbf{K}+\beta^{-1}\mathbf{I}) \ \ \ \ \ (2)$

Then inference becomes

$\displaystyle p(y|\mathbf{y})=\frac{p([y,\mathbf{y}])}{p(\mathbf{y})}=\mathcal{N}(y|m(\mathbf{x}),\sigma^2(\mathbf{x})) \ \ \ \ \ (3)$

where

$\displaystyle \begin{array}{rcl} m(\mathbf{x})&=&\mathbf{k}^T(\mathbf{K}+\beta^{-1}\mathbf{I})^{-1}\mathbf{t}\\ \sigma^2(\mathbf{x})&=&c+\mathbf{k}^T(\mathbf{K}+\beta^{-1}\mathbf{I})^{-1}\mathbf{k} \end{array}$

where ${c=\kappa(\mathbf{x},\mathbf{x})+\beta^{-1}}$. However, this argument is not strictly correct, since the kernel matrix ${\mathbf{K}}$ will almost certainly be singular. Therefore the distribution (1) will be ill-defined. Actually, we can derive the Gaussian process from linear regression without involving the ill-defined distribution. In Bayesian inference, we all know that given likelihood ${p(\mathbf{X}|\theta)}$, we assume prior ${p(\theta)}$ to derive the posterior ${p(\theta|\mathbf{X})}$, then inference by marginalize the parameter

$\displaystyle p(\mathbf{x}|\mathbf{X})=\int p(\mathbf{x}|\theta)p(\theta|\mathbf{X})d\theta$

Actually, there is another way that we can first marginalize parameter ${\theta}$ by

$\displaystyle p(\mathbf{X})=\int p(\mathbf{X}|\theta)p(\theta)\theta$

By integrate out the parameter ${\theta}$, the samples are no more independent, this is called explain away effect. Then we can do inference by

$\displaystyle p(\mathbf{x}|\mathbf{X})=\frac{p([\mathbf{x},\mathbf{X}])}{p(\mathbf{X})}$

To derive GP formulation of regression from linear regression, one way is to first marginalize out ${\mathbf{w}}$ from ${p(\mathbf{y}|\mathbf{X},\mathbf{w})=\mathcal{N}(\mathbf{y}|\mathbf{w}^T\mathbf{X},\beta^{-1} \mathbf{I})}$, the marginal likelihood is

$\displaystyle \begin{array}{rcl} p(\mathbf{y}|\mathbf{X})&=&\int p(\mathbf{y}|\mathbf{X},\mathbf{w})p(\mathbf{w})d\mathbf{w}\\ &=&\int \mathcal{N}(\mathbf{y}|\mathbf{w}^T\mathbf{X},\beta^{-1} \mathbf{I})\mathcal{N}(\mathbf{w}|0,\alpha^{-1} \mathbf{I})d\mathbf{w}\\ &=&\mathcal{N}(\mathbf{y}|0,\alpha^{-1}\mathbf{X}^T\mathbf{X}+\beta^{-1}\mathbf{I}) \end{array}$

This marginal likelihood is equivalent to (2). Therefore you can see, in order to establish the exact equivalence, the kernel matrix should be ${\mathbf{K}=\alpha\mathbf{X}^T\mathbf{X}}$. The predictive distribution then is

$\displaystyle p(y|\mathbf{y},[\mathbf{x},\mathbf{X}])=\frac{p([y,\mathbf{y}]|[\mathbf{x},\mathbf{X}])}{p(\mathbf{y}|\mathbf{X})}=\mathcal{N}(y|m(\mathbf{x}),\sigma^2(\mathbf{x}))$

where

$\displaystyle \begin{array}{rcl} m(\mathbf{x})&=&\mathbf{x}^T\mathbf{X}(\mathbf{X}^T\mathbf{X}+\beta^{-1}\alpha\mathbf{I})^{-1}\mathbf{y}^T\\ \sigma^2(\mathbf{x})&=&\beta^{-1}+\mathbf{x}^T\mathbf{S}_n\mathbf{x} \end{array}$

Then you can substitute the inner product in above formulation with kernel to get nonlinear regression.

Be careful when you see a Gaussian process ${\mathrm{GP}(f(\mathbf{x})|m(\mathbf{x}),\kappa(\mathbf{x},\mathbf{x}'))}$, the kernel here is not the usual kernel you used in other kernel methods. It has to be positive definite, otherwise the marginal distribution wont be a valid distribution. To be more clear, maybe we should use something like ${\mathrm{GP}(f(\mathbf{x})|m(\mathbf{x}),\lambda\kappa(\mathbf{x},\mathbf{x}')+\epsilon\delta(\mathbf{x},\mathbf{x}'))}$, where ${\lambda=\alpha^{-1}}$, ${\epsilon=\beta^{-1}}$ and the kernel then is the usual kernel used in non-probabilistic kernel methods.

Dealing with bias in GP is very tricky. One way is to use the augment kernel ${\tilde{\kappa}(\mathbf{x},\mathbf{x}')=\kappa(\mathbf{x},\mathbf{x}')+1}$ as suggested in previous post. Then define a Gaussian process ${\mathrm{GP}(f(\mathbf{x})|0,\lambda\tilde{\kappa}(\mathbf{x},\mathbf{x}')+\epsilon\delta(\mathbf{x},\mathbf{x}'))}$ and regress with it as in (3). Again, you can not expect translation invariance by using this.

Another way is to use the centering kernel

$\displaystyle \begin{array}{rcl} \bar{\kappa}(\mathbf{x},\mathbf{x}')&=&\kappa(\mathbf{x},\mathbf{x}')-\frac{1}{n}\sum_i\kappa(\mathbf{x},\mathbf{x}_i)\\ &&-\frac{1}{n}\sum_i\kappa(\mathbf{x}',\mathbf{x}_i)+\frac{1}{n}\sum_i\frac{1}{n}\sum_j\kappa(\mathbf{x}_i,\mathbf{x}_j) \end{array}$

then define a Gaussian process ${\mathrm{GP}(f(\mathbf{x})|\bar{y},\lambda\bar{\kappa}(\mathbf{x},\mathbf{x}')+\epsilon\delta(\mathbf{x},\mathbf{x}'))}$ and regress with it as in (3). This method gives you translation invariance. However, this definition of GP depends on the training data, since the center depends on data. The kernel function also depends on the data.

One might think that we can derive the GP from a probabilistic linear regression formulation which is translation invariant. However, it is not easy. To see why, we explicitly marginalize the ${w_0}$ as

$\displaystyle \begin{array}{rcl} p(\mathbf{y}|\mathbf{X})&=&\int p(\mathbf{y}|\mathbf{X},\mathbf{w},w_0)p(\mathbf{w})d\mathbf{w} p(w_0)dw_0=\int p(\mathbf{y}|\mathbf{X},w_0)p(w_0)dw_0\\ &=&\int \mathcal{N}(\mathbf{y}|w_0\mathbf{1},\alpha^{-1}\mathbf{X}^T\mathbf{X}+\beta^{-1}\mathbf{I})\mathcal{N}(w_0|0,\alpha_0^{-1})dw_0\\ &=&\mathcal{N}(\mathbf{y}|0,(\alpha^{-1}\mathbf{X}^T\mathbf{X}+\alpha_0^{-1}\mathbf{1}^T\mathbf{1})+\beta^{-1}\mathbf{I}) \end{array}$

As indicated in previous section, the data centering is equivalent to make ${\alpha_0=0}$. However by doing so, ${p(\mathbf{y}|\mathbf{X})}$ wont be a valid Gaussian, and you can not construct a GP from it.

To contruct a GP for regression which is translation invariant is superisingly non-trivial. How to solve the dilemma is still bothering me. I cannot figure out a way to have a GP, when used for regression, the solution of which is translation invariant. I hope there is a easier way to do it. Maybe the reality is that you can not! Any suggestion is welcome.