In this post, I will show how to derive the Gaussian process formulation of nonlinear regression directly from the probabilistic linear regression model. Along the way, some common misunderstandings about GPs should become clear.

First, for reference, let's write down the probabilistic formulation of linear regression.

** 0.1. Probabilistic Linear Regression **

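For a linear model

$$y = \mathbf{w}^\top \mathbf{x} + \epsilon,$$

where the noise follows $\epsilon \sim \mathcal{N}(0, \sigma^2)$, the likelihood of $n$ training pairs, with $X = [\mathbf{x}_1, \dots, \mathbf{x}_n]^\top$ and $\mathbf{y} = (y_1, \dots, y_n)^\top$, is

$$p(\mathbf{y} \mid X, \mathbf{w}) = \mathcal{N}(\mathbf{y} \mid X\mathbf{w}, \sigma^2 I).$$

Assuming the prior is

$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \tau^2 I),$$

the posterior is then

$$p(\mathbf{w} \mid X, \mathbf{y}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}, S),$$

where

$$S = (\tau^{-2} I + \sigma^{-2} X^\top X)^{-1}, \qquad \mathbf{m} = \sigma^{-2} S X^\top \mathbf{y}.$$

Given a new sample $\mathbf{x}_*$, the predictive distribution of its value $y_*$ is

$$p(y_* \mid \mathbf{x}_*, X, \mathbf{y}) = \mathcal{N}(y_* \mid \mu_*, \sigma_*^2),$$

where

$$\mu_* = \mathbf{m}^\top \mathbf{x}_*, \qquad \sigma_*^2 = \sigma^2 + \mathbf{x}_*^\top S \mathbf{x}_*.$$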

These are textbook results, which can be found in PRML.
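The textbook formulas are easy to check numerically. The following is a minimal sketch, not a reference implementation: it assumes an isotropic prior with variance `tau2` on the weights and noise variance `sigma2`, and the function names `blr_posterior` and `blr_predict` are made up for illustration.

```python
import numpy as np

def blr_posterior(X, y, sigma2, tau2):
    """Posterior N(m, S) over w for y = X w + noise, with prior w ~ N(0, tau2 * I)."""
    d = X.shape[1]
    S = np.linalg.inv(np.eye(d) / tau2 + X.T @ X / sigma2)
    m = S @ X.T @ y / sigma2
    return m, S

def blr_predict(x_star, m, S, sigma2):
    """Predictive mean and variance of y_* at a single input x_star."""
    mean = m @ x_star
    var = sigma2 + x_star @ S @ x_star  # noise variance plus parameter uncertainty
    return mean, var

# Synthetic data (an assumption for the demo): 50 points in 3 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)

m, S = blr_posterior(X, y, sigma2=0.01, tau2=1.0)
mean, var = blr_predict(np.array([0.3, 0.1, -0.2]), m, S, sigma2=0.01)
```

With this much data the posterior mean `m` should sit close to `w_true`, and the predictive variance is never smaller than the noise variance.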

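To deal with the bias term in a linear model

$$y = \mathbf{w}^\top \mathbf{x} + b + \epsilon,$$

one way is to use the augmented variable trick, $\tilde{\mathbf{x}} = (\mathbf{x}^\top, 1)^\top$ and $\tilde{\mathbf{w}} = (\mathbf{w}^\top, b)^\top$. Applying the above formulation is then equivalent to using the prior $\mathcal{N}(0, \tau^2)$ for $b$. It does not make much sense to do so: as indicated before, by incorporating this prior you lose the translation-invariance property. To fix this, we can use a non-informative prior for $b$, where we simply set $p(b) \propto 1$. The posterior of $\mathbf{w}$ can then be derived in closed form.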
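As a sketch of that posterior (in assumed notation, not necessarily the one originally used here): take $X$ as the $n \times d$ matrix of inputs, $\sigma^2$ as the noise variance, a prior $\mathcal{N}(\mathbf{0}, \tau^2 I)$ on $\mathbf{w}$, and a flat prior $p(b) \propto 1$. Integrating out $b$ replaces the data with its centered version:

```latex
% Assumed notation: H = I - (1/n) 1 1^T is the centering matrix.
\begin{aligned}
p(\mathbf{w} \mid X, \mathbf{y})
  &\propto \exp\!\Big(-\tfrac{1}{2\tau^2}\mathbf{w}^\top\mathbf{w}\Big)
     \int \exp\!\Big(-\tfrac{1}{2\sigma^2}
       \lVert \mathbf{y} - X\mathbf{w} - b\mathbf{1}\rVert^2\Big)\, db \\
  &\propto \exp\!\Big(-\tfrac{1}{2\tau^2}\mathbf{w}^\top\mathbf{w}
     -\tfrac{1}{2\sigma^2}(\mathbf{y} - X\mathbf{w})^\top H (\mathbf{y} - X\mathbf{w})\Big), \\
p(\mathbf{w} \mid X, \mathbf{y})
  &= \mathcal{N}(\mathbf{w} \mid \mathbf{m}_c, S_c), \quad
S_c = (\tau^{-2} I + \sigma^{-2} X^\top H X)^{-1}, \quad
\mathbf{m}_c = \sigma^{-2} S_c X^\top H \mathbf{y}.
\end{aligned}
```

Since $HX$ and $H\mathbf{y}$ are the centered inputs and targets, this is the standard posterior applied to centered data.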

** 0.2. Gaussian Process **

Textbooks often present the Gaussian process directly, and then show that the predictive distribution of a GP is equivalent to that of linear regression when a linear kernel is used. However, since the two are equivalent, we should be able to derive one from the other. Textbooks usually do not show how.

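Here is the PRML way of presenting the GP (where we switch the symbols for the sake of consistency):

$$p(\mathbf{f}) = \mathcal{N}(\mathbf{f} \mid \mathbf{0}, K), \tag{1}$$

$$p(\mathbf{y} \mid \mathbf{f}) = \mathcal{N}(\mathbf{y} \mid \mathbf{f}, \sigma^2 I).$$

Then, marginalizing out $\mathbf{f}$, you get

$$p(\mathbf{y}) = \mathcal{N}(\mathbf{y} \mid \mathbf{0}, K + \sigma^2 I), \tag{2}$$

where $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$. However, this argument is not strictly correct, since the kernel matrix $K$ will almost certainly be singular (with a linear kernel and $n > d$, for instance, $K$ has rank at most $d$). Therefore the distribution (1) will be ill-defined. Actually, we can derive the Gaussian process from linear regression without involving the ill-defined distribution.

In Bayesian inference, we all know that given the likelihood $p(\mathbf{y} \mid X, \mathbf{w})$, we assume a prior $p(\mathbf{w})$ to derive the posterior $p(\mathbf{w} \mid X, \mathbf{y})$, and then do inference by marginalizing out the parameter:

$$p(y_* \mid \mathbf{x}_*, X, \mathbf{y}) = \int p(y_* \mid \mathbf{x}_*, \mathbf{w})\, p(\mathbf{w} \mid X, \mathbf{y})\, d\mathbf{w}.$$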

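Actually, there is another way: we can first marginalize out the parameter,

$$p(\mathbf{y} \mid X) = \int p(\mathbf{y} \mid X, \mathbf{w})\, p(\mathbf{w})\, d\mathbf{w}.$$

Once the parameter $\mathbf{w}$ is integrated out, the samples are no longer independent; this is called the explaining-away effect. Then we can do inference by

$$p(y_* \mid \mathbf{x}_*, X, \mathbf{y}) = \frac{p(y_*, \mathbf{y} \mid \mathbf{x}_*, X)}{p(\mathbf{y} \mid X)}.$$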

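To derive the GP formulation of regression from linear regression, one way is to first marginalize out $\mathbf{w}$ from $p(\mathbf{y} \mid X, \mathbf{w})$. For the prior $\mathcal{N}(\mathbf{0}, \tau^2 I)$ on $\mathbf{w}$ and noise variance $\sigma^2$, the marginal likelihood is

$$p(\mathbf{y} \mid X) = \mathcal{N}(\mathbf{y} \mid \mathbf{0}, \tau^2 X X^\top + \sigma^2 I).$$

This marginal likelihood is equivalent to (2). Therefore, you can see that, in order to establish the exact equivalence, the kernel matrix should be $K = \tau^2 X X^\top$. The predictive distribution is then

$$p(y_* \mid \mathbf{x}_*, X, \mathbf{y}) = \mathcal{N}(y_* \mid \mu_*, \sigma_*^2), \tag{3}$$

where

$$\mu_* = \mathbf{k}_*^\top (K + \sigma^2 I)^{-1} \mathbf{y}, \qquad \sigma_*^2 = k_{**} + \sigma^2 - \mathbf{k}_*^\top (K + \sigma^2 I)^{-1} \mathbf{k}_*,$$

with $\mathbf{k}_* = \tau^2 X \mathbf{x}_*$ and $k_{**} = \tau^2 \mathbf{x}_*^\top \mathbf{x}_*$.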

Then you can substitute the inner products in the above formulation with a kernel to get nonlinear regression.
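The equivalence between the two routes can be checked numerically. This is a rough sketch under assumed settings (an isotropic prior with variance `tau2`, noise variance `sigma2`, random synthetic data); it computes the prediction once through the posterior over the weights and once through the marginal covariance with a linear kernel, and also inspects the rank of the kernel matrix on its own.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 3
sigma2, tau2 = 0.05, 2.0
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + np.sqrt(sigma2) * rng.normal(size=n)
x_star = rng.normal(size=d)

# Weight-space view: posterior over w, then predict at x_star.
S = np.linalg.inv(np.eye(d) / tau2 + X.T @ X / sigma2)
m = S @ X.T @ y / sigma2
mean_w = m @ x_star
var_w = sigma2 + x_star @ S @ x_star

# Function-space view: marginal covariance C = K + sigma2 * I, with K = tau2 * X X^T.
K = tau2 * X @ X.T
k_star = tau2 * X @ x_star
k_ss = tau2 * x_star @ x_star
C = K + sigma2 * np.eye(n)
mean_f = k_star @ np.linalg.solve(C, y)
var_f = k_ss + sigma2 - k_star @ np.linalg.solve(C, k_star)

# K alone has rank at most d, so it is singular whenever n > d.
rank_K = np.linalg.matrix_rank(K)
print(mean_w, mean_f, var_w, var_f, rank_K)
```

The two means and the two variances agree up to floating-point error, while `K` without the noise term is rank deficient. Swapping the linear kernel for another kernel function in the second half would give the nonlinear version.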

Be careful when you see a Gaussian process $\mathcal{GP}(0, k)$: the kernel here is not the usual kernel used in other kernel methods. It has to be positive definite; otherwise the marginal distribution won't be a valid distribution. To be clearer, maybe we should write something like $\mathcal{GP}(0, c)$, where $c(\mathbf{x}_i, \mathbf{x}_j) = k(\mathbf{x}_i, \mathbf{x}_j) + \sigma^2 \delta_{ij}$, and the kernel $k$ is then the usual kernel used in non-probabilistic kernel methods.

Dealing with the bias in a GP is very tricky. One way is to use the augmented kernel, as suggested in the previous post, then define a Gaussian process with it and regress as in (3). Again, you cannot expect translation invariance from this.

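Another way is to use the centered kernel

$$k_c(\mathbf{x}, \mathbf{x}') = k(\mathbf{x}, \mathbf{x}') - \frac{1}{n}\sum_{i=1}^{n} k(\mathbf{x}_i, \mathbf{x}') - \frac{1}{n}\sum_{j=1}^{n} k(\mathbf{x}, \mathbf{x}_j) + \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} k(\mathbf{x}_i, \mathbf{x}_j),$$

then define a Gaussian process with it and regress as in (3). This method gives you translation invariance. However, this definition of the GP depends on the training data, since the center depends on the data; the kernel function therefore depends on the data as well.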
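On the training points, centering a kernel amounts to multiplying the kernel matrix on both sides by the centering matrix $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$. The sketch below assumes a linear kernel and reads "translation" as shifting every input by the same vector; the helper name `center_kernel` is made up for illustration.

```python
import numpy as np

def center_kernel(K):
    """Double-center a kernel matrix: H K H with H = I - (1/n) 1 1^T."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 4))
shift = np.array([5.0, -3.0, 2.0, 1.0])  # the same translation applied to every input

# Linear kernel matrices before and after translating the inputs.
K = X @ X.T
K_shifted = (X + shift) @ (X + shift).T

Kc = center_kernel(K)
Kc_shifted = center_kernel(K_shifted)

print(np.allclose(K, K_shifted), np.allclose(Kc, Kc_shifted))
```

The raw kernel matrix changes under the shift while the centered one does not. Note that `center_kernel` needs the whole training set to form the center, which is exactly the data dependence described above.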

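One might think that we can derive the GP from a probabilistic linear regression formulation that is translation invariant. However, it is not easy. To see why, we explicitly marginalize out $b$, with prior $p(b) = \mathcal{N}(b \mid 0, \tau_b^2)$, as

$$p(\mathbf{y} \mid X, \mathbf{w}) = \int p(\mathbf{y} \mid X, \mathbf{w}, b)\, p(b)\, db = \mathcal{N}(\mathbf{y} \mid X\mathbf{w}, \sigma^2 I + \tau_b^2 \mathbf{1}\mathbf{1}^\top).$$

As indicated in the previous section, data centering is equivalent to letting $\tau_b^2 \to \infty$. However, by doing so, $p(\mathbf{y} \mid X, \mathbf{w})$ won't be a valid Gaussian, and you cannot construct a GP from it.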
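A small numerical sketch of this limit, assuming the marginal covariance takes the form $\sigma^2 I + \tau_b^2 \mathbf{1}\mathbf{1}^\top$ for a bias prior with variance $\tau_b^2$ (the variable names are illustrative): as the prior variance grows, the largest eigenvalue of the covariance diverges, while the precision matrix approaches the singular matrix $H / \sigma^2$, where $H$ is the centering matrix.

```python
import numpy as np

n, sigma2 = 5, 0.5
ones = np.ones((n, 1))
H = np.eye(n) - ones @ ones.T / n  # centering matrix, annihilates constant vectors

def marginal_cov(tau_b2):
    """Covariance of y given w after integrating out b ~ N(0, tau_b2)."""
    return sigma2 * np.eye(n) + tau_b2 * ones @ ones.T

for tau_b2 in (1.0, 1e3, 1e6):
    cov = marginal_cov(tau_b2)
    top = np.linalg.eigvalsh(cov)[-1]  # equals sigma2 + n * tau_b2
    gap = np.abs(np.linalg.inv(cov) - H / sigma2).max()
    print(f"tau_b2={tau_b2:g}  largest eigenvalue={top:g}  max|precision - H/sigma2|={gap:.2e}")
```

The limiting precision $H / \sigma^2$ only sees centered data, which matches the centering interpretation, but it has a zero eigenvalue along $\mathbf{1}$, so it does not correspond to a proper Gaussian density.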

Constructing a GP for regression that is translation invariant is surprisingly non-trivial. How to resolve this dilemma is still bothering me: I cannot figure out a way to obtain a GP whose regression solution is translation invariant. I hope there is an easier way to do it. Maybe the reality is that you cannot! Any suggestions are welcome.
