Undocumented Machine Learning (IV): Exponential Family

1. Exponential Family

The exponential family of distributions over {x}, given the parameter {\theta}, is defined to be the set of distributions of the form

\displaystyle p(x|\theta)=h(x)g(\theta)\exp\left(\theta^T\phi(x)\right)

or equivalently

\displaystyle p(x|\theta)=p_0(x)\exp\left(\theta^T\phi(x)-A(\theta)\right)

Here {\theta} is called the natural parameter of the distribution, and {\phi(x)} is a function of {x} called the sufficient statistic. The function {g(\theta)} is a normalization coefficient (its reciprocal {1/g(\theta)} is the partition function), chosen such that

\displaystyle g(\theta)\int h(x)\exp\left(\theta^T\phi(x)\right)dx=1

where the function

\displaystyle -\ln g(\theta)=\ln\int h(x)\exp\left(\theta^T\phi(x)\right)dx

is a convex function of {\theta} (it is precisely the {A(\theta)} appearing in the second form above) and has the property

\displaystyle  \begin{array}{rcl}  -\nabla_{\theta}\ln g(\theta)&=&\frac{\int \phi(x)h(x)\exp\left(\theta^T\phi(x)\right)dx}{\int h(x)\exp\left(\theta^T\phi(x)\right)dx}\\ &=&\int \phi(x)h(x)g(\theta)\exp\left(\theta^T\phi(x)\right)dx=E[\phi(x)] \end{array}

Therefore we have

\displaystyle  \begin{array}{rcl}  	E[\phi(x)]&=&-\nabla_{\theta}\ln g(\theta)=\mu(\theta)\\ 	\mathrm{Cov}[\phi(x)]&=&-\nabla_{\theta}^2\ln g(\theta) \end{array}
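For concreteness, here is a minimal Python sketch using the Bernoulli distribution, for which {h(x)=1}, {\phi(x)=x}, {\theta=\ln(p/(1-p))} and {g(\theta)=1/(1+e^{\theta})}; it verifies {E[\phi(x)]=-\nabla_{\theta}\ln g(\theta)} by a finite difference.

import numpy as np

# Bernoulli in natural form: p(x|theta) = h(x) g(theta) exp(theta * x),
# with h(x) = 1, phi(x) = x, g(theta) = 1 / (1 + exp(theta)).
def neg_log_g(theta):
    return np.log1p(np.exp(theta))  # -ln g(theta), the log-partition function

theta = 0.7
p = 1.0 / (1.0 + np.exp(-theta))  # Bernoulli mean implied by theta

# A central finite difference of -ln g(theta) recovers E[phi(x)] = p.
eps = 1e-6
grad = (neg_log_g(theta + eps) - neg_log_g(theta - eps)) / (2 * eps)
print(grad, p)  # both approximately 0.6682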

The likelihood of the distribution given data {X=\{x_i\}_{i=1}^n} is

\displaystyle p(X|\theta)=\prod_{i=1}^n h(x_i)g(\theta)^n\exp\left(\theta^T\sum_{i=1}^n \phi(x_i)\right)

The maximum likelihood estimation of the natural parameters is equivalent to solving the following convex optimization problem

\displaystyle \min_{\theta}~~~ -\left<\frac{1}{n}\sum_{i=1}^n \phi(x_i),\theta\right>-\ln g(\theta)

The global optimum, obtained by setting the gradient to zero so that {\mu(\theta)} equals the empirical mean of the sufficient statistics, is the solution of the following equation

\displaystyle \theta=\mu^{-1}\left(\frac{1}{n}\sum_{i=1}^n \phi(x_i)\right)
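Continuing the Bernoulli example, {\mu(\theta)} is the logistic function, so {\mu^{-1}} is the logit and the maximum likelihood natural parameter is simply the logit of the sample mean of the sufficient statistics; a minimal sketch:

import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=10000)  # synthetic Bernoulli observations

s = x.mean()                     # (1/n) sum_i phi(x_i)
theta_hat = np.log(s / (1 - s))  # mu^{-1} is the logit for the Bernoulli
print(theta_hat, np.log(0.3 / 0.7))  # estimate vs. true natural parameter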

For members of the exponential family, there exists a conjugate prior that can be written in the form

\displaystyle p(\theta|\chi^{\mathrm{o}},\nu^{\mathrm{o}})=f(\chi^{\mathrm{o}},\nu^{\mathrm{o}})g(\theta)^{\nu^{\mathrm{o}}}\exp\left(\nu^{\mathrm{o}}\theta^T\chi^{\mathrm{o}}\right)

where {f(\chi^{\mathrm{o}},\nu^{\mathrm{o}})} is a normalization coefficient. The posterior distribution is

\displaystyle p(\theta|X,\chi^{\mathrm{o}},\nu^{\mathrm{o}})\propto p(X|\theta)p(\theta|\chi^{\mathrm{o}},\nu^{\mathrm{o}})

which is in the same parametric form as the prior distribution

\displaystyle p(\theta|X,\chi^{\mathrm{o}},\nu^{\mathrm{o}})=p(\theta|\chi,\nu)=f(\chi,\nu)g(\theta)^{\nu}\exp\left(\nu\theta^T\chi\right)

where

\displaystyle  \begin{array}{rcl}  \nu&=&\nu^{\mathrm{o}}+n,\\ \chi&=&\frac{\nu^{\mathrm{o}} \chi^{\mathrm{o}}+\sum_{i=1}^n \phi(x_i)}{\nu^{\mathrm{o}}+n} \end{array}

The parameters {\nu^{\mathrm{o}}} and {\nu} can be interpreted as the effective numbers of pseudo-observations in the prior and the posterior respectively. {\chi^{\mathrm{o}}} and {\chi} are the averages of the effective observations. We will use the term prior hyperparameters to refer to {\eta^{\mathrm{o}}=\{\chi^{\mathrm{o}},\nu^{\mathrm{o}}\}}, and the term posterior hyperparameters to refer to {\eta=\{\chi,\nu\}}.
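The update is simple enough to state as code; the following sketch (the function name posterior_hyperparams is illustrative, not a library routine) mirrors the equations above.

import numpy as np

def posterior_hyperparams(chi0, nu0, phi_values):
    # phi_values: array of shape (n, d), one row per phi(x_i)
    phi_values = np.atleast_2d(phi_values)
    n = phi_values.shape[0]
    nu = nu0 + n                                                  # pseudo-count
    chi = (nu0 * np.asarray(chi0) + phi_values.sum(axis=0)) / nu  # running average
    return chi, nu

# Bernoulli instance (phi(x) = x): chi is a smoothed success frequency.
chi, nu = posterior_hyperparams(chi0=[0.5], nu0=2.0,
                                phi_values=[[1], [0], [1], [1]])
print(chi, nu)  # [0.66666667] 6.0, i.e. (2 * 0.5 + 3) / 6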

For an exponential family distribution with its conjugate prior, given observations {X=\{x_i\}_{i=1}^n}, we can also evaluate the marginal likelihood (also known as the model evidence) analytically:

\displaystyle  \begin{array}{rcl}  p(X)&=&\int p(X|\theta)p(\theta|\chi^{\mathrm{o}},\nu^{\mathrm{o}})d\theta\\ &=&\int \prod_{i=1}^n h(x_i)g(\theta)^n\exp\left(\theta^T\sum_{i=1}^n \phi(x_i)\right)f(\chi^{\mathrm{o}},\nu^{\mathrm{o}})g(\theta)^{\nu^{\mathrm{o}}}\exp\left(\nu^{\mathrm{o}}\theta^T\chi^{\mathrm{o}}\right) d\theta\\ &=&f(\chi^{\mathrm{o}},\nu^{\mathrm{o}})\prod_{i=1}^n h(x_i)\int g(\theta)^{\nu^{\mathrm{o}}+n}\exp\left(\theta^T\left(\sum_{i=1}^n \phi(x_i)+\nu^{\mathrm{o}}\chi^{\mathrm{o}}\right)\right)d\theta\\ &=&f(\chi^{\mathrm{o}},\nu^{\mathrm{o}})\prod_{i=1}^n h(x_i)\int g(\theta)^{\nu}\exp\left(\nu\theta^T\chi\right)d\theta\\ &=&\frac{f(\chi^{\mathrm{o}},\nu^{\mathrm{o}})}{f(\chi,\nu)}\prod_{i=1}^n h(x_i) \end{array}
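As a concrete instance, for the Bernoulli likelihood the conjugate prior above corresponds to a {\mathrm{Beta}(\nu^{\mathrm{o}}\chi^{\mathrm{o}},\nu^{\mathrm{o}}(1-\chi^{\mathrm{o}}))} density on the mean, and {1/f(\chi,\nu)} is the Beta function {B(\nu\chi,\nu(1-\chi))}, so the evidence reduces to a ratio of Beta functions; a sketch:

import numpy as np
from scipy.special import betaln

def log_evidence(x, chi0, nu0):
    # p(X) = f(chi0, nu0) / f(chi, nu) for the Beta-Bernoulli pair,
    # where 1 / f(chi, nu) = B(nu * chi, nu * (1 - chi)) and h(x) = 1.
    x = np.asarray(x)
    n, s = x.size, x.sum()
    nu = nu0 + n
    chi = (nu0 * chi0 + s) / nu
    return (betaln(nu * chi, nu * (1 - chi))
            - betaln(nu0 * chi0, nu0 * (1 - chi0)))

x = [1, 0, 1, 1, 0, 1]
print(np.exp(log_evidence(x, chi0=0.5, nu0=2.0)))
# equals B(a + s, b + n - s) / B(a, b) with a = b = 1 (a uniform prior)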

The predictive likelihood of a new observation {x^*} is given by

\displaystyle  \begin{array}{rcl}  p(x^*|X)&=&\int p(x^*|\theta)p(\theta|X,\chi^{\mathrm{o}},\nu^{\mathrm{o}})d\theta=\int p(x^*|\theta)p(\theta|\chi,\nu)d\theta\\ &=&\int h(x^*)g(\theta)\exp\left(\theta^T\phi(x^*)\right) f(\chi,\nu)g(\theta)^{\nu}\exp\left(\nu\theta^T\chi\right)d\theta\\ &=&f(\chi,\nu)h(x^*)\int g(\theta)^{\nu+1}\exp\left(\theta^T(\phi(x^*)+\nu\chi)\right)d\theta\\ &=&f(\chi,\nu)h(x^*)\int g(\theta)^{\nu^*}\exp\left(\nu^*\theta^T\chi^*\right)d\theta\\ &=&\frac{f(\chi,\nu)}{f(\chi^*,\nu^*)}h(x^*) \end{array}

where

\displaystyle  \begin{array}{rcl}  \nu^*&=&\nu+1\\ \chi^*&=&\frac{\nu\chi+\phi(x^*)}{\nu+1} \end{array}
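The same Beta-Bernoulli instance illustrates the predictive formula; note that the predictive probability of {x^*=1} works out to exactly {\chi}, the average of the effective observations. A sketch:

import numpy as np
from scipy.special import betaln

def log_predictive(x_star, chi, nu):
    # p(x*|X) = f(chi, nu) / f(chi*, nu*) with nu* = nu + 1 and
    # chi* = (nu * chi + phi(x*)) / (nu + 1); here phi(x*) = x*.
    nu_star = nu + 1
    chi_star = (nu * chi + x_star) / nu_star
    return (betaln(nu_star * chi_star, nu_star * (1 - chi_star))
            - betaln(nu * chi, nu * (1 - chi)))

print(np.exp(log_predictive(1, chi=2 / 3, nu=6.0)))  # 0.6667 = chi
print(np.exp(log_predictive(0, chi=2 / 3, nu=6.0)))  # 0.3333 = 1 - chi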

The marginal distribution itself is usually not a member of the exponential family.

2. Conjugate Gaussian Distribution

The density function of the Gaussian distribution is

\displaystyle p(x|\mu,\Lambda)=\mathcal{N}(x|\mu,\Lambda^{-1})=\frac{|\Lambda|^{1/2}}{(2\pi)^{d/2}}\exp\left(-\frac{1}{2}(x-\mu)^T\Lambda(x-\mu)\right)

where {\mu} and {\Lambda} are the mean and the precision of the Gaussian distribution. The likelihood is

\displaystyle p(X|\mu,\Lambda)=\prod_{i=1}^{n}p(x_i|\mu,\Lambda)=\left(\frac{|\Lambda|}{(2\pi)^{d}}\right)^{n/2}\exp\left(-\frac{1}{2}\sum_{i=1}^n(x_i-\mu)^T\Lambda(x_i-\mu)\right)

The maximum likelihood estimates of the parameters {\mu} and {\Lambda^{-1}} are the sample mean and the sample covariance

\displaystyle  \begin{array}{rcl}  \bar{x}&=&\frac{1}{n}\sum_{i=1}^n x_i\\ \Lambda^{-1}&=&\frac{1}{n}\sum_{i=1}^n(x_i-\bar{x})(x_i-\bar{x})^T \end{array}
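A minimal numpy sketch of these estimates:

import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[1.0, -2.0],
                            cov=[[2.0, 0.3], [0.3, 1.0]], size=5000)

x_bar = X.mean(axis=0)                        # sample mean
S = (X - x_bar).T @ (X - x_bar) / X.shape[0]  # MLE covariance (divides by n)
print(x_bar)
print(S)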

When the precision {\Lambda} is known, the conjugate prior for the mean {\mu} of {\mathcal{N}(x|\mu,\Lambda^{-1})} is a Gaussian

\displaystyle p(\mu|m^{\mathrm{o}},\kappa^{\mathrm{o}})=\mathcal{N}(\mu|m^{\mathrm{o}},(\kappa^{\mathrm{o}}\Lambda)^{-1})

The posterior is

\displaystyle  \begin{array}{rcl}  p(\mu|X)&\propto& p(X|\mu)p(\mu|m^{\mathrm{o}},\kappa^{\mathrm{o}})\\ &\propto& \exp\left(-\frac{1}{2}\sum_{i=1}^n(x_i-\mu)^T\Lambda(x_i-\mu)\right) \exp\left(-\frac{\kappa^{\mathrm{o}}}{2}(\mu-m^{\mathrm{o}})^T\Lambda(\mu-m^{\mathrm{o}})\right)\\ &\propto&\exp\left(-\frac{1}{2}\left[\sum_{i=1}^n(x_i-\mu)^T\Lambda(x_i-\mu)+\kappa^{\mathrm{o}}(\mu-m^{\mathrm{o}})^T\Lambda(\mu-m^{\mathrm{o}})\right]\right)\\ &\propto&\exp\left(-\frac{\kappa^{\mathrm{o}}+n}{2}(\mu-m)^T\Lambda(\mu-m)\right) \end{array}

where the last step completes the square in {\mu}, with

\displaystyle m=\frac{\kappa^{\mathrm{o}}m^{\mathrm{o}}+n\bar{x}}{\kappa^{\mathrm{o}}+n}

The posterior is again a Gaussian

\displaystyle p(\mu|X)=\mathcal{N}(\mu|m,(\kappa\Lambda)^{-1})

with the parameters

\displaystyle  \begin{array}{rcl}  \kappa&=&\kappa^{\mathrm{o}}+n\\ m&=&\frac{\kappa^{\mathrm{o}}m^{\mathrm{o}}+n\bar{x}}{\kappa^{\mathrm{o}}+n} \end{array}
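A sketch of this update; {\kappa} acts as a pseudo-count, and {m} is a {\kappa}-weighted average of the prior mean and the sample mean (the function name is illustrative):

import numpy as np

def posterior_mean_params(m0, kappa0, X):
    # kappa = kappa0 + n, m = (kappa0 * m0 + n * x_bar) / (kappa0 + n)
    X = np.atleast_2d(X)
    n = X.shape[0]
    kappa = kappa0 + n
    m = (kappa0 * np.asarray(m0, dtype=float) + n * X.mean(axis=0)) / kappa
    return m, kappa

X = np.array([[1.2, -1.8], [0.9, -2.1], [1.1, -2.0]])
m, kappa = posterior_mean_params(m0=[0.0, 0.0], kappa0=1.0, X=X)
print(m, kappa)  # posterior mean shrinks x_bar toward m0; kappa = 4.0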

The conjugate prior for both the parameters {\mu} and {\Lambda} of {\mathcal{N}(x|\mu,\Lambda^{-1})} is the Gaussian-Wishart distribution

\displaystyle p(\mu,\Lambda|m^{\mathrm{o}},\kappa^{\mathrm{o}}, T^{\mathrm{o}},\nu^{\mathrm{o}})=p(\mu|\Lambda)p(\Lambda)=\mathcal{N}(\mu|m^{\mathrm{o}},(\kappa^{\mathrm{o}}\Lambda)^{-1})\mathcal{W}(\Lambda|(T^{\mathrm{o}})^{-1},\nu^{\mathrm{o}})

where {\mathcal{W}(\Lambda|W,\nu)} is the Wishart distribution with the density function given by

\displaystyle \mathcal{W}(\Lambda|W,\nu)=B(W,\nu)|\Lambda|^{(\nu-d-1)/2}\exp\left(-\frac{1}{2}\mathrm{Tr}(W^{-1}\Lambda)\right)

where

\displaystyle B(W,\nu)=|W|^{-\nu/2}\left(2^{\nu d/2}\pi^{d(d-1)/4}\prod_{i=1}^d \Gamma\left(\frac{\nu+1-i}{2}\right)\right)^{-1}
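This normalizer can be sanity-checked numerically against scipy.stats.wishart (whose scale parameter corresponds to {W} here), using {\ln B(W,\nu)=-\frac{\nu}{2}\ln|W|-\frac{\nu d}{2}\ln 2-\ln\Gamma_d(\nu/2)} with {\Gamma_d} the multivariate gamma function; a sketch:

import numpy as np
from scipy.special import multigammaln
from scipy.stats import wishart

d, nu = 2, 5.0
W = np.array([[2.0, 0.3], [0.3, 1.0]])
Lam = np.array([[1.5, -0.2], [-0.2, 0.8]])

# ln B(W, nu) = -(nu/2) ln|W| - (nu d / 2) ln 2 - ln Gamma_d(nu / 2)
logB = (-(nu / 2) * np.linalg.slogdet(W)[1]
        - (nu * d / 2) * np.log(2) - multigammaln(nu / 2, d))
logpdf = (logB + ((nu - d - 1) / 2) * np.linalg.slogdet(Lam)[1]
          - 0.5 * np.trace(np.linalg.solve(W, Lam)))
print(logpdf, wishart.logpdf(Lam, df=nu, scale=W))  # the two should agree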

The posterior is of the same parametric form as the prior

\displaystyle p(\mu,\Lambda|X,m,\kappa,T,\nu)=\mathcal{N}(\mu|m,(\kappa\Lambda)^{-1})\mathcal{W}(\Lambda|T^{-1},\nu)

where

\displaystyle  \begin{array}{rcl}  \kappa&=&\kappa^{\mathrm{o}}+n\\ m&=&\frac{\kappa^{\mathrm{o}}m^{\mathrm{o}}+n\bar{x}}{\kappa^{\mathrm{o}}+n}\\ \nu&=&\nu^{\mathrm{o}}+n\\ T&=&T^{\mathrm{o}}+nS+\frac{\kappa^{\mathrm{o}}n(\bar{x}-m^{\mathrm{o}})(\bar{x}-m^{\mathrm{o}})^T}{\kappa^{\mathrm{o}}+n} \end{array}

and {S=\frac{1}{n}\sum_{i=1}^n(x_i-\bar{x})(x_i-\bar{x})^T} is the sample covariance defined earlier.
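These updates translate directly into code; a minimal sketch (the function name is illustrative):

import numpy as np

def gaussian_wishart_posterior(m0, kappa0, T0, nu0, X):
    X = np.atleast_2d(X)
    n = X.shape[0]
    x_bar = X.mean(axis=0)
    S = (X - x_bar).T @ (X - x_bar) / n          # sample covariance
    kappa = kappa0 + n
    m = (kappa0 * np.asarray(m0, dtype=float) + n * x_bar) / kappa
    nu = nu0 + n
    diff = (x_bar - np.asarray(m0, dtype=float)).reshape(-1, 1)
    T = T0 + n * S + (kappa0 * n / kappa) * (diff @ diff.T)
    return m, kappa, T, nu

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
m, kappa, T, nu = gaussian_wishart_posterior(
    m0=np.zeros(2), kappa0=1.0, T0=np.eye(2), nu0=4.0, X=X)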

The Bayesian posterior predictive density of the conjugate Gaussian-Wishart model is given by

\displaystyle  \begin{array}{rcl}  p(x|X)&=&\int\int p(x|\mu,\Lambda)p(\mu,\Lambda|X)d\mu d\Lambda\\ &=&\int \int\mathcal{N}(x|\mu,\Lambda^{-1})\mathcal{N}(\mu|m,(\kappa\Lambda)^{-1})d\mu \mathcal{W}(\Lambda|T^{-1},\nu) d\Lambda\\ &=&\int\mathcal{N}(x|m,(1+\kappa^{-1})\Lambda^{-1})\mathcal{W}(\Lambda|T^{-1},\nu) d\Lambda\\ &=&\mathcal{T}(x|m,\frac{(\kappa+1)}{\kappa(\nu-d+1)}T,\nu-d+1) \end{array}

where {\mathcal{T}(x|\mu,\Sigma,\nu)} is the multivariate Student-t distribution with density function

\displaystyle \mathcal{T}(x|\mu,\Sigma,\nu)=\frac{\Gamma((\nu+d)/2)}{\Gamma(\nu/2)}\frac{1}{(\nu\pi)^{d/2}|\Sigma|^{1/2}}\left[1+\frac{1}{\nu}(x-\mu)^T\Sigma^{-1}(x-\mu)\right]^{-(\nu+d)/2}
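Putting the pieces together, here is a sketch of the Student-t log-density and the resulting predictive log-density (function names are illustrative; the hyperparameter values in the usage lines are arbitrary):

import numpy as np
from scipy.special import gammaln

def multivariate_t_logpdf(x, mu, Sigma, nu):
    x, mu = np.asarray(x, dtype=float), np.asarray(mu, dtype=float)
    d = x.size
    diff = x - mu
    maha = diff @ np.linalg.solve(Sigma, diff)   # (x-mu)^T Sigma^{-1} (x-mu)
    logdet = np.linalg.slogdet(Sigma)[1]
    return (gammaln((nu + d) / 2) - gammaln(nu / 2)
            - 0.5 * d * np.log(nu * np.pi) - 0.5 * logdet
            - 0.5 * (nu + d) * np.log1p(maha / nu))

def predictive_logpdf(x, m, kappa, T, nu):
    # p(x|X) = T(x | m, (kappa + 1) / (kappa * (nu - d + 1)) * T, nu - d + 1)
    d = np.asarray(m).size
    df = nu - d + 1
    return multivariate_t_logpdf(x, m, (kappa + 1) / (kappa * df) * T, df)

m, kappa, nu = np.zeros(2), 5.0, 7.0
T = np.eye(2)
print(predictive_logpdf([0.1, -0.2], m, kappa, T, nu))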
