## Undocumented Machine Learning (IV): Exponential Family

1. Exponential Family

The exponential family of distributions over ${x}$, given the parameter ${\theta}$, is defined to be the set of distributions of the form

$\displaystyle p(x|\theta)=h(x)g(\theta)\exp\left(\theta^T\phi(x)\right)$

or equivalently

$\displaystyle p(x|\theta)=p_0(x)\exp\left(\theta^T\phi(x)-A(\theta)\right)$

Here ${\theta}$ is called the natural parameter of the distribution, and ${\phi(x)}$ is a function of ${x}$ called the sufficient statistic. The function ${g(\theta)}$ is a normalization coefficient (the reciprocal of the partition function), chosen so that

$\displaystyle g(\theta)\int h(x)\exp\left(\theta^T\phi(x)\right)dx=1$

where the function

$\displaystyle -\ln g(\theta)=\ln\int h(x)\exp\left(\theta^T\phi(x)\right)dx$

is a convex function and has the property

$\displaystyle \begin{array}{rcl} -\nabla_{\theta}\ln g(\theta)&=&\frac{\int \phi(x)h(x)\exp\left(\theta^T\phi(x)\right)dx}{\int h(x)\exp\left(\theta^T\phi(x)\right)dx}\\ &=&\int \phi(x)h(x)g(\theta)\exp\left(\theta^T\phi(x)\right)dx=E[\phi(x)] \end{array}$

Therefore we have

$\displaystyle \begin{array}{rcl} E[\phi(x)]&=&-\nabla_{\theta}\ln g(\theta)=\mu(\theta)\\ \mathrm{Cov}[\phi(x)]&=&-\nabla_{\theta}^2\ln g(\theta) \end{array}$
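These identities can be checked numerically on a concrete member of the family. The Bernoulli distribution in natural form has ${\phi(x)=x}$, ${h(x)=1}$, and ${-\ln g(\theta)=\ln(1+e^{\theta})}$; this worked example is an illustration added here, not part of the derivation above. A minimal sketch comparing finite differences of the log-partition function against the closed-form mean and variance:

```python
import math

# Bernoulli as an exponential-family member: phi(x) = x, h(x) = 1,
# and A(theta) = -ln g(theta) = ln(1 + e^theta) is the log-partition function.
def A(theta):
    return math.log1p(math.exp(theta))

def sigmoid(theta):
    return 1.0 / (1.0 + math.exp(-theta))

theta, eps = 0.7, 1e-5

# First derivative of A gives the mean: E[phi(x)] = sigmoid(theta).
dA = (A(theta + eps) - A(theta - eps)) / (2 * eps)

# Second derivative gives the variance: Cov[phi(x)] = sigmoid * (1 - sigmoid).
d2A = (A(theta + eps) - 2 * A(theta) + A(theta - eps)) / eps ** 2

print(dA, sigmoid(theta))                            # both ≈ 0.668
print(d2A, sigmoid(theta) * (1 - sigmoid(theta)))    # both ≈ 0.222
```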

The likelihood of the distribution given data ${X=\{x_i\}_{i=1}^n}$ is

$\displaystyle p(X|\theta)=\left[\prod_{i=1}^n h(x_i)\right]g(\theta)^n\exp\left(\theta^T\sum_{i=1}^n \phi(x_i)\right)$

The maximum likelihood estimation of the natural parameters is equivalent to solving the following convex optimization problem

$\displaystyle \min_{\theta}~~~ -\left<\frac{1}{n}\sum_{i=1}^n \phi(x_i),\theta\right>-\ln g(\theta)$

The global optimum, obtained by setting the gradient to zero, is the solution of the following equation

$\displaystyle \theta=\mu^{-1}\left(\frac{1}{n}\sum_{i=1}^n \phi(x_i)\right)$
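For the Bernoulli example, ${\mu(\theta)}$ is the logistic function, so the maximum likelihood solution is simply the logit of the sample mean of ${\phi(x)}$. A small sketch (the sampling setup is illustrative):

```python
import math
import random

random.seed(0)

# Draw Bernoulli samples with success probability p_true; phi(x) = x.
p_true = 0.3
data = [1 if random.random() < p_true else 0 for _ in range(100000)]

# Moment matching: mu(theta) = sigmoid(theta) must equal the sample mean
# of phi(x), so theta_hat = mu^{-1}(mean) = logit(mean).
mean_phi = sum(data) / len(data)
theta_hat = math.log(mean_phi / (1 - mean_phi))

# The estimate should be close to the true natural parameter logit(p_true).
print(theta_hat, math.log(p_true / (1 - p_true)))
```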

For members of the exponential family, there exists a conjugate prior that can be written in the form

$\displaystyle p(\theta|\chi^{\mathrm{o}},\nu^{\mathrm{o}})=f(\chi^{\mathrm{o}},\nu^{\mathrm{o}})g(\theta)^{\nu^{\mathrm{o}}}\exp\left(\nu^{\mathrm{o}}\theta^T\chi^{\mathrm{o}}\right)$

where ${f(\chi^{\mathrm{o}},\nu^{\mathrm{o}})}$ is a normalization coefficient. The posterior distribution is

$\displaystyle p(\theta|X,\chi^{\mathrm{o}},\nu^{\mathrm{o}})\propto p(X|\theta)p(\theta|\chi^{\mathrm{o}},\nu^{\mathrm{o}})$

which is in the same parametric form as the prior distribution

$\displaystyle p(\theta|X,\chi^{\mathrm{o}},\nu^{\mathrm{o}})=p(\theta|\chi,\nu)=f(\chi,\nu)g(\theta)^{\nu}\exp\left(\nu\theta^T\chi\right)$

where

$\displaystyle \begin{array}{rcl} \nu&=&\nu^{\mathrm{o}}+n,\\ \chi&=&\frac{\nu^{\mathrm{o}} \chi^{\mathrm{o}}+\sum_{i=1}^n \phi(x_i)}{\nu^{\mathrm{o}}+n} \end{array}$
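This update rule can be written as a short, distribution-agnostic helper. The function name and the Bernoulli-style sufficient statistic ${\phi(x)=x}$ in the example are hypothetical choices for illustration:

```python
# Generic conjugate update for an exponential-family likelihood:
# nu is the pseudo-observation count, chi the average sufficient statistic.
def posterior_hyperparameters(chi0, nu0, data, phi):
    n = len(data)
    s = sum(phi(x) for x in data)      # sum of sufficient statistics
    nu = nu0 + n                        # pseudo-count grows by n
    chi = (nu0 * chi0 + s) / nu         # weighted average of prior and data
    return chi, nu

# Example: a prior worth nu0 = 2 pseudo-observations with average statistic 0.5,
# updated with four Bernoulli-style observations (phi(x) = x).
chi, nu = posterior_hyperparameters(0.5, 2.0, [1, 0, 1, 1], phi=lambda x: x)
print(chi, nu)  # chi = (2*0.5 + 3)/6 ≈ 0.667, nu = 6
```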

The parameters ${\nu^{\mathrm{o}}}$ and ${\nu}$ can be interpreted as the effective number of pseudo-observations in the prior and the posterior, respectively, while ${\chi^{\mathrm{o}}}$ and ${\chi}$ are the average sufficient statistics of those effective observations. We will use the term prior hyperparameters to refer to ${\eta^{\mathrm{o}}=\{\chi^{\mathrm{o}},\nu^{\mathrm{o}}\}}$, and the term posterior hyperparameters to refer to ${\eta=\{\chi,\nu\}}$.

For a distribution in the exponential family with its conjugate prior, given observations ${X=\{x_i\}_i}$, we can also analytically evaluate the marginal likelihood (also known as model evidence) as

$\displaystyle \begin{array}{rcl} p(X)&=&\int p(X|\theta)p(\theta|\chi^{\mathrm{o}},\nu^{\mathrm{o}})d\theta\\ &=&\int \prod_{i=1}^n h(x_i)g(\theta)^n\exp\left(\theta^T\sum_{i=1}^n \phi(x_i)\right)f(\chi^{\mathrm{o}},\nu^{\mathrm{o}})g(\theta)^{\nu^{\mathrm{o}}}\exp\left(\nu^{\mathrm{o}}\theta^T\chi^{\mathrm{o}}\right) d\theta\\ &=&f(\chi^{\mathrm{o}},\nu^{\mathrm{o}})\prod_{i=1}^n h(x_i)\int g(\theta)^{\nu^{\mathrm{o}}+n}\exp\left(\theta^T\left(\sum_{i=1}^n \phi(x_i)+\nu^{\mathrm{o}}\chi^{\mathrm{o}}\right)\right)d\theta\\ &=&f(\chi^{\mathrm{o}},\nu^{\mathrm{o}})\prod_{i=1}^n h(x_i)\int g(\theta)^{\nu}\exp\left(\nu\theta^T\chi\right)d\theta\\ &=&\frac{f(\chi^{\mathrm{o}},\nu^{\mathrm{o}})}{f(\chi,\nu)}\prod_{i=1}^n h(x_i) \end{array}$
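For the Bernoulli likelihood with its conjugate Beta prior, ${1/f}$ is a Beta function, so the formula reduces to a ratio of Beta normalizers: ${p(X)=B(a+k,\,b+n-k)/B(a,b)}$ for ${k}$ successes in ${n}$ trials. The following sketch (parameter values are arbitrary) checks this closed form against direct numerical integration of the likelihood against the prior over the mean parameter:

```python
import math

def beta_fn(a, b):
    # Beta function via gamma functions.
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

# Beta-Bernoulli: conjugate prior Beta(a, b); k successes out of n trials.
a, b, n, k = 2.0, 3.0, 10, 4

# Closed-form evidence as the ratio of prior to posterior normalizers.
evidence = beta_fn(a + k, b + n - k) / beta_fn(a, b)

# Direct numerical integration of the same quantity (midpoint rule on [0, 1]).
m = 200000
total = 0.0
for i in range(m):
    p = (i + 0.5) / m
    total += p ** (a + k - 1) * (1 - p) ** (b + n - k - 1)
numeric = total / m / beta_fn(a, b)

print(evidence, numeric)  # both ≈ 6.66e-4
```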

The predictive likelihood of a new observation ${x^*}$ is given by

$\displaystyle \begin{array}{rcl} p(x^*|X)&=&\int p(x^*|\theta)p(\theta|X,\chi^{\mathrm{o}},\nu^{\mathrm{o}})d\theta=\int p(x^*|\theta)p(\theta|\chi,\nu)d\theta\\ &=&\int h(x^*)g(\theta)\exp\left(\theta^T\phi(x^*)\right) f(\chi,\nu)g(\theta)^{\nu}\exp\left(\nu\theta^T\chi\right)d\theta\\ &=&f(\chi,\nu)h(x^*)\int g(\theta)^{\nu+1}\exp\left(\theta^T(\phi(x^*)+\nu\chi)\right)d\theta\\ &=&f(\chi,\nu)h(x^*)\int g(\theta)^{\nu^*}\exp\left(\nu^*\theta^T\chi^*\right)d\theta\\ &=&\frac{f(\chi,\nu)}{f(\chi^*,\nu^*)}h(x^*) \end{array}$

where

$\displaystyle \begin{array}{rcl} \nu^*&=&\nu+1\\ \chi^*&=&\frac{\nu\chi+\phi(x^*)}{\nu+1} \end{array}$
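Specializing again to the Beta-Bernoulli case, the ratio ${f(\chi,\nu)/f(\chi^*,\nu^*)}$ recovers the familiar predictive probability ${(a+k)/(a+b+n)}$ for observing ${x^*=1}$; a quick check with arbitrary values:

```python
import math

def beta_fn(a, b):
    # Beta function via gamma functions.
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

a, b, n, k = 2.0, 3.0, 10, 4  # prior Beta(a, b), k ones in n Bernoulli draws

# Ratio of the posterior normalizer to the normalizer after one more
# observation x* = 1, i.e. f(chi, nu) / f(chi*, nu*).
pred_one = beta_fn(a + k + 1, b + n - k) / beta_fn(a + k, b + n - k)

# Agrees with the well-known closed form for the Beta-Bernoulli predictive.
print(pred_one, (a + k) / (a + b + n))  # both = 0.4
```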

The marginal distribution is usually not a member of the exponential family.

2. Conjugate Gaussian Distribution

The density function of the Gaussian distribution is

$\displaystyle p(x|\mu,\Lambda)=\mathcal{N}(x|\mu,\Lambda^{-1})=\frac{|\Lambda|^{1/2}}{(2\pi)^{d/2}}\exp\left(-\frac{1}{2}(x-\mu)^T\Lambda(x-\mu)\right)$

where ${\mu}$ and ${\Lambda}$ are the mean and the precision of the Gaussian distribution. The likelihood is

$\displaystyle p(X|\mu,\Lambda)=\prod_{i=1}^{n}p(x_i|\mu,\Lambda)=\left(\frac{|\Lambda|}{(2\pi)^{d}}\right)^{n/2}\exp\left(-\frac{1}{2}\sum_{i=1}^n(x_i-\mu)^T\Lambda(x_i-\mu)\right)$

The maximum likelihood estimates of the parameters ${\mu}$ and ${\Lambda^{-1}}$ are the sample mean and covariance

$\displaystyle \begin{array}{rcl} \bar{x}&=&\frac{1}{n}\sum_{i=1}^n x_i\\ \Lambda^{-1}&=&\frac{1}{n}\sum_{i=1}^n(x_i-\bar{x})(x_i-\bar{x})^T \end{array}$
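A one-dimensional numerical check (the synthetic data setup is illustrative): the ML estimates converge to the true mean and variance as ${n}$ grows. Note that the covariance estimate uses the ${1/n}$ normalization, not the unbiased ${1/(n-1)}$ version.

```python
import random

random.seed(1)

# Synthetic 1-D Gaussian data with known parameters.
mu_true, sigma_true = 2.0, 1.5
data = [random.gauss(mu_true, sigma_true) for _ in range(200000)]

# ML estimates: sample mean and biased (1/n) sample variance.
n = len(data)
x_bar = sum(data) / n
var_ml = sum((x - x_bar) ** 2 for x in data) / n

print(x_bar, var_ml)  # ≈ 2.0 and ≈ 2.25
```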

The conjugate prior for the parameter ${\mu}$ of ${\mathcal{N}(x|\mu,\Lambda^{-1})}$ is a Gaussian

$\displaystyle p(\mu|m^{\mathrm{o}},\kappa^{\mathrm{o}})=\mathcal{N}(\mu|m^{\mathrm{o}},(\kappa^{\mathrm{o}}\Lambda)^{-1})$

The posterior is

$\displaystyle \begin{array}{rcl} p(\mu|X)&\propto& p(X|\mu)p(\mu|m^{\mathrm{o}},\kappa^{\mathrm{o}})\\ &\propto& \exp\left(-\frac{1}{2}\sum_{i=1}^n(x_i-\mu)^T\Lambda(x_i-\mu)\right) \exp\left(-\frac{\kappa^{\mathrm{o}}}{2}(\mu-m^{\mathrm{o}})^T\Lambda(\mu-m^{\mathrm{o}})\right)\\ &\propto&\exp\left(-\frac{1}{2}\left[\sum_{i=1}^n(x_i-\mu)^T\Lambda(x_i-\mu)+\kappa^{\mathrm{o}}(\mu-m^{\mathrm{o}})^T\Lambda(\mu-m^{\mathrm{o}})\right]\right)\\ &\propto&\exp\left(-\frac{\kappa^{\mathrm{o}}+n}{2}(\mu-m)^T\Lambda(\mu-m)\right) \end{array}$

where

$\displaystyle m=\frac{\kappa^{\mathrm{o}}m^{\mathrm{o}}+n\bar{x}}{\kappa^{\mathrm{o}}+n}$

The posterior is again a Gaussian

$\displaystyle p(\mu|X)=\mathcal{N}(\mu|m,(\kappa\Lambda)^{-1})$

with the parameters

$\displaystyle \begin{array}{rcl} \kappa&=&\kappa^{\mathrm{o}}+n\\ m&=&\frac{\kappa^{\mathrm{o}}m^{\mathrm{o}}+n\bar{x}}{\kappa^{\mathrm{o}}+n} \end{array}$
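These two updates can be sketched as a small helper (the function name is a hypothetical choice); note that the posterior mean interpolates between the prior mean and the sample mean with weights ${\kappa^{\mathrm{o}}}$ and ${n}$:

```python
# Conjugate update for the Gaussian mean with known precision:
# kappa counts pseudo-observations, m is the posterior mean.
def gaussian_mean_posterior(m0, kappa0, data):
    n = len(data)
    x_bar = sum(data) / n
    kappa = kappa0 + n
    m = (kappa0 * m0 + n * x_bar) / kappa   # weighted average of m0 and x_bar
    return m, kappa

# With a weak prior (kappa0 = 1) the posterior mean is pulled
# mostly toward the sample mean.
m, kappa = gaussian_mean_posterior(m0=0.0, kappa0=1.0, data=[4.0, 6.0, 5.0])
print(m, kappa)  # m = (0 + 3*5)/4 = 3.75, kappa = 4
```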

The conjugate prior for both the parameters ${\mu}$ and ${\Lambda}$ of ${\mathcal{N}(x|\mu,\Lambda^{-1})}$ is the Gaussian-Wishart distribution

$\displaystyle p(\mu,\Lambda|m^{\mathrm{o}},\kappa^{\mathrm{o}}, T^{\mathrm{o}},\nu^{\mathrm{o}})=p(\mu|\Lambda)p(\Lambda)=\mathcal{N}(\mu|m^{\mathrm{o}},(\kappa^{\mathrm{o}}\Lambda)^{-1})\mathcal{W}(\Lambda|(T^{\mathrm{o}})^{-1},\nu^{\mathrm{o}})$

where ${\mathcal{W}(\Lambda|W,\nu)}$ is the Wishart distribution with the density function given by

$\displaystyle \mathcal{W}(\Lambda|W,\nu)=B(W,\nu)|\Lambda|^{(\nu-d-1)/2}\exp\left(-\frac{1}{2}\mathrm{Tr}(W^{-1}\Lambda)\right)$

where

$\displaystyle B(W,\nu)=|W|^{-\nu/2}\left(2^{\nu d/2}\pi^{d(d-1)/4}\prod_{i=1}^d \Gamma\left(\frac{\nu+1-i}{2}\right)\right)^{-1}$

The posterior is of the same parametric form as the prior

$\displaystyle p(\mu,\Lambda|X,m,\kappa,T,\nu)=\mathcal{N}(\mu|m,(\kappa\Lambda)^{-1})\mathcal{W}(\Lambda|T^{-1},\nu)$

where

$\displaystyle \begin{array}{rcl} \kappa&=&\kappa^{\mathrm{o}}+n\\ m&=&\frac{\kappa^{\mathrm{o}}m^{\mathrm{o}}+n\bar{x}}{\kappa^{\mathrm{o}}+n}\\ \nu&=&\nu^{\mathrm{o}}+n\\ T&=&T^{\mathrm{o}}+nS+\frac{\kappa^{\mathrm{o}}n(\bar{x}-m^{\mathrm{o}})(\bar{x}-m^{\mathrm{o}})^T}{\kappa^{\mathrm{o}}+n} \end{array}$

with ${S=\frac{1}{n}\sum_{i=1}^n(x_i-\bar{x})(x_i-\bar{x})^T}$ the sample covariance.
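The four updates, written here for the one-dimensional case where ${T}$ and ${S}$ are scalars (the multivariate version replaces the squared difference with an outer product); the function name is illustrative:

```python
# Gaussian-Wishart hyperparameter update, 1-D case.
def gauss_wishart_posterior(m0, kappa0, T0, nu0, data):
    n = len(data)
    x_bar = sum(data) / n
    S = sum((x - x_bar) ** 2 for x in data) / n   # sample covariance (1/n)
    kappa = kappa0 + n
    m = (kappa0 * m0 + n * x_bar) / kappa
    nu = nu0 + n
    # Prior scale + scatter around the sample mean + shrinkage term from
    # the disagreement between prior mean and sample mean.
    T = T0 + n * S + kappa0 * n * (x_bar - m0) ** 2 / (kappa0 + n)
    return m, kappa, T, nu

m, kappa, T, nu = gauss_wishart_posterior(0.0, 1.0, 1.0, 3.0, [1.0, 2.0, 3.0])
print(m, kappa, T, nu)  # 1.5, 4.0, ≈6.0, 6.0
```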

The posterior predictive distribution of the conjugate Gaussian-Wishart model is given by

$\displaystyle \begin{array}{rcl} p(x|X)&=&\int\int p(x|\mu,\Lambda)p(\mu,\Lambda|X)d\mu d\Lambda\\ &=&\int \int\mathcal{N}(x|\mu,\Lambda^{-1})\mathcal{N}(\mu|m,(\kappa\Lambda)^{-1})d\mu \mathcal{W}(\Lambda|T^{-1},\nu) d\Lambda\\ &=&\int\mathcal{N}(x|m,(1+\kappa^{-1})\Lambda^{-1})\mathcal{W}(\Lambda|T^{-1},\nu) d\Lambda\\ &=&\mathcal{T}(x|m,\frac{(\kappa+1)}{\kappa(\nu-d+1)}T,\nu-d+1) \end{array}$

where ${\mathcal{T}(x|\mu,\Sigma,\nu)}$ is a Student-t distribution with the density function

$\displaystyle \mathcal{T}(x|\mu,\Sigma,\nu)=\frac{\Gamma((\nu+d)/2)}{\Gamma(\nu/2)}\frac{1}{(\nu\pi)^{d/2}|\Sigma|^{1/2}}\left[1+\frac{1}{\nu}(x-\mu)^T\Sigma^{-1}(x-\mu)\right]^{-(\nu+d)/2}$
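As a sanity check on this parameterization, the one-dimensional density (${d=1}$) should integrate to one; a minimal sketch with arbitrary parameter values:

```python
import math

# Student-t density in the (mu, Sigma, nu) parameterization above, for d = 1.
def student_t_pdf(x, mu, Sigma, nu):
    d = 1
    c = math.gamma((nu + d) / 2) / math.gamma(nu / 2)
    c /= (nu * math.pi) ** (d / 2) * math.sqrt(Sigma)
    return c * (1 + (x - mu) ** 2 / (Sigma * nu)) ** (-(nu + d) / 2)

# Midpoint-rule integration over [-50, 50]; the truncated tails are
# negligible for nu = 5.
mu, Sigma, nu = 0.5, 2.0, 5.0
m, lo, hi = 200000, -50.0, 50.0
step = (hi - lo) / m
total = sum(student_t_pdf(lo + (i + 0.5) * step, mu, Sigma, nu)
            for i in range(m)) * step
print(total)  # ≈ 1.0
```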