## Undocumented Machine Learning (V): Naive Bayes

In this blog post, I will show that the decision function of Naive Bayes is the same as that of logistic regression when the observations follow a distribution from the exponential family.

1. Naive Bayes with Gaussian

The basic assumption of Naive Bayes is ${p(\mathbf{x}|y_k)=\prod_jp(x_j|y_k)}$. The posterior is

$\displaystyle p(y_k|\mathbf{x})=\frac{p(y_k)p(\mathbf{x}|y_k)}{\sum_k p(y_k)p(\mathbf{x}|y_k)}=\frac{p(y_k)\prod_j p(x_j|y_k)}{\sum_kp(y_k)\prod_j p(x_j|y_k)}$

In most textbooks, ${p(\mathbf{x}|y_k)}$ is a multinomial distribution. However, there is nothing preventing us from using other distributions, as long as the variables of different dimensions are independent. For example, for continuous variables, we can use a Gaussian

$\displaystyle p(x_j|y_k)=\mathcal{N}(x_j|\mu_{kj},\sigma_{kj})=\frac{1}{\sigma_{kj}\sqrt{2\pi}}\exp\left(-\frac{(x_j-\mu_{kj})^2}{2\sigma_{kj}^2}\right)$

$\displaystyle p(y_k)=\pi_k$

$\displaystyle \begin{array}{rcl} p(y_k)p(\mathbf{x}|y_k)&=&p(y_k)\prod_j p(x_j|y_k)=\exp\left(\ln p(y_k)+\sum_j\ln p(x_j|y_k)\right)\\ &=&\exp\left(\ln \pi_k-\sum_j\frac{\mu_{kj}^2}{2\sigma_{kj}^2}+\sum_j\frac{\mu_{kj}}{\sigma_{kj}^2}x_j\right)\prod_j\frac{1}{\sigma_{kj}\sqrt{2\pi}}\exp\left(-\frac{x_j^2}{2\sigma_{kj}^2}\right)\\ &=&\exp\left(\ln\pi_k-\frac{1}{2}\sum_j\sigma_{kj}^2w_{kj}^2+\mathbf{w}_k^T\mathbf{x}\right)p_k(\mathbf{x}) \end{array}$

where

$\displaystyle w_{kj}=\frac{\mu_{kj}}{\sigma_{kj}^2}$

and

$\displaystyle p_k(\mathbf{x})=\prod_j\frac{1}{\sigma_{kj}\sqrt{2\pi}}\exp\left(-\frac{x_j^2}{2\sigma_{kj}^2}\right)$
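The factorization above can be checked numerically. The sketch below uses hypothetical parameter values (not from the post) and verifies that ${p(y_k)\prod_j p(x_j|y_k)}$ equals the rewritten form ${\exp(\ln\pi_k-\frac{1}{2}\sum_j\sigma_{kj}^2w_{kj}^2+\mathbf{w}_k^T\mathbf{x})\,p_k(\mathbf{x})}$:

```python
import math

# Hypothetical per-feature Gaussian parameters for one class k (illustrative values)
pi_k = 0.4
mu = [1.0, -2.0]     # mu_{kj}
sigma = [0.5, 1.5]   # sigma_{kj}
x = [0.3, -1.2]

def gauss(xj, m, s):
    # N(x_j | mu_kj, sigma_kj^2)
    return math.exp(-(xj - m) ** 2 / (2 * s ** 2)) / (s * math.sqrt(2 * math.pi))

# Left-hand side: p(y_k) * prod_j N(x_j | mu_kj, sigma_kj^2)
lhs = pi_k * math.prod(gauss(xj, m, s) for xj, m, s in zip(x, mu, sigma))

# Right-hand side: exp(ln pi_k - 1/2 sum_j sigma^2 w^2 + w^T x) * p_k(x)
w = [m / s ** 2 for m, s in zip(mu, sigma)]   # w_kj = mu_kj / sigma_kj^2
p_k = math.prod(math.exp(-xj ** 2 / (2 * s ** 2)) / (s * math.sqrt(2 * math.pi))
                for xj, s in zip(x, sigma))
rhs = math.exp(math.log(pi_k)
               - 0.5 * sum(s ** 2 * wj ** 2 for s, wj in zip(sigma, w))
               + sum(wj * xj for wj, xj in zip(w, x))) * p_k

assert abs(lhs - rhs) < 1e-12
```

The two sides agree up to floating-point error, since the rewrite is just completing the square inside the exponent.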

The posterior is

$\displaystyle p(y_k|\mathbf{x})=\frac{p(y_k)p(\mathbf{x}|y_k)}{\sum_k p(y_k)p(\mathbf{x}|y_k)}$

If ${\sigma_{kj}=\sigma_j}$ for all ${k}$, then ${p_k(\mathbf{x})}$ no longer depends on ${k}$ and cancels between numerator and denominator. The posterior becomes

$\displaystyle p(y_k|\mathbf{x})=\frac{\exp(w_{k0}+\mathbf{w}_k^T\mathbf{x})}{\sum_k\exp(w_{k0}+\mathbf{w}_k^T\mathbf{x})}$

where

$\displaystyle w_{k0}=\ln\pi_k-\frac{1}{2}\sum_j\sigma_j^2w_{kj}^2$

We can see that Naive Bayes is a linear model whose decision function is the same as that of logistic regression, provided the variances are shared across classes.
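This equivalence can be verified directly: with shared per-feature variances ${\sigma_j}$, the exact Bayes posterior computed from the generative model matches a softmax over the linear scores ${w_{k0}+\mathbf{w}_k^T\mathbf{x}}$. A minimal sketch with hypothetical two-class parameters:

```python
import math

# Hypothetical parameters: two classes, shared per-feature variances sigma_j
pi = [0.3, 0.7]
mu = [[1.0, -2.0], [0.5, 0.8]]   # mu_{kj}
sigma = [0.5, 1.5]               # sigma_j, same for every class
x = [0.3, -1.2]

def gauss(xj, m, s):
    return math.exp(-(xj - m) ** 2 / (2 * s ** 2)) / (s * math.sqrt(2 * math.pi))

# Exact Bayes posterior from the generative model
joint = [pi[k] * math.prod(gauss(xj, m, s) for xj, m, s in zip(x, mu[k], sigma))
         for k in range(2)]
posterior = [v / sum(joint) for v in joint]

# Softmax over the linear scores w_k0 + w_k^T x
w = [[m / s ** 2 for m, s in zip(mu[k], sigma)] for k in range(2)]
w0 = [math.log(pi[k]) - 0.5 * sum(s ** 2 * wj ** 2 for s, wj in zip(sigma, w[k]))
      for k in range(2)]
scores = [w0[k] + sum(wj * xj for wj, xj in zip(w[k], x)) for k in range(2)]
softmax = [math.exp(s) / sum(math.exp(t) for t in scores) for s in scores]

assert all(abs(p - q) < 1e-12 for p, q in zip(posterior, softmax))
```

Note that if the variances differed across classes, the quadratic terms ${-x_j^2/2\sigma_{kj}^2}$ would not cancel and the decision boundary would be quadratic rather than linear.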

2. Naive Bayes with Exponential Family

Actually, the result in the previous section is not an accident. In general, we can use any exponential-family distribution of the following form for Naive Bayes

$\displaystyle p(\mathbf{x}|y_k)=\exp(\boldsymbol{\theta}_k^T\mathbf{x}-A(\boldsymbol{\theta}_k))p_o(\mathbf{x}) \ \ \ \ \ (1)$

This is a sub-family of the general exponential family introduced in a previous blog post. It is also widely used, for example in generalized linear models. This family of distributions has some special properties. First, variables of all dimensions are independent (provided the base measure ${p_o(\mathbf{x})}$ factorizes over dimensions), which meets the requirement of Naive Bayes. Second, the sufficient statistic is just the first-order moment

$\displaystyle E[\mathbf{x}]=\nabla A(\boldsymbol{\theta}_k)=\mu(\boldsymbol{\theta}_k)$

The maximum likelihood estimate of ${\boldsymbol{\theta}_k}$ is simply

$\displaystyle \boldsymbol{\theta}_k=\mu^{-1}\left(\frac{1}{n_k}\sum_{i\in y_k}\mathbf{x}_i\right)$

The posterior again is

$\displaystyle p(y_k|\mathbf{x})=\frac{\exp(w_{k0}+\mathbf{w}_k^T\mathbf{x})}{\sum_k\exp(w_{k0}+\mathbf{w}_k^T\mathbf{x})}$

where

$\displaystyle \mathbf{w}_k=\boldsymbol{\theta}_k$

and

$\displaystyle w_{k0}=\ln\pi_k-A(\boldsymbol{\theta}_k)$
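As a concrete instance, the Bernoulli distribution can be written in form (1) with natural parameter ${\theta=\ln\frac{p}{1-p}}$, log-partition ${A(\theta)=\ln(1+e^\theta)}$, and mean map ${\mu(\theta)=1/(1+e^{-\theta})}$. The sketch below (hypothetical parameter values) fits ${\boldsymbol{\theta}_k}$ via the inverse mean map and checks that the exact posterior equals the softmax with ${\mathbf{w}_k=\boldsymbol{\theta}_k}$ and ${w_{k0}=\ln\pi_k-A(\boldsymbol{\theta}_k)}$:

```python
import math

def sigmoid(t):   # mean map mu(theta) = A'(theta) for the Bernoulli family
    return 1.0 / (1.0 + math.exp(-t))

def logit(m):     # inverse mean map mu^{-1}
    return math.log(m / (1.0 - m))

def A(t):         # log-partition function A(theta) = ln(1 + e^theta)
    return math.log(1.0 + math.exp(t))

# Hypothetical per-class Bernoulli means for 3 binary features
pi = [0.5, 0.5]
mean = [[0.2, 0.7, 0.9], [0.6, 0.3, 0.4]]
theta = [[logit(m) for m in mean[k]] for k in range(2)]   # MLE via mu^{-1}
x = [1, 0, 1]

# Exact posterior from the generative model p^x (1-p)^(1-x)
joint = [pi[k] * math.prod(mean[k][j] ** x[j] * (1 - mean[k][j]) ** (1 - x[j])
                           for j in range(3)) for k in range(2)]
posterior = [v / sum(joint) for v in joint]

# Linear form: w_k = theta_k, and w_k0 = ln pi_k - A(theta_k),
# where A of the vector is the sum of per-feature A's (the density factorizes)
w0 = [math.log(pi[k]) - sum(A(t) for t in theta[k]) for k in range(2)]
scores = [w0[k] + sum(t * xj for t, xj in zip(theta[k], x)) for k in range(2)]
softmax = [math.exp(s) / sum(math.exp(t) for t in scores) for s in scores]

assert all(abs(p - q) < 1e-12 for p, q in zip(posterior, softmax))
```

This recovers the familiar fact that Bernoulli Naive Bayes has a log-linear posterior, with no Gaussian assumption involved.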

From these results, we can see that for Naive Bayes the decision boundary is always linear if the class-conditional distributions of the observations belong to the exponential family of form (1).