Undocumented Machine Learning (VI): Logistic Regression

In this blog post, I discuss the logistic regression model and share some thoughts on the relationship between logistic regression and naive Bayes.

1. Logistic Regression

The binary logistic regression classifier assumes that the label {y\in\{0,1\}} follows a Bernoulli distribution

\displaystyle  p(y|\mathbf{x},\mathbf{w},w_0)= \pi^{y}(1-\pi)^{1-y}


\displaystyle  \pi=\sigma(\mathbf{w}^T\mathbf{x}+w_0)

The logistic sigmoid function is

\displaystyle  \sigma(a)=\frac{1}{1+\exp(-a)}

Given a data set {(\mathbf{y},\mathbf{X})} of independent samples, the model is fitted by maximizing the log-likelihood with respect to the parameters {\mathbf{w}} and {w_0}.

\displaystyle  \max_{\mathbf{w},w_0} ~\ln p(\mathbf{y}|\mathbf{X},\mathbf{w},w_0)


\displaystyle  \ln p(\mathbf{y}|\mathbf{X},\mathbf{w},w_0)=\sum_{i=1}^n\ln p(y_i|\mathbf{x}_i,\mathbf{w},w_0)
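
As a minimal sketch of the log-likelihood above, the following Python/NumPy code evaluates it for a given {\mathbf{w}} and {w_0}. The function names and the toy data are made up for illustration, and `np.logaddexp` is used only as a numerically stable way to compute {\ln\pi} and {\ln(1-\pi)}.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def log_likelihood(w, w0, X, y):
    """Sum over samples of y*ln(pi) + (1-y)*ln(1-pi), with pi = sigmoid(w^T x + w0)."""
    a = X @ w + w0
    # ln(pi) = -ln(1 + exp(-a)) and ln(1 - pi) = -ln(1 + exp(a));
    # logaddexp(0, z) computes ln(1 + exp(z)) without overflow.
    return np.sum(-y * np.logaddexp(0.0, -a) - (1 - y) * np.logaddexp(0.0, a))

# Toy data, made up for illustration (rows of X are samples).
X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [3.0, 1.0]])
y = np.array([0, 0, 1, 1])

# With all parameters at zero every pi is 0.5, so the value is 4 * ln(0.5).
ll_zero = log_likelihood(np.zeros(2), 0.0, X, y)
```

Maximizing this quantity over {\mathbf{w}} and {w_0} gives the maximum likelihood solution.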

The decision boundary of the logistic classifier is a hyperplane. However, the optimization problem can be ill-posed. For example, if the data lie in a subspace, infinitely many hyperplanes in the ambient space cut through the same linear boundary within that subspace. To overcome this problem, a prior can be placed on {\mathbf{w}}; the Gaussian prior is the most common choice

\displaystyle  p(\mathbf{w})=\mathcal{N}(\mathbf{w}|\mathbf{0},\lambda^{-1}\mathbf{I})=\left(\frac{\lambda}{2\pi}\right)^{d/2}\exp\left(-\frac{\lambda}{2}\|\mathbf{w}\|^2\right)

where {d} is the dimension of {\mathbf{w}}.

The posterior then is

\displaystyle  \begin{array}{rcl}  p(\mathbf{w}|\mathbf{X},\mathbf{y},w_0)&\propto &p(\mathbf{y}|\mathbf{X},\mathbf{w},w_0)p(\mathbf{w})\\ &\propto & \prod_{i=1}^n \pi_i^{y_i}(1-\pi_i)^{1-y_i}\exp\left(-\frac{\lambda}{2}\|\mathbf{w}\|^2\right) \end{array}

The solution can be found by MAP estimation

\displaystyle  \max_{\mathbf{w},w_0} ~\ln p(\mathbf{w}|\mathbf{y},\mathbf{X},w_0)

which, writing {\pi_i=\sigma(\mathbf{w}^T\mathbf{x}_i+w_0)}, is equivalent to minimizing the regularized error function

\displaystyle  -\sum_{i=1}^n[y_i\ln \pi_i+(1-y_i)\ln(1-\pi_i)]+\frac{\lambda}{2}\|\mathbf{w}\|^2

Note that the penalty is placed only on {\mathbf{w}}, not on {w_0}. For a fixed {\lambda}, this makes the solution for {\mathbf{w}} translation invariant: shifting every sample by the same constant vector changes only {w_0}. Much of the literature overlooks this fact and places a prior over {w_0} as well, which is wrong because it destroys this invariance.
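
The translation invariance can be checked numerically. The sketch below is one possible way to do it, not a reference implementation: it minimizes the regularized error with Newton's method plus step halving (my choice of optimizer, not something prescribed above), and the function names, the `penalize_bias` flag, the random toy data and the shift vector `c` are all assumptions made for illustration.

```python
import numpy as np

def objective(theta, Xb, y, lam, pen):
    """Negative log-likelihood plus (lam/2) * sum of penalized squared parameters."""
    a = Xb @ theta
    nll = np.sum(y * np.logaddexp(0.0, -a) + (1 - y) * np.logaddexp(0.0, a))
    return nll + 0.5 * lam * np.sum(pen * theta ** 2)

def fit_map(X, y, lam=1.0, penalize_bias=False, n_iter=50):
    """MAP fit by Newton's method with step halving; returns (w, w0)."""
    n, d = X.shape
    Xb = np.hstack([X, np.ones((n, 1))])   # last column carries the bias w0
    pen = np.ones(d + 1)
    if not penalize_bias:
        pen[-1] = 0.0                      # no penalty on w0
    theta = np.zeros(d + 1)
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-(Xb @ theta)))
        grad = Xb.T @ (pi - y) + lam * pen * theta
        hess = Xb.T @ (Xb * (pi * (1 - pi))[:, None]) + lam * np.diag(pen)
        step = np.linalg.solve(hess, grad)
        # Halve the step until the objective does not increase.
        t, f0 = 1.0, objective(theta, Xb, y, lam, pen)
        while objective(theta - t * step, Xb, y, lam, pen) > f0 and t > 1e-8:
            t *= 0.5
        theta = theta - t * step
    return theta[:-1], theta[-1]

# Toy data, made up for illustration: two overlapping classes in 2D.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=40) > 0).astype(float)
c = np.array([10.0, -5.0])                 # arbitrary translation of the data

w_a, b_a = fit_map(X, y)                   # penalty on w only
w_b, b_b = fit_map(X + c, y)
wp_a, _ = fit_map(X, y, penalize_bias=True)
wp_b, _ = fit_map(X + c, y, penalize_bias=True)

print(np.allclose(w_a, w_b, atol=1e-6))    # w is unchanged by the shift
print(np.isclose(b_b, b_a - w_a @ c))      # only w0 absorbs the shift
print(np.allclose(wp_a, wp_b, atol=1e-3))  # with a penalty on w0, w changes
```

When {w_0} is unpenalized, the translated problem is just a reparameterization with {w_0'=w_0-\mathbf{w}^T\mathbf{c}}, so the optimal {\mathbf{w}} is the same. Once {w_0} is penalized, that reparameterization is no longer free and the fitted {\mathbf{w}} depends on where the origin happens to be.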

For the multiclass problem, we use the indicator vector {\mathbf{y}=[y_k]_{k=1}^K,y_k\in\{0,1\}} as the label. {\mathbf{y}} is assumed to follow the discrete (categorical) distribution, a special case of the multinomial distribution

\displaystyle  p(\mathbf{y}|\mathbf{x},\mathbf{W},\mathbf{w}_0)=\prod_{k=1}^K\pi_k^{y_k}

where {\mathbf{W}=[\mathbf{w}_k]_{k=1}^K}, {\mathbf{w}_0=[w_{k0}]_{k=1}^K} and

\displaystyle  p(y_k=1|\mathbf{x})=\pi_k=\frac{\exp(\mathbf{w}_k^T\mathbf{x}+w_{k0})}{\sum_{j=1}^K \exp(\mathbf{w}_j^T\mathbf{x}+w_{j0})} \ \ \ \ \ (1)

The model in eq(1) is also called a log-linear model.
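
For concreteness, here is a small Python/NumPy sketch of eq(1). The function name, the array layout (columns of `W` are the {\mathbf{w}_k}) and the toy numbers are assumptions made for illustration. Subtracting the row maximum before exponentiating is a standard stability trick and does not change {\pi_k}.

```python
import numpy as np

def softmax_probs(X, W, w0):
    """Class probabilities pi_k of eq(1) for each row of X.

    X: (n, d) samples, W: (d, K) with columns w_k, w0: (K,) biases.
    """
    A = X @ W + w0                          # activations w_k^T x + w_k0
    A = A - A.max(axis=1, keepdims=True)    # shift for stability; pi is unchanged
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

# Toy parameters, made up for illustration (d = 2 features, K = 3 classes).
X = np.array([[1.0, 2.0], [0.0, -1.0]])
W = np.array([[0.5, -0.5, 0.0],
              [1.0, 0.0, -1.0]])
w0 = np.array([0.0, 0.1, -0.1])

P = softmax_probs(X, W, w0)                 # shape (2, 3), each row sums to 1

# With K = 2 the model reduces to the binary case: the probability of the
# first class is a sigmoid of the difference of the two activations.
P2 = softmax_probs(X, W[:, :2], w0[:2])
a_diff = X @ (W[:, 0] - W[:, 1]) + (w0[0] - w0[1])
sig = 1.0 / (1.0 + np.exp(-a_diff))
```

The name log-linear comes from the fact that {\ln\pi_k} is linear in {\mathbf{x}} up to a normalizing term shared by all classes.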

2. Discussion

I don't want to open the deep generative vs. discriminative discussion here (maybe in a later post); I just want to share some of my thoughts. From the previous post, we can see that the decision function of naive Bayes has the same form as that of logistic regression, eq(1). Both are log-linear models, which means that both methods operate in the same function space. The difference lies in how they pick a solution in that space. Logistic regression minimizes an error function derived from eq(1), whereas naive Bayes computes an integral derived from eq(1). It is usually believed that logistic regression makes less restrictive assumptions about the data and is therefore more general than naive Bayes. However, from what I have observed, this belief is not necessarily true. Since the two methods use the same functional form for the pdf, they should have similar discriminative power.

In the naive Bayes model, the fact that the actual data violate the assumption does not mean that the solution is invalid for the data. Different distributional assumptions (e.g. Gaussian and multinomial) may lead to the same decision boundary, so an invalid assumption may still lead to a valid solution. For example, suppose your data are actually generated from a Gaussian, but you fit naive Bayes under the multinomial assumption. The resulting solution may turn out to be (approximately) the same as the one obtained under the Gaussian assumption. That is why, although the assumption in naive Bayes is not quite right, you can still get a good classifier for your data. By the same reasoning, we can argue that even when the data violate the feature independence assumption, the naive Bayes solution might still be a valid one. This may be attributed to the fact that although naive Bayes starts from the feature independence assumption, what it ends up with is a log-linear model, which covers more than the feature-independent case. That said, starting from something wrong to get something right is not the right thing to do per se.
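
To make the log-linear form of naive Bayes concrete, here is a minimal Python/NumPy sketch. It assumes a specific variant chosen for illustration: two classes, Gaussian features, and a per-feature variance shared by both classes. Under those assumptions the naive Bayes posterior is exactly a sigmoid of a linear function of {\mathbf{x}}. The function names and the toy data are made up.

```python
import numpy as np

def gaussian_nb_as_linear(X, y):
    """Fit two-class Gaussian naive Bayes with shared per-feature variances
    and return the equivalent linear parameters (w, w0) plus the NB estimates."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled per-feature variance (assumption: shared across the two classes).
    var = (((X0 - mu0) ** 2).sum(axis=0) + ((X1 - mu1) ** 2).sum(axis=0)) / len(X)
    prior1 = len(X1) / len(X)
    w = (mu1 - mu0) / var
    w0 = np.sum((mu0 ** 2 - mu1 ** 2) / (2 * var)) + np.log(prior1 / (1 - prior1))
    return w, w0, (mu0, mu1, var, prior1)

def nb_posterior(X, mu0, mu1, var, prior1):
    """p(y=1|x) from Bayes' rule with independent Gaussian features.
    Normalizing constants cancel because the variances are shared."""
    log_p1 = np.log(prior1) - 0.5 * np.sum((X - mu1) ** 2 / var, axis=-1)
    log_p0 = np.log(1 - prior1) - 0.5 * np.sum((X - mu0) ** 2 / var, axis=-1)
    return 1.0 / (1.0 + np.exp(log_p0 - log_p1))

# Toy data, made up for illustration.
X = np.array([[0.0, 1.0], [1.0, 0.5], [0.5, 2.0],
              [2.0, 3.0], [3.0, 2.5], [2.5, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

w, w0, params = gaussian_nb_as_linear(X, y)
post_nb = nb_posterior(X, *params)                  # naive Bayes posterior
post_lin = 1.0 / (1.0 + np.exp(-(X @ w + w0)))      # sigmoid of a linear function
```

The two posteriors agree, so in this setting naive Bayes and logistic regression differ only in how {\mathbf{w}} and {w_0} are estimated, not in the form of the decision function.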

On the other hand, although logistic regression does not assume feature independence, the log-linear assumption already restricts it to a small subset of the data sets whose features are correlated. For example, given two classes of Gaussian data, if the optimal decision boundary is linear, the features of the two classes must be correlated in the same way (the same covariance matrix). For such correlated data, naive Bayes might still work.
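
The Gaussian example can be spelled out with a short derivation. The symbols {\mu_0,\mu_1} (class means) and {\Sigma_0,\Sigma_1} (class covariance matrices) are introduced here for illustration only. For Gaussian class-conditional densities, the log posterior ratio is

\displaystyle  \ln\frac{p(y=1|\mathbf{x})}{p(y=0|\mathbf{x})}=-\frac{1}{2}\mathbf{x}^T(\Sigma_1^{-1}-\Sigma_0^{-1})\mathbf{x}+(\Sigma_1^{-1}\mu_1-\Sigma_0^{-1}\mu_0)^T\mathbf{x}+c

where {c} collects the terms that do not depend on {\mathbf{x}}. The quadratic term vanishes for all {\mathbf{x}} only when {\Sigma_1=\Sigma_0=\Sigma}, in which case the ratio takes the linear form {\mathbf{w}^T\mathbf{x}+w_0} with

\displaystyle  \mathbf{w}=\Sigma^{-1}(\mu_1-\mu_0),\qquad w_0=-\frac{1}{2}\mu_1^T\Sigma^{-1}\mu_1+\frac{1}{2}\mu_0^T\Sigma^{-1}\mu_0+\ln\frac{p(y=1)}{p(y=0)}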

The conclusion here is that we cannot tell which one is better simply by looking at the assumptions. What really matters is the fact that both models are log-linear, eq(1).

