Kmeans is the most simple and widely used clustering algorithm. Detail description can be found in the wiki page.
Given an initial set of k means, the algorithm proceeds by alternating between two steps until converge:
(1) Assignment step: Assign each sample point to the cluster with the closest mean.
(2) Update step: Calculate the new means to be the centroid of the sample points in the cluster.
The algorithm is deemed to have converged when the assignments no longer change.
There is a built-in kmeans function in Matlab. Again, it is far from efficient. The implementation in Matlab is naive. It is something like
while !converged for each point assign label end for each cluster compute mean end end
There are at least two layers of loops which hurt the efficiency badly. Here, we will use some tricks to get ride of the inner loops by verctorization.
The assignment step is to find the nearest mean to each point. Therefore, we can utilize the verctorized version of pairwise distance function we wrote in the previous post to find the nearest neighbors:
[~,label] = min(sqDistance(M,X),,1);
where M is the mean matrix and X is the sample matrix. the sqDistance function is just a one-liner:
function D = sqDistance(X, Y) D = bsxfun(@plus,dot(X,X,1)',dot(Y,Y,1))-2*(X'*Y);
For our purpose (find the nearest neighbor), we do not need to compute the dot product for sample points every time we compute the assignment. We can pre-compute it. Analyze a little further, we can see that we do not even need to compute it at all, since it does not affect the ranking. Therefore, we can write the assignment step efficiently as
[~,label] = max(bsxfun(@minus,m'*X,dot(m,m,1)'/2),,1);
Vectorizing the update step is a little tricky. We can first build a (K x n) indicator matrix E to indicate the membership of each point to each cluster. If sample i is in cluster k then E(k,i)=1/n(k), otherwise E(k,i)=0. Here n(k) is the number of samples in cluster k. Then the mean matrix M can be computed by X*E. The update step now can be written as:
E = sparse(1:n,label,1,n,k,n); % transform label into indicator matrix m = X*(E*spdiags(1./sum(E,1)',0,k,k)); % compute m of each cluster
Note that E is a sparse matrix. Matlab automatically optimizes the matrix multiplication between a sparse matrix and a dense matrix. It is far more efficient than multiplying two dense matrices if the sparse matrix is indeed sparse.
Putting everything together, we have a very concise implementation (10 lines of code):
function label = litekmeans(X, k) n = size(X,2); last = 0; label = ceil(k*rand(1,n)); % random initialization while any(label ~= last) E = sparse(1:n,label,1,n,k,n); % transform label into indicator matrix m = X*(E*spdiags(1./sum(E,1)',0,k,k)); % compute m of each cluster last = label; [~,label] = max(bsxfun(@minus,m'*X,dot(m,m,1)'/2),,1); % assign samples to the nearest centers end
The source code can be downloaded from here. This function is usually one or two order faster than the built-in Matlab kmeans function depending on the data set used. Again, the power of the vectorization is tremendous.
The takeaway of this post might be:
(2) Analyzing code to remove redundant computation.
(3) Matrix multiplication between a sparse matrix and a dense matrix is efficiently optimized by Matlab.