Efficient Matlab (II): Kmeans clustering algorithm

Kmeans is the simplest and most widely used clustering algorithm. A detailed description can be found on the Wikipedia page.

Given an initial set of k means, the algorithm proceeds by alternating between two steps until convergence:
(1) Assignment step: Assign each sample point to the cluster with the closest mean.
(2) Update step: Calculate the new means to be the centroid of the sample points in the cluster.
The algorithm is deemed to have converged when the assignments no longer change.

There is a built-in kmeans function in Matlab. Again, it is far from efficient; the implementation is naive, something like

while not converged
  for each point
    assign label
  end
  for each cluster
    compute mean
  end
end
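
In concrete Matlab terms, the naive version might look like the following (a hypothetical sketch for comparison, not the actual toolbox source; X is the d x n sample matrix and k the number of clusters):

n = size(X,2);
label = ceil(k*rand(1,n));                % random initial assignment
last = zeros(1,n);
while any(label ~= last)
    last = label;
    for j = 1:k                           % update step: loop over clusters
        m(:,j) = mean(X(:,label==j),2);   % mean of the members of cluster j
    end
    for i = 1:n                           % assignment step: loop over points
        [~,label(i)] = min(sum(bsxfun(@minus,m,X(:,i)).^2,1));
    end
end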

There are at least two layers of loops, which hurt efficiency badly. Here, we will use some tricks to get rid of the inner loops by vectorization.

The assignment step finds the nearest mean for each point. Therefore, we can use the vectorized pairwise distance function from the previous post to find the nearest neighbors:

[~,label] = min(sqDistance(m,X),[],1);

where m is the matrix of means and X is the sample matrix. The sqDistance function is just a one-liner:

function D = sqDistance(X, Y)
% squared Euclidean distance between every column of X and every column of Y
D = bsxfun(@plus,dot(X,X,1)',dot(Y,Y,1))-2*(X'*Y);
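
The one-liner uses the expansion ||x - y||^2 = ||x||^2 + ||y||^2 - 2*x'*y, evaluated for all column pairs at once. A quick sanity check (illustrative sizes):

X = rand(3,5); Y = rand(3,4);
D = sqDistance(X,Y);                       % 5 x 4 matrix of squared distances
abs(D(2,3) - sum((X(:,2)-Y(:,3)).^2))      % near 0, up to round-off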

For our purpose (finding the nearest neighbor), we do not need to recompute the dot products of the sample points every time we compute the assignment; we can pre-compute them. Analyzing a little further, we see that we do not need to compute them at all, since they do not affect the ranking. Therefore, we can write the assignment step efficiently as

[~,label] = max(bsxfun(@minus,m'*X,dot(m,m,1)'/2),[],1);
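
To see why min turns into max: for a fixed sample x, argmin over k of (||m_k||^2 - 2*m_k'*x) equals argmax over k of (m_k'*x - ||m_k||^2/2), since the dropped ||x||^2 term is constant over k, and scaling by -1/2 flips min into max without changing the winner. A quick check that both forms give the same labels (illustrative sizes):

m = rand(3,4); X = rand(3,10);
[~,l1] = min(sqDistance(m,X),[],1);
[~,l2] = max(bsxfun(@minus,m'*X,dot(m,m,1)'/2),[],1);
isequal(l1,l2)                             % true, barring exact distance ties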

Vectorizing the update step is a little trickier. We first build an (n x k) indicator matrix E encoding the membership of each point: E(i,k) = 1 if sample i is in cluster k, and E(i,k) = 0 otherwise. Scaling each column k of E by 1/n(k), where n(k) is the number of samples in cluster k, the matrix of means m can then be computed as X*E. The update step can be written as:

    E = sparse(1:n,label,1,n,k,n);  % transform label into indicator matrix
    m = X*(E*spdiags(1./sum(E,1)',0,k,k));    % compute m of each cluster
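
For a tiny made-up example with n = 5 samples, k = 3 clusters, and label = [1 2 1 3 3]:

label = [1 2 1 3 3]; n = 5; k = 3;
E = sparse(1:n,label,1,n,k,n);
full(E)
% ans =
%      1     0     0
%      0     1     0
%      1     0     0
%      0     0     1
%      0     0     1
% sum(E,1) gives the cluster sizes [2 1 2]; scaling each column by 1/n(k)
% and left-multiplying by X averages the members of each cluster.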

Note that E is a sparse matrix. Matlab automatically optimizes the matrix multiplication between a sparse matrix and a dense matrix. It is far more efficient than multiplying two dense matrices if the sparse matrix is indeed sparse.
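
To get a feel for the difference, one can time the two variants (a rough benchmark sketch; the numbers depend on the machine):

n = 1e5; k = 100; d = 50;
X = rand(d,n);
label = randi(k,1,n);
Es = sparse(1:n,label,1,n,k,n);   % only n nonzeros out of n*k entries
Ed = full(Es);
tic; M1 = X*Es; toc               % sparse multiply: cost ~ nnz(Es)
tic; M2 = X*Ed; toc               % dense multiply: cost ~ n*k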

Putting everything together, we have a very concise implementation (10 lines of code):

function label = litekmeans(X, k)
n = size(X,2);
last = 0;
label = ceil(k*rand(1,n));  % random initialization
while any(label ~= last)
    E = sparse(1:n,label,1,n,k,n);  % transform label into indicator matrix
    m = X*(E*spdiags(1./sum(E,1)',0,k,k));    % compute m of each cluster
    last = label;
    [~,label] = max(bsxfun(@minus,m'*X,dot(m,m,1)'/2),[],1); % assign samples to the nearest centers
end
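
A typical call, on a synthetic two-blob data set (illustrative sizes):

X = [randn(2,500), bsxfun(@plus,randn(2,500),[5;5])];  % two 2-D Gaussian blobs
label = litekmeans(X,2);                               % cluster into k = 2 groups
scatter(X(1,:),X(2,:),10,label,'filled');              % color points by cluster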

The source code can be downloaded from here. This function is usually one to two orders of magnitude faster than the built-in Matlab kmeans function, depending on the data set. Again, the power of vectorization is tremendous.

The takeaways of this post might be:
(1) Vectorization!
(2) Analyzing code to remove redundant computation.
(3) Matrix multiplication between a sparse matrix and a dense matrix is efficiently optimized by Matlab.


10 Responses to Efficient Matlab (II): Kmeans clustering algorithm

  1. pecsbowen says:

    I am trying to run this with a .ppm file.
    I have tried double(img) and then passing it in, but the m = X*... line errors with a matrix dimension mismatch.
    Any suggestions on how I should get a kmeans code running for a .ppm image? And how do I convert it into a passable input parameter?

  2. Renato Cordeiro de Amorim says:

    Nice implementation!
    If I’m not mistaken, the kmeans function in the stats toolbox does a bit more. By default it runs an online phase after the batch k-means (hence the sub-optimal speed); just take a look at the help file and you will see what I’m talking about.

  3. Lucy says:

    How can we do a k-means clustering of an anisotropically diffused image in CIE LAB space?

  4. H. Birnbaum says:

    Hello,

    could you please explain how
    [~,label] = min(sqDistance(m,X),[],1);

    with the sqDistance function you explained is equivalent to this

    [~,label] = max(bsxfun(@minus,m'*X,dot(m,m,1)'/2),[],1);

    I don’t see it, and I also don’t get the same result when I compute the first and the second. Why does the min become a max?

    Thanks for explanations!

  5. H. Birnbaum says:

    Yes, thank you. But why does the former search for the minimal distance become a search for the max? Sorry, maybe it is obvious but I don’t see it. Thanks for your help!

  6. Pingback: Clustering I (Centroid models) | RAMSEY

  7. M. Catherall says:

    Very nice, thanks for the code, and for the explanatory post, great stuff! As a slight extension, you can give the data points different weights (so that the means are equivalent to centres of mass) simply by replacing the ‘1’ in the creation of the sparse matrix with a vector of point weights, w:

    E = sparse(1:n,label,w,n,k,n);
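
    The normalization still works out: sum(E,1) then gives the total weight in each cluster, so the mean-update line needs no other change:

    E = sparse(1:n,label,w,n,k,n);            % w(i) = weight of sample i, 1 x n
    m = X*(E*spdiags(1./sum(E,1)',0,k,k));    % weighted means: sum(w_i*x_i)/sum(w_i)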
