Efficient Matlab (II): Kmeans clustering algorithm

Kmeans is the simplest and most widely used clustering algorithm. A detailed description can be found on the Wikipedia page.

Given an initial set of k means, the algorithm proceeds by alternating between two steps until convergence:
(1) Assignment step: Assign each sample point to the cluster with the closest mean.
(2) Update step: Calculate the new means to be the centroid of the sample points in the cluster.
The algorithm is deemed to have converged when the assignments no longer change.

There is a built-in kmeans function in Matlab. Again, it is far from efficient; the implementation is naive, something like

while not converged
  for each point
    assign label
  end
  for each cluster
    compute mean
  end
end
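
In concrete Matlab terms, the naive version might look like the following (a hypothetical sketch for comparison, not the actual toolbox source; X is the d x n sample matrix and k the number of clusters):

n = size(X,2);
label = ceil(k*rand(1,n));                % random initial assignment
last = zeros(1,n);
while any(label ~= last)
    last = label;
    for j = 1:k                           % update step: loop over clusters
        m(:,j) = mean(X(:,label==j),2);   % mean of the members of cluster j
    end
    for i = 1:n                           % assignment step: loop over points
        [~,label(i)] = min(sum(bsxfun(@minus,m,X(:,i)).^2,1));
    end
end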

There are at least two layers of loops, which hurt efficiency badly. Here, we will use some tricks to get rid of the inner loops by vectorization.

The assignment step finds the nearest mean for each point. Therefore, we can use the vectorized pairwise distance function from the previous post to find the nearest neighbors:

[~,label] = min(sqDistance(m,X),[],1);

where m is the matrix of means and X is the sample matrix. The sqDistance function is just a one-liner:

function D = sqDistance(X, Y)
% squared Euclidean distance between every column of X and every column of Y
D = bsxfun(@plus,dot(X,X,1)',dot(Y,Y,1))-2*(X'*Y);
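
The one-liner uses the expansion ||x - y||^2 = ||x||^2 + ||y||^2 - 2*x'*y, evaluated for all column pairs at once. A quick sanity check (illustrative sizes):

X = rand(3,5); Y = rand(3,4);
D = sqDistance(X,Y);                       % 5 x 4 matrix of squared distances
abs(D(2,3) - sum((X(:,2)-Y(:,3)).^2))      % near 0, up to round-off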

For our purpose (finding the nearest neighbor), we do not need to recompute the dot products of the sample points every time we compute the assignment; we can pre-compute them. Analyzing a little further, we see that we do not need to compute them at all, since they do not affect the ranking. Therefore, we can write the assignment step efficiently as

[~,label] = max(bsxfun(@minus,m'*X,dot(m,m,1)'/2),[],1);
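
To see why min turns into max: for a fixed sample x, argmin over k of (||m_k||^2 - 2*m_k'*x) equals argmax over k of (m_k'*x - ||m_k||^2/2), since the dropped ||x||^2 term is constant over k, and scaling by -1/2 flips min into max without changing the winner. A quick check that both forms give the same labels (illustrative sizes):

m = rand(3,4); X = rand(3,10);
[~,l1] = min(sqDistance(m,X),[],1);
[~,l2] = max(bsxfun(@minus,m'*X,dot(m,m,1)'/2),[],1);
isequal(l1,l2)                             % true, barring exact distance ties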

Vectorizing the update step is a little trickier. We first build an (n x k) indicator matrix E encoding the membership of each point: E(i,k) = 1 if sample i is in cluster k, and E(i,k) = 0 otherwise. Scaling each column k of E by 1/n(k), where n(k) is the number of samples in cluster k, the matrix of means m can then be computed as X*E. The update step can be written as:

    E = sparse(1:n,label,1,n,k,n);  % transform label into indicator matrix
    m = X*(E*spdiags(1./sum(E,1)',0,k,k));    % compute m of each cluster
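
For a tiny made-up example with n = 5 samples, k = 3 clusters, and label = [1 2 1 3 3]:

label = [1 2 1 3 3]; n = 5; k = 3;
E = sparse(1:n,label,1,n,k,n);
full(E)
% ans =
%      1     0     0
%      0     1     0
%      1     0     0
%      0     0     1
%      0     0     1
% sum(E,1) gives the cluster sizes [2 1 2]; scaling each column by 1/n(k)
% and left-multiplying by X averages the members of each cluster.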

Note that E is a sparse matrix. Matlab automatically optimizes the matrix multiplication between a sparse matrix and a dense matrix. It is far more efficient than multiplying two dense matrices if the sparse matrix is indeed sparse.
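
To get a feel for the difference, one can time the two variants (a rough benchmark sketch; the numbers depend on the machine):

n = 1e5; k = 100; d = 50;
X = rand(d,n);
label = randi(k,1,n);
Es = sparse(1:n,label,1,n,k,n);   % only n nonzeros out of n*k entries
Ed = full(Es);
tic; M1 = X*Es; toc               % sparse multiply: cost ~ nnz(Es)
tic; M2 = X*Ed; toc               % dense multiply: cost ~ n*k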

Putting everything together, we have a very concise implementation (10 lines of code):

function label = litekmeans(X, k)
n = size(X,2);
last = 0;
label = ceil(k*rand(1,n));  % random initialization
while any(label ~= last)
    E = sparse(1:n,label,1,n,k,n);  % transform label into indicator matrix
    m = X*(E*spdiags(1./sum(E,1)',0,k,k));    % compute m of each cluster
    last = label;
    [~,label] = max(bsxfun(@minus,m'*X,dot(m,m,1)'/2),[],1); % assign samples to the nearest centers
end
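
A typical call, on a synthetic two-blob data set (illustrative sizes):

X = [randn(2,500), bsxfun(@plus,randn(2,500),[5;5])];  % two 2-D Gaussian blobs
label = litekmeans(X,2);                               % cluster into k = 2 groups
scatter(X(1,:),X(2,:),10,label,'filled');              % color points by cluster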

The source code can be downloaded from here. This function is usually one to two orders of magnitude faster than the built-in Matlab kmeans function, depending on the data set. Again, the power of vectorization is tremendous.

The takeaways of this post might be:
(1) Vectorization!
(2) Analyzing code to remove redundant computation.
(3) Matrix multiplication between a sparse matrix and a dense matrix is efficiently optimized by Matlab.


10 Responses to Efficient Matlab (II): Kmeans clustering algorithm

  1. pecsbowen says:

    I am trying to run this with a .ppm file.
    I have tried double(img) and then passing it in, but the m = X*... line errors with a matrix dimension mismatch.
    Any suggestions on how I should get a kmeans code running for a .ppm image? And how do I convert it into a passable input parameter?

  2. Renato Cordeiro de Amorim says:

    Nice implementation!
    If I’m not mistaken, the kmeans function in the stats toolbox does a bit more. By default it runs an online phase after the batch k-means (hence the sub-optimal speed); just take a look at the help file and you will see what I’m talking about.

  3. Lucy says:

    How can we do a k-means clustering of an anisotropically diffused image in CIE LAB space?

  4. H. Birnbaum says:

    Hello,

    could you please explain how
    [~,label] = min(sqDistance(m,X),[],1);

    with the sqDistance function you explained is equivalent to this

    [~,label] = max(bsxfun(@minus,m'*X,dot(m,m,1)'/2),[],1);

    I don’t see it, and I also don’t get the same result when I compute the first and the second. Why does the min become a max?

    Thanks for explanations!

  5. H. Birnbaum says:

    Yes, thank you. But why does the former search for the minimal distance become a search for the max? Sorry, maybe it is obvious but I don’t see it. Thanks for your help!

  6. Pingback: Clustering I (Centroid models) | RAMSEY

  7. M. Catherall says:

    Very nice, thanks for the code, and for the explanatory post, great stuff! As a slight extension, you can give the data points different weights (so that the means are equivalent to centres of mass) simply by replacing the ‘1’ in the creation of the sparse matrix with a vector of point weights, w:

    E = sparse(1:n,label,w,n,k,n);
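
    The normalization still works out: sum(E,1) then gives the total weight in each cluster, so the mean-update line needs no other change:

    E = sparse(1:n,label,w,n,k,n);            % w(i) = weight of sample i, 1 x n
    m = X*(E*spdiags(1./sum(E,1)',0,k,k));    % weighted means: sum(w_i*x_i)/sum(w_i)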
