K-means is the simplest and most widely used clustering algorithm. A detailed description can be found on the Wikipedia page.

Given an initial set of k means, the algorithm proceeds by alternating between two steps until convergence:

(1) Assignment step: Assign each sample point to the cluster with the closest mean.

(2) Update step: Calculate the new means to be the centroid of the sample points in the cluster.

The algorithm is deemed to have converged when the assignments no longer change.

There is a built-in kmeans function in Matlab. Again, it is far from efficient; the implementation is naive, something like

while not converged
    for each point
        assign label
    end
    for each cluster
        compute mean
    end
end

There are at least two layers of loops, which hurt the efficiency badly. Here, we will use some tricks to get rid of the inner loops by vectorization.

The assignment step is to find the nearest mean to each point. Therefore, we can utilize the vectorized version of the pairwise distance function we wrote in the previous post to find the nearest neighbors:

[~,label] = min(sqDistance(M,X),[],1);

where M is the mean matrix and X is the sample matrix. The sqDistance function is just a one-liner:

function D = sqDistance(X, Y)
D = bsxfun(@plus,dot(X,X,1)',dot(Y,Y,1))-2*(X'*Y);

For our purpose (finding the nearest neighbor), we do not need to compute the dot product of the sample points every time we compute the assignment; we can pre-compute it once. Analyzing a little further, we can see that we do not even need to compute it at all, since it does not affect the ranking. Therefore, we can write the assignment step efficiently as

[~,label] = max(bsxfun(@minus,m'*X,dot(m,m,1)'/2),[],1);
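As a sanity check, the two forms of the assignment step can be compared on random data (a sketch; the distance matrix is expanded inline here rather than calling sqDistance):

```matlab
% Illustrative sizes: 100 2-D sample points and 5 candidate means.
X = rand(2, 100);   % samples, one per column
m = rand(2, 5);     % cluster means, one per column

% Full squared-distance version: D(j,i) = |m_j|^2 + |x_i|^2 - 2*m_j'*x_i.
D = bsxfun(@plus, dot(m,m,1)', dot(X,X,1)) - 2*(m'*X);
[~, label1] = min(D, [], 1);

% Reduced version: drop dot(X,X,1) (constant within each column) and
% negate, turning the min into a max over m'*X - |m|^2/2.
[~, label2] = max(bsxfun(@minus, m'*X, dot(m,m,1)'/2), [], 1);

isequal(label1, label2)   % 1 when the labelings agree (almost surely, barring exact ties)
```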

Vectorizing the update step is a little tricky. We can first build an (n x k) indicator matrix E to encode the membership of each point in each cluster: if sample i is in cluster k then E(i,k)=1/n(k), otherwise E(i,k)=0, where n(k) is the number of samples in cluster k. Then the mean matrix M can be computed as X*E. The update step can now be written as:

E = sparse(1:n,label,1,n,k,n); % transform label into indicator matrix
m = X*(E*spdiags(1./sum(E,1)',0,k,k)); % compute mean of each cluster

Note that E is a sparse matrix. Matlab automatically optimizes the matrix multiplication between a sparse matrix and a dense matrix. It is far more efficient than multiplying two dense matrices if the sparse matrix is indeed sparse.
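To see what the indicator trick computes, here is a tiny worked example with three points and two clusters (the values are chosen only for illustration):

```matlab
X = [1 4 5; 0 0 2];          % three 2-D points, one per column
label = [1 2 1];             % points 1 and 3 in cluster 1, point 2 in cluster 2
n = 3; k = 2;

E = sparse(1:n, label, 1, n, k, n);          % 3-by-2 membership matrix
m = X*(E*spdiags(1./sum(E,1)', 0, k, k));    % normalize columns, then average

full(E)   % [1 0; 0 1; 1 0]
m         % [3 4; 1 0]: cluster 1 mean is (3,1), cluster 2 mean is (4,0)
```

Here spdiags(1./sum(E,1)',0,k,k) builds the diagonal matrix of 1/n(k) factors, so each column of E ends up summing to one before the means are taken.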

Putting everything together, we have a very concise implementation (10 lines of code):

function label = litekmeans(X, k)
n = size(X,2);
last = 0;
label = ceil(k*rand(1,n)); % random initialization
while any(label ~= last)
    E = sparse(1:n,label,1,n,k,n); % transform label into indicator matrix
    m = X*(E*spdiags(1./sum(E,1)',0,k,k)); % compute m of each cluster
    last = label;
    [~,label] = max(bsxfun(@minus,m'*X,dot(m,m,1)'/2),[],1); % assign samples to the nearest centers
end
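A quick usage sketch (assuming the function above is saved as litekmeans.m, with data laid out one sample per column):

```matlab
% Two well-separated 2-D Gaussian blobs of 100 points each.
X = [randn(2,100), bsxfun(@plus, randn(2,100), [5; 5])];

label = litekmeans(X, 2);   % cluster into k = 2 groups
% label is a 1-by-200 vector of cluster indices in 1..2;
% with random initialization the index assigned to each blob may vary between runs.
```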

The source code can be downloaded from here. This function is usually one or two orders of magnitude faster than the built-in Matlab kmeans function, depending on the data set used. Again, the power of vectorization is tremendous.

The takeaways of this post might be:

(1) Vectorization!

(2) Analyzing code to remove redundant computation.

(3) Matrix multiplication between a sparse matrix and a dense matrix is efficiently optimized by Matlab.

I am trying to run this with a .ppm file.

I have tried double(img) and then passing it to this, but then the m = X*... line of code errors with a mismatch of the matrix dimensions.

Any suggestions on how I should get a k-means code running for a .ppm image? And how do I convert it into a passable input parameter?

Nice implementation!

If I’m not mistaken, the kmeans function in the stats toolbox does a bit more. By default it runs an online phase after the batch k-means (hence the sub-optimal speed); just take a look in the help file and you will see what I’m talking about.

How can we do a k-means clustering of an anisotropically diffused image in CIE LAB space?

Hello,

could you please explain how

[~,label] = min(sqDistance(M,X),[],1);

with the sqDistance function you explained is equivalent to this

[~,label] = max(bsxfun(@minus,m'*X,dot(m,m,1)'/2),[],1);

I don’t see it, and I also don’t get the same result if I compute the first and the second. Why does the min become a max?

Thanks for explanations!

Please take a look at the previous post. sqDistance(M,X) = -2*m'*X + dot(m,m,1)'*1 + 1*dot(X,X,1), where the last part is constant within each column, so it does not affect the order.

Yes, thank you. But why does the former search for the minimal distance become a search for the max? Sorry, maybe it is obvious but I don’t see it. Thanks for your help!

[~,y]=min(x) is equivalent to [~,y]=max(-x)

Holy shiitzn, this is so cool thank you.


Very nice, thanks for the code, and for the explanatory post, great stuff! As a slight extension, you can give the data points different weights (so that the means are equivalent to centres of mass) simply by replacing the ‘1’ in the creation of the sparse matrix with a vector of point weights, w:

E = sparse(1:n,label,w,n,k,n);