
Possible Duplicate:
K-means algorithm variation with equal cluster size

EDIT: As casperOne pointed out to me, this question is a duplicate. Anyway, here is a more generalized question that covers this one: https://stats.stackexchange.com/questions/8744/clustering-procedure-where-each-cluster-has-an-equal-number-of-points

My requirements

In a project I need to group n points (x, y) into k clusters of equal size (n / k), where x and y are double-precision floating-point numbers, n can range from 100 to 10000, and k can range from 2 to 100. Also, k is known before the algorithm runs.

My experiments

I started to solve the problem by using the http://en.wikipedia.org/wiki/K-means_clustering algorithm, which works great and quickly produces exactly k clusters of roughly the same size.

But my problem is this: k-means produces clusters of roughly the same size, whereas I need the clusters to be exactly the same size (or, to be more precise, I need them to have a size between floor(n / k) and ceil(n / k)).

Before you point it out to me: yes, I tried the first answer here, K-means algorithm variation with equal cluster size, which sounds like a good idea.

The main idea is to post-process the array of clusters produced by k-means, from the biggest cluster down to the smallest. We reduce the size of every cluster that has more than n / k members by moving the extra points to another nearest cluster, leaving alone the clusters that have already been reduced.

Here is the pseudocode I implemented:

n is the number of points
k is the number of clusters
m = n / k (the ideal cluster size)
c is the array of clusters after k-means
c' = c sorted by size in descending order
for each cluster i in c' where i = 1 to k - 1
    e = size of cluster i - m (the number of points to move)
    loop e times
        find the point p in cluster i with minimal distance to a cluster j in c' where j > i
        move point p from cluster i to cluster j
    end loop
    recalculate centroids
end for each
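For concreteness, here is a minimal Java sketch of that post-processing step, assuming toy Point and Cluster classes (every name here is illustrative, not from any particular library):

```java
import java.util.*;

// Toy types for illustration only.
class Point {
    double x, y;
    Point(double x, double y) { this.x = x; this.y = y; }
    double dist(Point o) { return Math.hypot(x - o.x, y - o.y); }
}

class Cluster {
    Point centroid;
    List<Point> points = new ArrayList<>();
    Cluster(Point centroid) { this.centroid = centroid; }
}

class Rebalance {
    // Trim each oversized cluster by moving its extra points to the
    // nearest not-yet-processed cluster, as in the pseudocode above.
    static void run(List<Cluster> c, int n, int k) {
        int m = n / k;                                        // ideal cluster size
        c.sort((a, b) -> b.points.size() - a.points.size());  // descending by size
        for (int i = 0; i < k - 1; i++) {
            Cluster from = c.get(i);
            int extra = from.points.size() - m;               // points to move
            for (int t = 0; t < extra; t++) {
                Point best = null;
                Cluster target = null;
                double bd = Double.POSITIVE_INFINITY;
                // find the point in 'from' closest to a later cluster's centroid
                for (Point p : from.points) {
                    for (int j = i + 1; j < k; j++) {
                        double d = p.dist(c.get(j).centroid);
                        if (d < bd) { bd = d; best = p; target = c.get(j); }
                    }
                }
                from.points.remove(best);
                target.points.add(best);
            }
            for (Cluster cl : c) {                            // recalculate centroids
                double sx = 0, sy = 0;
                for (Point p : cl.points) { sx += p.x; sy += p.y; }
                cl.centroid = new Point(sx / cl.points.size(), sy / cl.points.size());
            }
        }
    }
}
```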

The problem with this algorithm is that near the end of the process (when i gets close to k), we have to choose a cluster j in c' (where j > i, because we must leave alone the clusters already processed), but the cluster j we find can be far from cluster i, thus breaking the concept of a cluster.

My question

Is there a post-k-means algorithm or a k-means variant that can meet my requirements, or am I wrong from the beginning and do I need to find another clustering algorithm?

PS: I do not mind implementing the solution myself, but it would be great if I could use a library, ideally in Java.

Pierre-David Belanger
  • How do you choose your initial clusters? – mvds Jan 09 '12 at 23:54
  • The number of clusters and their initial centroids are chosen by a user (human). – Pierre-David Belanger Jan 10 '12 at 00:03
  • What is your **optimality criterion**? I do not think that using and then "fixing" k-means results is the way to go. You can modify k-means to ensure the size remains within your constraints. – Has QUIT--Anony-Mousse Jan 10 '12 at 07:16
  • Optimality criterion? I am using k-means, so the smallest Euclidean distance between a point and a centroid. As for the whole solution, it should be, I guess, to have every point in its nearest equal-size group. When you say "_You can modify k-means to ensure the size remains within your constraints._", yes, this is implicit in my question; I will edit it to be more precise about that. Thank you. – Pierre-David Belanger Jan 10 '12 at 14:55
  • @casperOne, you closed this question as a duplicate? I actually said in my question that I tried the proposed solution in http://stackoverflow.com/questions/5452576/k-means-algorithm-variation-with-equal-cluster-size without success, and I am trying to go further here, asking if there are other solutions. But if you decide to keep it closed, I will not be mad at you :) I just do not think it is a duplicate. – Pierre-David Belanger Jan 10 '12 at 21:25
  • @Pierre-DavidBelanger: Your question doesn't do a good job of differentiating itself from the other question. If you can make a significant change to point out the differences (in the question, not the comments) then it *could* be reopened. Also, have you considered asking on http://stats.stackexchange.com? This question is somewhat off topic given the existence of that site. – casperOne Jan 10 '12 at 21:27
  • @casperOne: ok, you are probably right, I will leave this question as is / closed. Sorry, I was not aware of the existence of stats.stackexchange.com, indeed I should have asked (and will ask) my question there. Thank you. – Pierre-David Belanger Jan 10 '12 at 21:34
  • @Pierre-DavidBelanger No need to be sorry, it's what us mods are here for. – casperOne Jan 10 '12 at 21:35
  • Have you tried all the answers to the existing question? For example the tutorial at https://elki-project.github.io/tutorial/same-size_k_means ? It is Java, and it appears to satisfy your requirements? – Erich Schubert May 15 '17 at 18:26

2 Answers


Try this k-means variation:

Initialization:

  • choose k centers from the dataset at random, or, even better, using the k-means++ strategy
  • for each point, compute the distance to its nearest cluster center, and build a heap out of these distances
  • draw points from the heap and assign each to its nearest cluster, unless that cluster is already overfull; if so, compute the next-nearest cluster center and reinsert the point into the heap

In the end, you should have a partitioning that satisfies your requirement of the same number of objects per cluster, +-1 (make sure the last few clusters also have the right number: the first m clusters should have ceil(n / k) objects, the remainder exactly floor(n / k) objects). Note that using a heap ensures the clusters remain convex: if they were no longer convex, there would have been a better swap candidate.
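Not a reference implementation, just a minimal Java sketch of this initialization under those capacities, using plain double[] points (all names are made up for illustration):

```java
import java.util.*;

class EqualSizeInit {
    // Heap-based initial assignment: pop the globally closest
    // (point, center) candidate; if that cluster is full, reinsert the
    // point keyed on its next-nearest center.
    static int[] assign(double[][] pts, double[][] ctr) {
        int n = pts.length, k = ctr.length;
        int[] cap = new int[k];                     // ceil(n/k) or floor(n/k)
        for (int j = 0; j < k; j++) cap[j] = n / k + (j < n % k ? 1 : 0);

        // for each point, the centers sorted by increasing distance
        int[][] order = new int[n][k];
        for (int i = 0; i < n; i++) {
            final int p = i;
            Integer[] idx = new Integer[k];
            for (int j = 0; j < k; j++) idx[j] = j;
            Arrays.sort(idx, Comparator.comparingDouble(j -> d2(pts[p], ctr[j])));
            for (int j = 0; j < k; j++) order[i][j] = idx[j];
        }

        // heap entries: {point index, rank of the center currently tried}
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            Comparator.comparingDouble(e -> d2(pts[e[0]], ctr[order[e[0]][e[1]]])));
        for (int i = 0; i < n; i++) heap.add(new int[]{i, 0});

        int[] label = new int[n];
        int[] size = new int[k];
        while (!heap.isEmpty()) {
            int[] e = heap.poll();
            int c = order[e[0]][e[1]];
            if (size[c] < cap[c]) {                 // preferred cluster has room
                label[e[0]] = c;
                size[c]++;
            } else {                                // full: retry with next-nearest
                heap.add(new int[]{e[0], e[1] + 1});
            }
        }
        return label;
    }

    static double d2(double[] a, double[] b) {      // squared Euclidean distance
        double dx = a[0] - b[0], dy = a[1] - b[1];
        return dx * dx + dy * dy;
    }
}
```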

Iteration step:

Prerequisite: a list for each cluster with "swap proposals" (objects that would prefer to be in a different cluster).

E step: compute the updated cluster centers as in regular k-means

M step: Iterate through all points (either just one, or all in one batch).

Compute the nearest cluster center to the object (or all cluster centers that are closer than the current cluster's). If it is a different cluster:

  • If the other cluster is smaller than the current cluster, just move the point to the new cluster
  • If there is a swap proposal from the other cluster (or any cluster with a lower distance), swap the two elements' cluster assignments (if there is more than one offer, choose the one with the largest improvement)
  • Otherwise, record a swap proposal for the other cluster

The cluster sizes remain invariant (+- the ceil/floor difference), and objects are only moved from one cluster to another as long as the move improves the estimate. It should therefore converge at some point, like k-means. It might be a bit slower (i.e. take more iterations), though.
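A rough Java sketch of one M-step pass along these lines, simplified to a single pending swap offer per cluster rather than a full proposal list (plain double[] arrays; nothing here is an existing API):

```java
import java.util.*;

class EqualSizeStep {
    // One pass over all points: move to a smaller cluster, swap with a
    // pending offer if that lowers the total squared distance, or record
    // a swap proposal. Returns false once nothing changed (convergence).
    static boolean pass(double[][] pts, double[][] ctr, int[] label, int[] size) {
        int k = ctr.length;
        int[] offer = new int[k];                   // one pending offer per cluster
        Arrays.fill(offer, -1);
        boolean changed = false;

        for (int p = 0; p < pts.length; p++) {
            int a = label[p];
            int b = a;                              // nearest center overall
            double best = d2(pts[p], ctr[a]);
            for (int j = 0; j < k; j++)
                if (d2(pts[p], ctr[j]) < best) { best = d2(pts[p], ctr[j]); b = j; }
            if (b == a) continue;

            int q = offer[b];                       // a point in b wanting out?
            if (size[b] < size[a]) {                // smaller cluster: just move
                label[p] = b; size[a]--; size[b]++;
                changed = true;
            } else if (q >= 0 && label[q] == b      // pending offer: try a swap
                    && d2(pts[p], ctr[b]) + d2(pts[q], ctr[a])
                     < d2(pts[p], ctr[a]) + d2(pts[q], ctr[b])) {
                label[q] = a;                       // swap the two assignments
                label[p] = b;
                offer[b] = -1;
                changed = true;
            } else {
                offer[a] = p;                       // record p's wish to leave a
            }
        }
        return changed;
    }

    static double d2(double[] x, double[] c) {      // squared Euclidean distance
        double dx = x[0] - c[0], dy = x[1] - c[1];
        return dx * dx + dy * dy;
    }
}
```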

I do not know if this has been published or implemented before. It's just what I would try (if I were to use k-means; there are much better clustering algorithms).

Has QUIT--Anony-Mousse
  • Interesting! I like your idea of a "swap proposal" list by cluster. I will try that. Also, you say "_there are much better clustering algorithms_": I am not bound to K-means, I am very open to try other/better clustering algorithms that can help me meet my requirements (n points in k clusters of equal size). So can you point me some of those _better clustering algorithms_ that can make equal size clusters? – Pierre-David Belanger Jan 10 '12 at 21:01
  • I don't know one for this specific task. I haven't had a need for equal size so far. – Has QUIT--Anony-Mousse Jan 10 '12 at 21:59
  • It seems to me that you would want to make sure that points aren't being assigned to non-adjacent clusters, so that the resulting Voronoi cells are still convex, but I don't think your algorithm does this. – acjay Nov 19 '12 at 22:19
  • As long as it moves the best candidates first, this is not an issue. If the cluster were no longer convex, there would be a better candidate for switching. But maybe I should emphasize that obviously the candidates with the largest gain should be moved. – Has QUIT--Anony-Mousse Nov 19 '12 at 23:33
  • Note that the initialisation step does not guarantee convexity. The heap just ensures that each point added to a cluster is farther away than all the other points, but that could very well create concave seams or even disconnected chunks. – sam hocevar Mar 22 '16 at 15:04

Not being an expert on the topic, I once needed to come up with a simple algorithm to cluster locations on a map, where every point needed to be part of a cluster and clusters were constrained in several ways (not just in size, i.e. point count, but also in some other measures that depended on varying factors).

By first finding the "difficult" points and then growing clusters from there, I got the best results. "Difficult" points are points that are hard to reach, e.g. because they lie alone on the outskirts of the total area, or because they help hit another cluster boundary condition more than other points do. This helped to grow neatly aligned clusters, leaving very few loners and little corresponding handwork to place them.

This may help you if your current algorithm would normally find these difficult points last.
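As one possible reading of the growth phase, here is a minimal Java sketch of the round-robin variant suggested in the comments below, where each cluster claims its nearest unassigned point once per round so that sizes never diverge by more than one; the seed indices would be the "difficult" points, and all names are illustrative:

```java
import java.util.*;

class GreedyGrow {
    // Grow clusters one point per round from fixed seed points.
    static int[] grow(double[][] pts, int[] seeds) {
        int n = pts.length, k = seeds.length;
        int[] label = new int[n];
        Arrays.fill(label, -1);
        for (int j = 0; j < k; j++) label[seeds[j]] = j;   // seeds claim themselves
        int assigned = k;
        while (assigned < n) {
            for (int j = 0; j < k && assigned < n; j++) {  // one claim per cluster
                int best = -1;
                double bd = Double.POSITIVE_INFINITY;
                for (int p = 0; p < n; p++) {
                    if (label[p] >= 0) continue;           // already taken
                    double dx = pts[p][0] - pts[seeds[j]][0];
                    double dy = pts[p][1] - pts[seeds[j]][1];
                    double d = dx * dx + dy * dy;
                    if (d < bd) { bd = d; best = p; }
                }
                label[best] = j;
                assigned++;
            }
        }
        return label;
    }
}
```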

mvds
  • Interesting. Indeed, when the user places the initial centroids on those "difficult" points (so they are added first to a cluster and their cluster grows from there), k-means is able to produce a nice layout of clusters, but sadly those clusters are again of different sizes (point counts). I already tested that. Thank you. Did you use an existing library/algorithm to meet your needs? – Pierre-David Belanger Jan 10 '12 at 00:41
  • No, I just implemented it quick & dirty by hand; it started out as a simple proof of concept, and performance was no issue. You should grow your clusters point by point, i.e. in every round, grow each cluster by one. Then the point count cannot diverge. – mvds Jan 10 '12 at 00:43
  • Haha, yes, I tried that: growing all the clusters by one nearest point in a loop until no point is left alone. This leads to the following problem (near the end of the algorithm): let's say we have the last point to place; this point is near the left of cluster A, but cluster A is already full, so we have to choose cluster B, which is to the right of cluster A. B will now contain a point that should have belonged to A. – Pierre-David Belanger Jan 10 '12 at 00:51
  • That sounds like something easily solved in a simple post-processing step. Something like: On the boundary between A and B, find the points with the weakest links to cluster A and B respectively, and let them switch sides if it would increase overall energy. You can count the number of switches thus needed to get to the point of highest energy in order to benchmark the first step(s) of your algorithm. – mvds Jan 10 '12 at 01:19