0

I have a finite number of points (cloud), with a metric defined on them. I would like to find the maximum amount of clusters in this cloud such that:

1) the maximum distance between any two points in one cluster is smaller a given epsilon (const)

2) each cluster has exactly k (const) points in it

I looked at all kind of different clustering methods and clustering with a restriction on the inner maximum distance is not a problem (density based). The 2) constrain and the requirement to find "the maximum amount of clusters s.t." seem to be problematic though. Any suggestions for an efficient solution?

Thank you, A~

aZen
  • 223
  • 1
  • 8

2 Answers2

1

Given your constraints, there might be no solution. And actually, that may happen quite often...

The most obvious case is when you don't have a multiple of k points.

But also if epsilon is set too low, there might be points that cannot be put into clusters anymore.

I think you need to rethink your requirements and problem, instead of looking for an algorithm to solve an unreasonably hard requirement that might not be satisfiable.

Also consider whether you really need to find the guaranteed maximum, or just a good solution.

There are some rather obvious approaches that will at least find a good approximation fast.

Has QUIT--Anony-Mousse
  • 76,138
  • 12
  • 138
  • 194
  • 1
    Could you specify the obvious approaches a bit? To me they are (unfortunately) not obvious. Also if the constraints are not met (no solution) the epsilon value would be increased and the query re-run (if more than k points). – aZen Feb 16 '13 at 19:00
1

I have the same impression as @Anony-Mousse, actually: you have not understood your problem and requirements yet.

If you want your cluster sizes to be k, there is no question of how many clusters you will get: it's obviously n /k. So you can try to use a k-means variant that produces clusters of the same size as e.g. described in this tutorial: Tutorial on same-size k-means and set the desired number of cluster to n/k.

Note that this is not a particular sensible or good clustering algorithm. It does something to satisfy the constraints, but the clusters are not really meaningful from a cluster analysis point of view. It's constraint satisfaction, but not cluster analysis.

In order to also satisfy your epsilon constraint, you can then start off with this initial solution (which probably is what @Anony-Mousse referred to as "obvious approaches") and try to perform the same kind of optimization-by-swapping-elements in order to satisfy the epsilon condition.

You may need a number of restarts, because there may be no solution.

See also:

Group n points in k clusters of equal size

K-means algorithm variation with equal cluster size

for essentially redundant questions.

Community
  • 1
  • 1
Erich Schubert
  • 8,575
  • 2
  • 26
  • 42
  • Thanks for the reply. Most points will not belong to any cluster, so clustering into k clusters of equal size does not help me (I actually found those questions you linked before). I might be able to use DBScan with my epsilon value and then split the clusters that are >= 2*k. That sounds like it could work! – aZen Feb 17 '13 at 20:48
  • 1
    I get the impression you are **not looking for cluster analysis**, but instead for a variant of the [Maximum set cover problem](https://en.wikipedia.org/wiki/Maximum_coverage_problem). See: clustering tries to find structure, whereas you have a predefined structure, and look for the maximum cover possible using this structure. – Erich Schubert Feb 19 '13 at 07:27
  • Thank you! That last reply really helped me understanding my problem better. – aZen Feb 20 '13 at 10:46