1

I am looking for an algorithm that solves the following problem:

  • Given: a set of items and their similarity matrix.
  • Goal: group these items in "clusters" of minimum size m
  • Conditions:
    • There are no cluster-like structures in the dataset, as shown in Figure 1
    • Anyway, the items in a group should be similar to each other. Thus, the global similarity would be high.

The motivation is not to identify good clusters but to split a dataset into groups of high similarity and of minimum size. Partitioning around medoids does not work out-of-the box, it would produce a lot of 1-item-clusters. Hierarchical approaches (AGNES, DIANA) does not help either.

This problem is someway similar to Stable Marriage problems: one tries to rank the neighbored items by similarity. But here, there are at least m items in one group / one marriage.

Thanks in advance!

Figure 1.

CiaPan
  • 9,381
  • 2
  • 21
  • 35
lkegel
  • 193
  • 9
  • What about a simple k-means? Its objective function is to maximise the homogeneity/similarity of each cluster. It usually finds groups of similar size, and by defining the number of clusters a priori you can roughly obtain an average number of points/cluster. Alternative is to run k-means in two steps: i) get the centroids (local cluster centers), ii) assign the points based on the distance from each centroid with an added constrain that a minimum number of points should be allocated. Any thoughts? – TWL Jun 02 '16 at 18:41

2 Answers2

2

This is not clustering then. A clustering algorithm should tell your that there are no clusters there. Your task sounds more like bin packing, knapsack and similar optimization problems to me.

Without any further constraints, your problem is also underspecified.

Why don't you try a greedy heuristic (which is common to use for knapsack like problems). Choose any point at random, add enough points to satisfy your minimum size constraint.

Then choose the furthest point from this, and add again enough points to satisfy your minimum size. Repeat (using sum of distances for ranking), until you no longer can satisfy the minimum size. Then add every remaining point to the nearest cluster each. Finally, do an optimization pass moving points to other clusters as long as the minimum size is satisfied?

Has QUIT--Anony-Mousse
  • 76,138
  • 12
  • 138
  • 194
0

While this is not a typical clustering task (see @Anony-Mousse), you can modify a clustering algorithm to suit your needs.

You can follow this Tutorial for Same-Size K-Means, or simply use this algorithm from the tutorial package/module in ELKI (build the latest version from GitHub, because I just fixed a bug there).

Essentially, this algorithm performs a k-means style least-squares optimization, but all clusters must have the same size (if N/k is not integer, the cluster sizes may vary by 1).

If you go to above tutorial and scroll to the bottom, you can see example results.

Erich Schubert
  • 8,575
  • 2
  • 26
  • 42