R: How to get clusters of roughly the same size from dendrogram

Question

I tried to group students by their interests. The groups should have roughly the same size, even if this means that some students don't really share interests with their group members if they don't fit into any of the groups.

I used R's hclust() function and got a really nice dendrogram, so that works perfectly. But when I try to set clusters using cutree(), I can only adjust h (the height at which the tree is cut) or k (the desired number of groups). The problem is that even if I set the number of groups to a certain value, I get some groups that are way smaller than the others.

If you look at the plotted tree, there are some students whose interests are completely different from those of the rest, so I guess that's the reason why it happens.

What I'd like to do to prevent this is to "forbid" groups below a certain minimum size, so if there are such groups they are added to another small group or something like that. Is there an easy way to do this or do I have to write my own function to clean up a bit after the clustering?

I found similar questions on StackOverflow (e.g. this one) but none of them is flagged as answered, and in the particular case I mentioned, I'm afraid I don't really get the proposed solution.

Thanks in advance for your input!

Merle

If you don't really need a "hierarchical" clustering method, but just want to cluster your students in equal-sized groups, you might look at the function `balanced_clustering()` from the `anticlust` package. (Or `matching()` in the same package, which pretty much does the same but you can specify the size of the groups rather than the number of clusters). — M. Papenberg, Nov 02 '20 at 14:18

score 1 · Accepted Answer · answered Nov 02 '20 at 17:45

As Merle noted in a comment, the solution does not have to be based on a hierarchical clustering method.

You can use the function balanced_clustering() from the anticlust package to create clusters of equal size. This is an example using the iris data set:

library(anticlust)

data(iris)

iris$group <- balanced_clustering(
  iris[, -5],
  K = nrow(iris) / 5 # 5 plants per group
)

The output is a vector indicating group membership. For example, this is one group of similar plants:

subset(iris, group == 1)
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species group
#> 1           5.1         3.5          1.4         0.2  setosa     1
#> 5           5.0         3.6          1.4         0.2  setosa     1
#> 8           5.0         3.4          1.5         0.2  setosa     1
#> 18          5.1         3.5          1.4         0.3  setosa     1
#> 40          5.1         3.4          1.5         0.2  setosa     1

Note that I used the four numeric criteria for clustering, not the "Species".
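To check that the groups really have the same size, you can tabulate the membership vector. This is only a quick sanity check; it assumes the number of rows is divisible by the group size (150 / 5 = 30 groups here), and the variable name `groups` is just for illustration:

```r
library(anticlust)

data(iris)

# Same call as above: 30 clusters of 5 plants each
groups <- balanced_clustering(iris[, -5], K = nrow(iris) / 5)

# Number of plants in each group
table(groups)

# Distribution of group sizes (how many groups have which size)
table(table(groups))
```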

The same can be done using anticlust::matching(), where you specify the size of the groups (argument p) instead of the number of clusters:

matching(iris[, -5], p = 5)
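If you would rather stay with the dendrogram, the clean-up step described in the question can be sketched in base R. The function `merge_small_clusters()` below is a hypothetical helper, not part of any package, and merging by nearest centroid (squared Euclidean distance) is just one possible rule. Note that it only enforces a minimum group size; it does not make the groups equal-sized:

```r
# Hypothetical helper: repeatedly merge the smallest cluster into the
# cluster with the nearest centroid until no cluster is below `min_size`.
merge_small_clusters <- function(x, groups, min_size) {
  x <- as.matrix(x)
  groups <- as.integer(groups)
  repeat {
    sizes <- table(groups)
    if (length(sizes) < 2 || min(sizes) >= min_size) break
    small <- names(sizes)[which.min(sizes)]
    # Centroids, one row per cluster (rows are named by cluster label)
    centroids <- rowsum(x, groups) / as.vector(sizes)
    # Squared distance from the small cluster's centroid to every centroid
    d <- colSums((t(centroids) - centroids[small, ])^2)
    d[small] <- Inf
    nearest <- names(d)[which.min(d)]
    groups[groups == as.integer(small)] <- as.integer(nearest)
  }
  # Relabel the remaining clusters as 1, 2, ...
  as.integer(factor(groups))
}

data(iris)
hc <- hclust(dist(iris[, -5]))
groups <- cutree(hc, k = 10)
table(groups)   # sizes straight from the dendrogram

merged <- merge_small_clusters(iris[, -5], groups, min_size = 10)
table(merged)   # sizes after merging the small clusters
```

The number of clusters you end up with can be smaller than the k passed to cutree(), because every merge removes one cluster.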
