I'm trying to use hierarchical clustering (specifically hclust) to cluster a data set into 10 groups with sizes of 100 members or fewer, and with no group having more than 40% of the total population. The only method I currently know is to repeatedly use cut() and select progressively lower values of h until I'm happy with the dispersion of the cuts. However, this forces me to go back and re-cluster the groups I pruned in order to aggregate them into 100-member groups, which can be very time-consuming.
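To make the manual approach concrete, here is a rough sketch of the loop I mean. It is only an illustration under some assumptions: it uses the built-in USArrests data as a stand-in for my real data, a hypothetical size cap of 20, and cutree() with h rather than cut(), since cutree() returns the group memberships directly.

```r
# Stand-in data (assumption): the built-in USArrests data set, 50 rows
hc <- hclust(dist(scale(USArrests)))

# Hypothetical cap on cluster size, chosen only for this illustration
max_size <- 20

# Candidate cut heights: the merge heights, from the top of the tree down
heights <- sort(hc$height, decreasing = TRUE)

# Lower the cut height until no cluster exceeds the cap
for (h in heights) {
  groups <- cutree(hc, h = h)
  sizes <- table(groups)
  if (max(sizes) <= max_size) break
}

# Cluster sizes at the first height that satisfies the cap
sizes
```

The cap is satisfied when the loop stops, but nothing controls how many groups come out or how small they are, which is why the pruned groups then have to be re-clustered by hand.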
I've experimented with the dynamicTreeCut package, but can't figure out how to enter these (relatively simple) limitations. I'm using deepSplit as the way to designate the number of groupings but, following the documentation, this limits the maximum number to 4. For the exercise below, all I'm looking to do is get the clusters into 5 groups of 3 or more individuals. (I can deal with the maximum size limitation on my own, but if you want to tackle that too, it would be helpful!)
Here's my example, using the Orange dataset.
library(dynamicTreeCut)
library(reshape2)
## Create 14 individuals from Orange's original 5
Orange1 <- Orange
Orange1$Tree <- as.numeric(as.character(Orange1$Tree))
Orange2 <- Orange1
Orange3 <- Orange1
Orange2$Tree <- Orange2$Tree + 6
Orange3$Tree <- Orange3$Tree + 11
combOr <- rbind(Orange1, Orange2[1:28, ], Orange3)
## Cast the data to wide format (one column per tree), build a
## correlation matrix, and run a hierarchical clustering on it
castOrange <- dcast(combOr, age ~ Tree, mean, value.var = "circumference", fill = 0)
## Add two identical synthetic individuals as columns 16 and 17
castOrange[, 16] <- c(1, 34, 5, 35, 34, 35, 21)
castOrange[, 17] <- c(1, 34, 5, 35, 34, 35, 21)
orangeCorr <- cor(castOrange[, -1])
orangeClust <- hclust(dist(orangeCorr))
## Run the dynamic tree cut
dynamicCut <- cutreeDynamic(orangeClust, minClusterSize = 3, method = "tree", deepSplit = 4)
dynamicCut
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0
As you can see, it only designates two clusters. For my exercise, I want to avoid using an explicit height term to cut the tree, as I want to specify k, the number of groups, instead.
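To show the kind of result I'm after, here is a minimal sketch of the check I have in mind: cut into k groups, then verify the minimum size. It uses base R's cutree() with k and the built-in USArrests data as a stand-in (both are assumptions for illustration, not part of my example above). cutree() itself has no argument for a minimum cluster size, so the check has to happen afterwards.

```r
# Stand-in data (assumption): the built-in USArrests data set
hc <- hclust(dist(scale(USArrests)))

k <- 5         # desired number of groups
min_size <- 3  # desired minimum members per group

# Cut the tree into exactly k groups, with no height term involved
groups <- cutree(hc, k = k)
sizes <- table(groups)
sizes

# TRUE only if every group meets the minimum size; cutree() does not enforce this
meets_min <- all(sizes >= min_size)
meets_min
```

What I'm missing is a way to get both properties at once: a fixed k and a guaranteed minimum (and ideally maximum) group size.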