5

Is there an easy way to calculate lowest value of h in cut that produces groupings of a given minimum size?

In this example, if I wanted clusters with at least ten members each, I should go with h = 3.80:

# using iris data simply for reproducible example
data(iris)
d <- data.frame(scale(iris[,1:4]))
hc <- hclust(dist(d))
plot(hc)

cut(as.dendrogram(hc), h=3.79) # produces 5 groups; group 4 has 7 members

cut(as.dendrogram(hc), h=3.80) # produces 4 groups; no group has <10 members

Since the heights of the splits are given in hc$height, I could create a set of candidate values using hc$height + 0.00001 and then loop through cuts at each of them. However, I don't see how to parse the cluster size members out of the dendrogram class. For example, cut(as.dendrogram(hc), h=3.80)$lower[[1]]$members returns NULL, not 66 as desired.

Please note that this is a simpler question than Cutting dendrogram into n trees with minimum cluster size in R which uses the package dynamicTreeCut; here I am not specifying number of trees, just minimum cluster size. TYVM.

Community
  • 1
  • 1
C8H10N4O2
  • 18,312
  • 8
  • 98
  • 134
  • 1
    as a side note: members is an attribute - `attr(cut(as.dendrogram(hc), h=3.80)$lower[[1]], "members")` yields 66. – lukeA Jun 29 '15 at 20:35

3 Answers3

3

Thanks to @Vlo and @lukeA I'm able to implement a loop. However, I am just posting this for a starting point and certainly open to a more elegant solution.

unnest <- function(x) { # from Vlo's answer
  if(is.null(names(x))) x
  else c(list(all=unname(unlist(x))), do.call(c, lapply(x, unnest)))
}

cuts <- hc$height + 1e-9

min_size <- 10
smallest <- 0
i <- 0

while(smallest < min_size & i <= length(cuts)){
  h_i <- cuts[i <- i+1]
  if(i > length(cuts)){
    warning("Couldn't find a cluster big enough.")
  }
  else  smallest <- 
           Reduce(min, 
                  lapply(X = unnest(cut(as.dendrogram(hc), h=h_i)$lower), 
                         FUN = attr, which = "members") ) # from lukeA's comment
}
h_i # returns desired output: [1] 3.79211
C8H10N4O2
  • 18,312
  • 8
  • 98
  • 134
2

This feature is available in the dendextend package with the heights_per_k.dendrogram function (which also has a faster C++ implementation when loading the dendextendRcpp function).

## Not run: 
hc <- hclust(dist(USArrests[1:4,]), "ave")
dend <- as.dendrogram(hc)
heights_per_k.dendrogram(dend)
##       1        2        3        4
##86.47086 68.84745 45.98871 28.36531

As a sidenote, the dendextend package has a cutree.dendrogram S3 method for dendrograms (which works very similarly to cutree for hclust objects).

Tal Galili
  • 24,605
  • 44
  • 129
  • 187
  • thank you for the answer (and for creating the dendextend package, +1). I have rolled back your tag edit because my question is not a [dendextend] question per se, and while it may be convenient, [dendextend] is not the only solution. [Context](http://meta.stackoverflow.com/questions/252079/tagging-a-question-based-on-its-answers) – C8H10N4O2 Jul 11 '15 at 15:50
  • Hello @C8H10N4O2. I'm glad you like dendextend :) (any chance for a "V" vote?). As for removing the tag - I see the meta you linked to. In my opinion, since dendextend is a package designed to answer such questions - I would prefer that when people look at the dendextend tag they would find this solution. i.e.: http://meta.stackexchange.com/questions/26913/should-i-retag-a-question-with-a-tag-that-is-based-on-the-answer-and-not-the-que/26914#26914 Since this is your Q - it is obviously your decision. – Tal Galili Jul 11 '15 at 16:45
1

This doesn't answer the question, but might be useful for members extraction if you decide to loop through the h.

Stealing and modifying some code from here

# Unnest the list/dendogram structure
unnest <- function(x) {
  if(is.null(names(x))) {
    x
  }
  else {
    c(list(all=unname(unlist(x))), do.call(c, lapply(x, unnest)))
  }
}

# Extract the `members` attribute from each dendogram
lapply(X = unnest(cut(as.dendrogram(hc), h=3.8)), FUN = attr, which = "members")

Output:

# Please don't ask me why there are 2 dendograms stored
# in the `$upper` list while `print` displays one

$upper1
[1] 2

$upper2
[1] 2

$lower1
[1] 66

$lower2
[1] 11

$lower3
[1] 24

$lower4
[1] 49
Community
  • 1
  • 1
Vlo
  • 3,168
  • 13
  • 27