
I am clustering a distance matrix based on a 20,000 row x 169 column data set in R using hclust(). When I convert the cluster object to a dendrogram and plot the entire dendrogram, it is too large to read, even when I output it to a fairly large PDF.

library(vegan)  # provides vegdist()
df <- as.data.frame(matrix(abs(rnorm(3380000)), nrow = 20000))
mydist <- vegdist(df)
my.hc <- hclust(mydist, method = "average")
hcd <- as.dendrogram(my.hc)

pdf("hclust_plot.pdf", width = 40, height = 15)
plot(hcd)
dev.off()

I would like to specify the number of clusters (k) at which to truncate the dendrogram, then plot only the upper portion of the dendrogram above the k split points. I know I can plot the upper portion based on specifying a height (h) using the function cut().

pdf("hclust_plot2.pdf", width = 40, height = 15)
plot(cut(hcd, h = 0.99)$upper)
dev.off()

I also know I can use the dendextend package to color the dendrogram plot with the k groups.

library(dendextend)
pdf("hclust_plot3.pdf", width = 40, height = 15)
plot(color_branches(hcd, k = 44))
dev.off()

But for my data set, this dendrogram is too dense to tell which group has which color. Is there a way to plot only the upper portion of the dendrogram above the cut point by specifying k instead of h? Or is there a way to get the h value for a dendrogram, given k?

Tal Galili
jk22
  • Without delving too deep into this, this project might be of interest to you: https://github.com/thomasp85/ggraph – boshek Jan 15 '16 at 20:52
  • This SO question offers quite a bit for you, it seems, and references the dendextend package: http://stackoverflow.com/questions/31124810/r-cut-dendrogram-into-groups-with-minimum-size – lawyeR Jan 16 '16 at 02:18

1 Answer


You can use the heights_per_k.dendrogram function from the dendextend package to get the cut height for each number of clusters k.

For example:

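hc <- hclust(dist(USArrests[1:4,]), "ave")
dend <- as.dendrogram(hc)

library(dendextend)
# Named vector: the names are k, the values are heights that cut the tree into k clusters
dend_h <- heights_per_k.dendrogram(dend)
par(mfrow = c(1,2))
plot(dend)
# Show only the part of the tree between the k = 3 cut and the root
plot(dend, ylim = c(dend_h["3"], dend_h["1"]))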


And in your case:

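set.seed(2016-01-16)  # note: R evaluates this as arithmetic, i.e. set.seed(1999)
df <- as.data.frame(matrix(abs(rnorm(2*20000)), nrow = 20000))
mydist <- dist(df)
my.hc <- hclust(mydist, method = "average")
hcd <- as.dendrogram(my.hc)

library(dendextend)
library(dendextendRcpp)  # optional: faster versions of some dendextend functions, if installed
dend_h <- heights_per_k.dendrogram(hcd) # (this can take some time)
# Plot only the region from the k = 43 cut height up to the root
plot(hcd, ylim = c(dend_h["43"], dend_h["1"]))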
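
As a possible alternative, here is a minimal base-R sketch of getting h from k directly from the hclust object. It assumes a linkage whose merge heights are non-decreasing (such as "average") and no tie at the cut point. The helper name height_for_k() is hypothetical and not part of any package, and USArrests stands in for the real data.

```r
# Hypothetical helper (not from any package): a height that cuts `hc` into k clusters.
# Assumes a monotone linkage (e.g. "average"), no tied heights at the cut,
# and 2 <= k <= n - 1.
height_for_k <- function(hc, k) {
  n <- length(hc$height) + 1          # number of observations
  stopifnot(k >= 2, k <= n - 1)
  h <- sort(hc$height)                # merge heights, ascending
  mean(h[c(n - k, n - k + 1)])        # midpoint between the two relevant merges
}

hc <- hclust(dist(USArrests), method = "average")
h5 <- height_for_k(hc, 5)

# Compare with cutree(), then plot only the part of the tree above the cut
table(cutree(hc, h = h5))
upper <- cut(as.dendrogram(hc), h = h5)$upper
plot(upper)
```

The resulting height can be passed to cut(hcd, h = ...)$upper as in the question, so only the top of the tree is drawn.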


Tal Galili
  • Thanks for your help. I know how to color the branches using color_branches() but do you know how I can label these branches so that I can read which group is which? – jk22 Jan 26 '16 at 19:03
  • Use the "groupLabels = TRUE" parameter in color_branches – Tal Galili Jan 27 '16 at 11:27
  • Thanks @Tal When I try 'd1 <- color_branches(hcd, k = 43, groupLabels = TRUE)', then 'plot(d1, ylim = c(dend_h["43"], dend_h["1"]))' some of the labels do not appear, I assume because they're lower than the lowest height plotted. Is there a way to control where those labels appear? – jk22 Jan 27 '16 at 19:48
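
For reference, a small sketch of the groupLabels argument mentioned in the comments, plotted on a full (untruncated) tree so that all labels are visible. USArrests and k = 4 are stand-ins chosen for illustration, not values from the question.

```r
library(dendextend)

hc <- hclust(dist(USArrests), method = "average")
dend <- as.dendrogram(hc)

# Color the 4 clusters and label each one with its cluster number
d1 <- color_branches(dend, k = 4, groupLabels = TRUE)
plot(d1)
```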