
I am using R to plot a dendrogram of a hierarchical clustering.

I have performed a hierarchical clustering of ~3000 elements. The plot of the corresponding tree is, unsurprisingly, very messy. These 3000 elements are grouped into 20 clusters using the cutree function. What I want is to plot the tree by cluster, i.e. truncated at the nodes where each cluster originates and labeled appropriately by cluster, giving a tree with 20 terminal leaves.

Thanks

O.

Oselm
  • Welcome to Stack Overflow! Please read the info about [how to ask a good question](http://stackoverflow.com/help/how-to-ask) and how to give a [reproducible example](http://stackoverflow.com/questions/5963269). This will make it much easier for others to help you. – zx8754 Mar 28 '17 at 06:55

1 Answer


You can restrict ylim so that only the part of the tree above the cut height is drawn:

With random data:

set.seed(123)
testdata <- matrix(rnorm(300), ncol=3)
testcah <- hclust(dist(testdata))

The heights of the successive merges of the clustering are stored in testcah$height, from the first merge to the last. If, for example, you want 5 groups, you need the 4th height from the end:

floor_y <- rev(testcah$height)[5-1]
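
As a quick sanity check on that index, the sketch below compares cutting into k groups with cutting at a height just below floor_y. It assumes the merge heights are increasing with no ties, which holds for this random example; the names k, h_sorted, cut_h, groups_k and groups_h are introduced here for illustration only.

```r
# Rebuild the example data so this block runs on its own
set.seed(123)
testdata <- matrix(rnorm(300), ncol = 3)
testcah <- hclust(dist(testdata))

k <- 5
h_sorted <- rev(testcah$height)       # merge heights, largest first
floor_y <- h_sorted[k - 1]            # height of the merge that turns k groups into k - 1
cut_h <- mean(h_sorted[c(k - 1, k)])  # any height strictly between the two would do

# Cutting by height just below floor_y should give the same partition as k groups
groups_k <- cutree(testcah, k = k)
groups_h <- cutree(testcah, h = cut_h)
all(groups_k == groups_h)
length(unique(groups_k))
```

If the two partitions agree, everything drawn above floor_y is exactly the part of the tree that joins the k clusters together.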

Then, after converting the object to a dendrogram, you can plot only the part you need:

testdend <- as.dendrogram(testcah)
plot(testdend, ylim=c(floor_y, attributes(testdend)$height))
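
A different route from the ylim approach, sketched here as an alternative rather than as part of the method above, is to truncate the tree itself with cut() on the dendrogram. It returns an upper tree whose leaves are named "Branch 1", "Branch 2", ... from left to right, plus one lower sub-dendrogram per branch. The relabel helper and the names cut_h, pieces and branch_cluster are hypothetical, introduced only for this sketch.

```r
# Rebuild the example data so this block runs on its own
set.seed(123)
testdata <- matrix(rnorm(300), ncol = 3)
testcah <- hclust(dist(testdata))
testdend <- as.dendrogram(testcah)

k <- 5
h_sorted <- rev(testcah$height)
cut_h <- mean(h_sorted[c(k - 1, k)])  # strictly between the two merge heights

# cut() on a dendrogram returns list(upper = ..., lower = ...)
pieces <- cut(testdend, h = cut_h)
n_branches <- length(pieces$lower)    # one sub-dendrogram per cluster
branch_sizes <- sapply(pieces$lower, function(d) attr(d, "members"))

# Branches are numbered left to right, so map them to the cutree ids
# by reading the cluster ids in dendrogram order
runs <- rle(cutree(testcah, k = k)[testcah$order])
branch_cluster <- runs$values

# Hypothetical helper: replace "Branch i" with the matching cutree id
relabel <- function(node) {
  if (is.leaf(node)) {
    i <- as.integer(sub("Branch ", "", attr(node, "label")))
    attr(node, "label") <- paste("cluster", branch_cluster[i])
  }
  node
}
upper_tree <- dendrapply(pieces$upper, relabel)

plot(upper_tree, main = "Tree truncated to k clusters")
```

The truncated tree has one terminal leaf per cluster, and branch_sizes gives the number of original elements behind each leaf.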

If you want to label the branches with the cluster labels defined by cutree, you need to get the labels (by reordering the cutree result) and find where to put them along the x axis. This information can be obtained by "decomposing" the dendrogram to find the needed midpoints.

First, get the labels of (all) the leaves:

testlab <- cutree(testcah, 5)[testcah$order]
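
To see why reordering by testcah$order is useful, the short sketch below run-length encodes the reordered labels. Each cluster should then appear as a single contiguous run along the x axis. The name runs is introduced here for illustration.

```r
# Rebuild the example data so this block runs on its own
set.seed(123)
testdata <- matrix(rnorm(300), ncol = 3)
testcah <- hclust(dist(testdata))

testlab <- cutree(testcah, 5)[testcah$order]
runs <- rle(testlab)
runs$values   # cluster ids, from left to right along the x axis
runs$lengths  # number of leaves in each cluster
```

Because every cluster forms one run, unique(testlab) lists the cluster ids in the same left-to-right order as the branches on the plot.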

Then we use a recursive function to find the midpoints of the subdendrograms that lie at the desired height:

find_x <- function(dendro, ordrecah, cutheight) {
  if (!is.null(attributes(dendro)$leaf)) {
    # leaf: return its position in the global dendrogram
    # (assumes leaf labels are observation indices, as when the data has no row names)
    return(which(ordrecah == attributes(dendro)$label))
  }
  if (attributes(dendro)$height < cutheight) {
    # below the height threshold: return the node's midpoint
    return(attributes(dendro)$midpoint)
  }
  # above the height threshold: recurse into the 2 subparts of the dendrogram
  c(find_x(dendro[[1]], ordrecah, cutheight),
    find_x(dendro[[2]], ordrecah, cutheight))
}

So we can get the midpoints or leaf positions with:

test_x <- find_x(testdend, testcah$order, floor_y)

However, a midpoint is the distance between the leftmost leaf of the cluster and the node, so for a cluster with more than one member we need to add the position of that leftmost leaf.

length_clus <- rle(testlab)$lengths # get the number of members by cluster
test_x[length_clus > 1] <- (test_x + head(c(1, cumsum(length_clus)+1), -1))[length_clus > 1]
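
The offset arithmetic can be followed on a small made-up case. The numbers below are hypothetical and not taken from the example data: three clusters of sizes 3, 1 and 4, where the first and third values are midpoints and the second is already a leaf position.

```r
# Hypothetical values, chosen only to illustrate the correction
length_clus <- c(3, 1, 4)    # cluster sizes, from left to right
raw_x <- c(0.75, 4, 1.5)     # midpoints for clusters 1 and 3, leaf position for cluster 2

# Position of the leftmost leaf of each cluster: 1, 4, 5
first_leaf <- head(c(1, cumsum(length_clus) + 1), -1)

x <- raw_x
x[length_clus > 1] <- (raw_x + first_leaf)[length_clus > 1]
x  # 1.75 4.00 6.50
```

The single-member cluster keeps its leaf position, while the other two are shifted by the position of their leftmost leaf.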

Finally, put the labels on the plot:

mtext(side=1, at=test_x, line=0, text=unique(testlab))


Cath
  • Thanks Cath. But then how would you tell which cluster is which compared to the cutree function? How would you label on the plot? – Oselm Mar 24 '17 at 15:09
  • @Oselm please see the edit to place the clusters labels – Cath Mar 27 '17 at 08:18