
I have a data frame that I am clustering with hclust. It contains a FLAG column, and I would like to color the dendrogram by it so that I can see from the plot how similar the FLAG categories are to each other. The data frame looks something like this:

FLAG    ColA    ColB    ColC    ColD

I am clustering on ColA, ColB, ColC and ColD, and I would like to color the result according to the FLAG category: for example, red if FLAG is 1 and blue if it is 0 (there are only two categories). Right now I am using the plain version of the cluster plot:

hc<-hclust(dist(data[2:5]),method='complete')
plot(hc)

Any help in this regard would be highly appreciated.

Tal Galili
Patthebug

2 Answers


If you want to color the branches of a dendrogram based on a certain variable, then the following code (largely taken from the help page for the dendrapply function) should give the desired result:

x<-1:100
dim(x)<-c(10,10)
groups<-sample(c("red","blue"), 10, replace=TRUE)

x.clust<-as.dendrogram(hclust(dist(x)))

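local({
  colLab <<- function(n) {
    if(is.leaf(n)) {
      a <- attributes(n)
      i <<- i+1
      # color the edge leading to this leaf, keeping any existing edgePar
      attr(n, "edgePar") <-
        c(a$edgePar, list(col = mycols[i]))
    }
    n
  }
  # dendrapply visits the leaves left to right, so put the colors in leaf order
  mycols <- groups[order.dendrogram(x.clust)]
  i <- 0
})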

x.clust.dend <- dendrapply(x.clust, colLab)
plot(x.clust.dend)
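
Applied to a data frame shaped like the one in the question, the same idea might look like the sketch below. The data values are made up for illustration, and `row_cols`, `leaf_cols` and `colour_leaf` are names I chose, not anything from the question. It assumes FLAG only takes the values 0 and 1.

```r
# Toy stand-in for the question's data frame (values are made up).
set.seed(42)
data <- data.frame(
  FLAG = rep(c(0, 1), each = 5),
  ColA = rnorm(10), ColB = rnorm(10), ColC = rnorm(10), ColD = rnorm(10)
)

hc   <- hclust(dist(data[2:5]), method = "complete")
dend <- as.dendrogram(hc)

# One colour per row, in the original row order: red if FLAG is 1, blue if 0.
row_cols <- ifelse(data$FLAG == 1, "red", "blue")

# order.dendrogram() gives the row index of each leaf from left to right,
# which is the order in which dendrapply() reaches the leaves.
leaf_cols <- row_cols[order.dendrogram(dend)]

i <- 0
colour_leaf <- function(n) {
  if (is.leaf(n)) {
    i <<- i + 1
    # colour the edge leading to this leaf, keeping any existing edgePar
    attr(n, "edgePar") <- c(attr(n, "edgePar"), list(col = leaf_cols[i]))
  }
  n
}

dend_col <- dendrapply(dend, colour_leaf)
plot(dend_col)
```

The reordering step matters: without it the colours are handed out in row order while the leaves are visited in tree order, so a leaf can end up with another row's colour.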
Arhopala
  • I'm not sure if I am doing something wrong here, but this doesn't work for me. I changed the `groups` variable to the column that I wanted and ran the exact same code, replacing `x` with my own data frame. I get the usual dendrogram without colors. – Patthebug Apr 27 '14 at 23:07
  • Hi Patthebug, could you give a small example of your data frame? Note that in the example above my `groups` variable is a character vector containing the colours "blue" and "red", so your flag vector also needs to be a character vector of colours. If your variable instead holds two group names, say apples and oranges, you will need to create a new vector that maps each group to the colour you want: `Flag <- sample(c("apples","oranges"), 10, replace=TRUE)`, then `Flag.colours <- gsub("apples","red", Flag)` and `Flag.colours <- gsub("oranges","blue", Flag.colours)`. – Arhopala Apr 27 '14 at 23:27
  • You would then use Flag.colours to colour your branches. Hope this helps. – Arhopala Apr 27 '14 at 23:28
  • I have a similar question. However, this code only works if you 1) first extract the order of the tips from the resulting cluster, 2) assign the tips to the categories, and 3) assign a colour to each category. Is there any way to assign the colours to the tips automatically, based on a data frame that contains the tip IDs and their category or colour, the way the `merge` function would? That way we would not have to extract the tips from the cluster. Thank you very much for this post! – Ruben Nov 15 '15 at 16:16
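
One possible way to do what Ruben asks, as a minimal base-R sketch: look each leaf up by its label instead of by its position, so the tip order never has to be extracted. The matrix `m`, the `lookup` table, its column names and the `palette_by_category` vector are all hypothetical, and the sketch assumes every leaf label appears exactly once in `lookup$id`.

```r
# Hypothetical inputs: a numeric matrix whose row names are the tip IDs, and
# a lookup table giving a category per ID (deliberately in a different order).
set.seed(1)
m <- matrix(rnorm(40), nrow = 8, dimnames = list(paste0("tip", 1:8), NULL))
lookup <- data.frame(
  id       = paste0("tip", 8:1),
  category = rep(c("apples", "oranges"), times = 4),
  stringsAsFactors = FALSE
)
palette_by_category <- c(apples = "red", oranges = "blue")

dend <- as.dendrogram(hclust(dist(m)))

# Named vector mapping tip ID -> colour.
col_by_id <- setNames(palette_by_category[lookup$category], lookup$id)

colour_by_label <- function(n) {
  if (is.leaf(n)) {
    # the leaf's own label is the key, so traversal order does not matter
    tip_col <- unname(col_by_id[attr(n, "label")])
    attr(n, "edgePar") <- c(attr(n, "edgePar"), list(col = tip_col))
    # also colour the label text; pch = NA suppresses the node symbol
    attr(n, "nodePar") <- c(attr(n, "nodePar"), list(lab.col = tip_col, pch = NA))
  }
  n
}

dend_col <- dendrapply(dend, colour_by_label)
plot(dend_col)
```

Because the lookup is by name, this does not need a counter or a reordered colour vector.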

I think Arhopala's answer is good. I took the liberty of going a step further and added the function assign_values_to_leaves_edgePar to the dendextend package (starting from version 0.17.2, which is now on GitHub). This function is a bit more robust and flexible than the code in Arhopala's answer because:

  1. It is a general function which can work in different problems/settings
  2. The function can deal with other edgePar parameters (col, lwd, lty)
  3. The function recycles vectors that are shorter than the number of leaves, and issues warning messages when needed.

To install the dendextend package you can use install.packages('dendextend'), but for the latest version, use the following code:

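require2 <- function (package, ...) {
    # character.only = TRUE is needed because the package name is passed as a string
    if (!require(package, character.only = TRUE)) install.packages(package)
    library(package, character.only = TRUE)
}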

## require2('installr')
## install.Rtools() # run this if you are using Windows and don't have Rtools installed (you must have it for devtools)

# Load devtools:
require2("devtools")
devtools::install_github('talgalili/dendextend')

Now that we have dendextend installed, here is a second take on Arhopala's answer:

x<-1:100
dim(x)<-c(10,10)
set.seed(1)
groups<-sample(c("red","blue"), 10, replace=TRUE)
x.clust<-as.dendrogram(hclust(dist(x)))

x.clust.dend <- x.clust
x.clust.dend <- assign_values_to_leaves_edgePar(x.clust.dend, value = groups, edgePar = "col") # add the colors.
x.clust.dend <- assign_values_to_leaves_edgePar(x.clust.dend, value = 3, edgePar = "lwd") # make the lines thick
plot(x.clust.dend)

Here is the result:

[image: the resulting dendrogram]

p.s.: I personally prefer using pipes for this type of coding (which will give the same result as above, but is easier to read):

x.clust <- x %>% dist  %>% hclust %>% as.dendrogram
x.clust.dend <- x.clust %>% 
   assign_values_to_leaves_edgePar(value = groups, edgePar = "col") %>% # add the colors.
   assign_values_to_leaves_edgePar(value = 3, edgePar = "lwd") # make the lines thick
plot(x.clust.dend)
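
For a data frame shaped like the one in the question, a dendextend version might look like the sketch below. The data values are made up and `row_cols` is a name I chose. It assumes, as far as I understand the function, that `value` is consumed in leaf order (left to right in the plot) and not in row order, which is why the colour vector is reordered with `order.dendrogram()` first. It also assumes dendextend is already installed.

```r
library(dendextend)  # assumes the package is installed

# Toy stand-in for the question's data frame (values are made up).
set.seed(42)
data <- data.frame(
  FLAG = rep(c(0, 1), each = 5),
  ColA = rnorm(10), ColB = rnorm(10), ColC = rnorm(10), ColD = rnorm(10)
)

dend <- as.dendrogram(hclust(dist(data[2:5]), method = "complete"))

# One colour per row, in the original row order: red if FLAG is 1, blue if 0.
row_cols <- ifelse(data$FLAG == 1, "red", "blue")

# Assumption: values are assigned to leaves from left to right, so the
# per-row colours are put into leaf order before being passed in.
dend_col <- assign_values_to_leaves_edgePar(
  dend,
  value   = row_cols[order.dendrogram(dend)],
  edgePar = "col"
)
dend_col <- assign_values_to_leaves_edgePar(dend_col, value = 3, edgePar = "lwd")

plot(dend_col)
```

If the colours in the plot do not line up with the FLAG values of the labelled rows, the ordering assumption is the first thing to check.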
Tal Galili