
I ran a hierarchical cluster analysis for a project. I have 300 observations of 20 variables each. I indexed every variable so that it lies between 0 and 1, with a larger value being better.

I used the following code to create a cluster plot.

d_data <- dist(all_data[,-1])
d_data_ind <- dist(data_ind[,-1])
hc_data_ind <- hclust(d_data_ind, method = "complete")
dend<- as.dendrogram(hc_data_ind)
plot(dend)

Right now the labels of the nodes are the row names, i.e. the numbers 1 to 300 (see top pic). During the analysis, I removed the first column of the data frame, which is labeled "geography" (see bottom pic), because it holds city names as text and would screw up the analysis. But I really need to get the city names onto the cluster plot in their right spots, because I need to choose a list of cities based on the results.

What code should I write to insert the city names in the "geography" column into this plot, corresponding to their row names?

As you can see from the data frame (bottom pic), all the city names are in alphabetical order, neatly in ascending order, just like the row names. I'm sure it isn't hard to put the city names onto the plot, I just can't find it by googling and asking around.

How to alter the label of the nodes? Right now it's numbers but I need them to be cities.

Tal Galili
Elan
  • Please get used to providing reproducible code, ready to copy-paste-run, to make it easier for visitors & readers. (E.g. `all_data` is not given; screenshots of data sets are not helpful; providing the result of `dput(my_data)` is the way to go.) – lukeA Apr 06 '16 at 22:22
  • thanks for the advice, I will practice that in the future – Elan Apr 07 '16 at 03:36
  • [Why not improve your question now](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example)? – Jaap Apr 07 '16 at 06:58

2 Answers


I think that what you are asking is "how can I decide on the labels in a dendrogram". There are two parts to this. As an example, let's use the simple vector c(1,2,5,6).

1) When you create the hclust from a dist object, it takes the labels from the names of the items. If there are no names, it uses a running index. For example:

x <- c(1,2,5,6)
d1 <- as.dendrogram(hclust(dist(x)))
plot(d1)


This is obviously a problem since the items we have are 1,2,5,6 and not 1:4! So how can we fix this? One way is to update the names. For example:

x <- c(1,2,5,6)
names(x) <- x
x
d2 <- as.dendrogram(hclust(dist(x)))
plot(d2)


I believe this basically solves your problem (and frankly, doesn't require dendextend). But if you want to update the text AFTER creating the dendrogram - read on:
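
Applied to a data frame shaped like the one in the question, the same idea might look like the sketch below. Note that `data_ind` here is a made-up stand-in (five invented cities, two random variables), not the asker's real data, and `num_data` is just a name I picked. It also assumes the city names are unique, since row names cannot contain duplicates.

```r
# Hypothetical stand-in for the asker's data: one text column plus indexed variables
set.seed(1)
data_ind <- data.frame(
  geography = c("Austin", "Boston", "Chicago", "Denver", "El Paso"),
  v1 = runif(5),
  v2 = runif(5),
  stringsAsFactors = FALSE
)

# Drop the text column from the numeric data, but keep the cities as row names
num_data <- data_ind[, -1]
rownames(num_data) <- data_ind$geography

# dist() carries the row names along, and hclust() uses them as leaf labels
hc_data_ind <- hclust(dist(num_data), method = "complete")
plot(as.dendrogram(hc_data_ind))
```

Because the names travel with the rows, there is no need to think about the leaf order here.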

2) The dendextend package allows you to update the labels of a dendrogram. But you need to make sure you are using the correct order (since the order of the original vector, and that of the labels in the tree are not the same!). Here is how it can be done:

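# Install dendextend on first use (the package name must be quoted)
if (!require(dendextend)) install.packages("dendextend")
library(dendextend)
x <- c(1,2,5,6)
d3 <- as.dendrogram(hclust(dist(x)))
# Reorder the values to match the leaf order before using them as labels
labels(d3) <- x[order.dendrogram(d3)]
plot(d3)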

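
The point about ordering can be checked in base R alone. This is a minimal sketch with made-up values and made-up city names (`city`, `leaf_order` and `leaf_labels` are names I chose for illustration): the input is deliberately unsorted, so the left-to-right leaf order differs from the input order, and a label vector given in input order has to be reindexed before it is assigned.

```r
# Made-up values, deliberately not sorted
x <- c(10, 1, 11, 2)
city <- c("Paris", "Oslo", "Rome", "Bern")  # hypothetical labels, in input order

d <- as.dendrogram(hclust(dist(x)))

# For each leaf (left to right), its position in the original vector
leaf_order <- order.dendrogram(d)

# Labels rearranged to match the leaves; this is the vector to assign as labels
leaf_labels <- city[leaf_order]

# Side-by-side check: each leaf's value next to the label it would receive
data.frame(value = x[leaf_order], label = leaf_labels)
```

Assigning `city` without the `[leaf_order]` step would attach the names to the wrong leaves.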

Here is how we would do it for a more complex data object (where we may not want to play with the row names of the object, but to update the dendrogram):

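if (!require(dendextend)) install.packages("dendextend")
library(dendextend)
x <- CO2[,4:5]
d4 <- as.dendrogram(hclust(dist(x)))
# Build one label per row from the first three columns, then reorder to match the leaves
labels(d4) <- apply(CO2[,1:3], 1, paste, collapse = "_")[order.dendrogram(d4)]

d4 <- set(d4, "labels_cex", 0.6)   # smaller label text
d4 <- color_branches(d4, k = 3)    # color the branches of 3 clusters
par(mar = c(3,0,0,6))              # leave room on the right for the labels
plot(d4, horiz = TRUE)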

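
For a data frame with a "geography" column like the one in the question, the same pattern could be sketched as follows. The data frame is again an invented stand-in (six made-up cities, three random variables), and `leaf_order` and `new_labels` are illustrative names. The label assignment relies on dendextend, so it is guarded here and only runs if the package is installed.

```r
# Hypothetical stand-in for the asker's data frame
set.seed(42)
data_ind <- data.frame(
  geography = c("Austin", "Boston", "Chicago", "Denver", "El Paso", "Fresno"),
  v1 = runif(6), v2 = runif(6), v3 = runif(6),
  stringsAsFactors = FALSE
)

# Cluster on the numeric columns only, as in the question
dend <- as.dendrogram(hclust(dist(data_ind[, -1]), method = "complete"))

# City names rearranged into the left-to-right order of the leaves
leaf_order <- order.dendrogram(dend)
new_labels <- data_ind$geography[leaf_order]

# labels<- for dendrograms comes from dendextend; without it the row numbers stay
if (requireNamespace("dendextend", quietly = TRUE)) {
  library(dendextend)
  labels(dend) <- new_labels
}
plot(dend)
```

With 300 cities the labels will be crowded, so shrinking them (as with "labels_cex" above) is probably worth doing as well.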

Tal Galili

You want the original labels instead of IDs? Maybe this helps you with your analysis:

data <- USArrests[1:5, ]
data <- cbind(label=row.names(data), data)
row.names(data) <- NULL
d <- dist(data[, -1])
hc <- hclust(d)
plot(hc)
rect.hclust(hc, h=40)


data$label[order.dendrogram(as.dendrogram(hc))]
# [1] "Arkansas"   "Arizona"    "California" "Alabama"    "Alaska"  

clusters <- cutree(hc, h=40)
split(data$label, clusters)
# $`1`
# [1] "Alabama" "Alaska" 
# 
# $`2`
# [1] "Arizona"    "California"
# 
# $`3`
# [1] "Arkansas"

hc$labels <- data$label
plot(hc)

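
If the goal is to pick a list of cities per cluster, it may be handy that once the tree carries labels, `cutree()` returns a vector named by those labels. A small sketch using the same five USArrests rows, here with the row names left in place so they become the labels directly (`members` is just an illustrative name):

```r
# Row names of the data become hc$labels
hc <- hclust(dist(USArrests[1:5, ]))

# Named integer vector: names are the labels, values are the cluster ids
clusters <- cutree(hc, h = 40)

# One row per label, sorted by cluster, easy to filter or export
members <- data.frame(label = names(clusters), cluster = unname(clusters))
members[order(members$cluster), ]
```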

PS: I found it helpful to save dendrograms to pdf, where you can zoom in and out easily: `pdf("my.pdf"); plot(hc); dev.off()`.
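
With many leaves, a wider page and smaller labels may help. A rough sketch, using all 50 USArrests rows as a stand-in for a larger data set; the output path and the width, height and cex values are arbitrary choices of mine to adjust as needed:

```r
# 50 labelled leaves as a stand-in for a bigger tree
hc <- hclust(dist(USArrests))

# Hypothetical output location; replace with your own file name
out_file <- file.path(tempdir(), "dendrogram.pdf")

pdf(out_file, width = 20, height = 8)  # page size in inches; widen for more leaves
plot(hc, cex = 0.6)                    # cex scales the leaf labels
dev.off()
```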

lukeA
  • tried this solution and returned error...ended up using a sketchbook to manually enter the character values according to the row numbers lol, but I'll poke around more when I have time – Elan Apr 07 '16 at 03:37
  • _"tried this solution and returned error.."_ - what exactly did you try, what error message did it return? You should edit your post and add the data plus the full code to reproduce your problem. Otherwise it's not possible to help. – lukeA Apr 07 '16 at 04:31