
I would like to calculate how well my cluster analysis solution fits the actual distance scores. To do that, I need to extract the distances between the stimuli I am clustering. I know that I can read a distance off the dendrogram: for example, the distance between 5 and 14 is .219 (the height at which they are connected). Is there an automatic way of extracting these distances from the information in the hclust object?

List of 7
 $ merge      : int [1:14, 1:2] -5 -1 -6 -4 -10 -2 1 -9 -12 -3 ...
 $ height     : num [1:14] 0.219 0.228 0.245 0.266 0.31 ...
 $ order      : int [1:15] 3 11 5 14 4 1 8 12 10 15 ...
 $ labels     : chr [1:15] "1" "2" "3" "4" ...
 $ method     : chr "ward.D"
 $ call       : language hclust(d = as.dist(full_naive_eucAll, diag = F, upper = F), method = "ward.D")
 $ dist.method: NULL
 - attr(*, "class")= chr "hclust"
Tal Galili
Esther

1 Answer


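Yes. You are asking about the cophenetic distance: the height in the dendrogram at which two observations are first merged into the same cluster. In R, cophenetic() computes it for every pair directly from the hclust object, and its correlation with the original distances measures how well the tree fits them: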

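d_USArrests <- dist(USArrests)        # original pairwise distances
hc <- hclust(d_USArrests, "ave")      # average-linkage clustering
par(mfrow = c(1, 2))                  # two plots side by side
plot(hc)                              # dendrogram
plot(cophenetic(hc) ~ d_USArrests)    # cophenetic vs. original distances
cor(cophenetic(hc), d_USArrests)      # cophenetic correlation (goodness of fit)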

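
To look up the distance for one specific pair (as in the question), a minimal sketch using the same USArrests example: converting the result of cophenetic() to a matrix lets you index by label. The variable name coph_mat and the pair of states are only illustrative choices.

```r
d_USArrests <- dist(USArrests)
hc <- hclust(d_USArrests, "ave")

# cophenetic() returns a "dist" object; as.matrix() makes it indexable by label
coph_mat <- as.matrix(cophenetic(hc))

# Cophenetic distance between two specific observations
coph_mat["Alabama", "Alaska"]

# Every cophenetic distance is one of the merge heights stored in hc$height
all(coph_mat[upper.tri(coph_mat)] %in% hc$height)
```

With your own hclust object, the labels would be those in its labels component (for example "5" and "14").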

The same approach can also be used to compare two hierarchical clustering results. It is implemented in the dendextend R package as cor_cophenetic() (the function makes sure the two distance matrices are ordered to match). For example:

# install.packages('dendextend')
library("dendextend")

d_USArrests <- dist(USArrests)
hc1 <- hclust(d_USArrests, "ave")
hc2 <- hclust(d_USArrests, "single")
cor_cophenetic(hc1, hc2)
#  0.587977
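
For intuition, here is a rough base-R sketch of the same comparison, not dendextend's actual implementation. It relies on the assumption that both trees were built from the same dist object, so cophenetic() returns the pairs in the same order for both and no reordering is needed. The name manual_cor is hypothetical.

```r
d_USArrests <- dist(USArrests)
hc1 <- hclust(d_USArrests, "ave")
hc2 <- hclust(d_USArrests, "single")

# Cophenetic distances of each tree, flattened to plain numeric vectors
coph1 <- as.vector(cophenetic(hc1))
coph2 <- as.vector(cophenetic(hc2))

# Pearson correlation between the two sets of cophenetic distances
manual_cor <- cor(coph1, coph2)
manual_cor
```

If the two trees came from differently ordered inputs, the distances would have to be matched by label first, which is the step cor_cophenetic() takes care of.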
Tal Galili
  • Is there a reason the cophenetic distance is discrete (i.e., many values at ~55, 75, 85, 151, but none in between)? – Esther Feb 17 '16 at 21:54
  • I'm going to answer my own question: it's because the cophenetic distance only takes into account the height of the link that merges two objects. – Esther Feb 17 '16 at 22:08
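
The discreteness noted in the comments can be checked directly. A small sketch on the USArrests example (variable names are illustrative): with n observations there are n(n-1)/2 pairwise distances, but the tree has only n-1 merges, so the cophenetic distances can take at most n-1 distinct values.

```r
d_USArrests <- dist(USArrests)
hc <- hclust(d_USArrests, "ave")
coph_d <- as.vector(cophenetic(hc))
n <- length(hc$labels)

length(d_USArrests)        # number of pairs: n * (n - 1) / 2
length(unique(coph_d))     # distinct cophenetic values: at most n - 1

# The distinct cophenetic values are exactly the merge heights of the tree
setequal(unique(coph_d), hc$height)
```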