I am working with 1800 observations that I want to classify. I run a hierarchical clustering, plot the result as a dendrogram, and identify three groups. The problem is the visualization: the plot is not readable because so many leaves overlap at the bottom. The labels are numbers, but I don't know how to make them more readable. I have tried two options and neither works.

Option 1:

    m <- as.matrix(dtm)
    distMatrix <- dist(m, method = "euclidean")
    groups <- hclust(distMatrix, method = "ward.D")
    clustering <- cutree(groups, 3)

    plot(groups, hang = -100, cex = 1, labels = FALSE)
    rect.hclust(groups, k = 3)

Option 2:

    library(factoextra)  # provides fviz_dend()

    m <- as.matrix(dtm)
    distMatrix <- dist(m, method = "euclidean")
    groups <- hclust(distMatrix, method = "ward.D")

    fviz_dend(groups, cex = 0.8, lwd = 0.8, k = 3,
              rect = TRUE,
              k_colors = "jco",
              rect_border = "jco",
              rect_fill = TRUE,
              ggtheme = theme_gray(), labels = FALSE)

How can I represent the dendrogram without so much overlapping data appearing at the bottom? It looks very confusing with so much data together.

David Perea
  • Well, what exactly do you want to happen? If you don't know what you want, then this really isn't a specific programming question that's appropriate for Stack Overflow. If you want general data visualization advice, then that might be more appropriate for [stats.se], as that's listed as [on topic](https://stats.stackexchange.com/help/on-topic) there. At the very least you should include a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample data so that possible solutions can be tested. – MrFlick May 11 '22 at 14:38
  • I have added what I want to achieve to the description: how can I represent the dendrogram without so much overlapping data appearing at the bottom? It looks very confusing with so much data together. – David Perea May 11 '22 at 14:42
  • It depends how you want to present it. If html, for example, you could rotate it and show a long visualisation like [this](https://www.quantumjitter.com/project/hansard/) which uses Rmarkdown. – Carl May 11 '22 at 14:57
  • Are you saying all those things at the bottom are labels that you want to read? How many of them are there? It doesn't seem like any reasonably sized image would allow you to read every one of those labels. It's unclear exactly what type of conclusion you want to draw from this image. Do you just want to hide the labels completely? Or what exactly do you want? – MrFlick May 11 '22 at 14:58

1 Answer

Two things might help: make the y-axis log-scale, and reduce line thickness.

The former is easy, but changing the line thickness of an existing ggplot object is fiddly.

Below is an example of what I did in a recent analysis. I didn't use the fviz_dend function; instead I used as.dendrogram followed by ggplot().

If you want to work with your existing fviz plot, you could change the line thickness using the same method.

Also, with a large number of leaves, you might as well hide the labels (for example, with expand = c(0, 0) in scale_y_continuous()).
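For a quick look in base R only, the two ideas of compressing the heights and dropping the labels can be sketched as follows. This is an illustration, not the code from my analysis: the data are simulated (an assumed 1800 x 5 numeric matrix standing in for the real one), and log1p() is used here as one possible monotone transform of the merge heights.

```r
# Simulated stand-in for the real data (assumption: 1800 rows, 5 numeric columns)
set.seed(42)
x <- matrix(rnorm(1800 * 5), ncol = 5)

hc <- hclust(dist(x, method = "euclidean"), method = "ward.D")

# Compress the tall upper merges. log1p() is monotone increasing,
# so the order of the merges (and therefore the clusters) is unchanged.
hc_log <- hc
hc_log$height <- log1p(hc$height)

# Draw without leaf labels so the bottom of the tree is not a solid band of text
plot(hc_log, labels = FALSE, hang = -1,
     main = "Ward.D clustering, log1p(height)", xlab = "", sub = "")
rect.hclust(hc_log, k = 3)
```

Because only the heights are rescaled, cutree(hc_log, k = 3) gives the same assignment as cutree(hc, k = 3); the y-axis, however, no longer shows the original Ward distances.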


Calculate the hierarchical clustering:

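    library(RColorBrewer)  # brewer.pal()
    library(dendextend)    # color_branches() and the ggplot() method for dendrograms
    library(ggplot2)       # themes and scales used in the plots

    n <- 4  # number of clusters
    # `data` is your own numeric matrix or data frame;
    # Minkowski distance with p = 2 is the Euclidean distance
    hdata <- hclust(dist(data, "minkowski", p = 2), method = "ward.D")
    clusters <- cutree(hdata, k = n)
    # vector of up to 16 different colours
    col_vector <- c(brewer.pal(n = 10, "Paired"), brewer.pal(n = 6, "Set2"))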

Plot before:

    # theme.text is not defined in this answer (presumably a custom theme object);
    # drop it or substitute your own theme
    hdata %>%
      as.dendrogram() %>%
      color_branches(k = n, col = col_vector) %>%
      ggplot() + theme_classic() + theme.text +
      theme(panel.grid.major.y = element_line(), axis.title = element_blank(),
            axis.title.y = element_blank(), axis.text.x = element_blank(),
            axis.ticks.x = element_blank()) +
      scale_y_continuous(expand = c(0, 0)) +
      scale_x_continuous(expand = c(0.001, 0.001)) +
      labs(y = "")

Plot after:

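    # theme.text is not defined in this answer (presumably a custom theme object);
    # drop it or substitute your own theme
    b <- hdata %>%
      as.dendrogram() %>%
      color_branches(k = n, col = col_vector) %>%
      ggplot() + theme_classic() + theme.text +
      theme(panel.grid.major.y = element_line(), axis.title = element_blank(),
            axis.title.y = element_blank(), axis.text.x = element_blank(),
            axis.ticks.x = element_blank()) +
      scale_y_log10() +  # log-scale heights
      scale_x_continuous(expand = c(0.001, 0.001)) +
      labs(y = "")

    # Adjust the line thickness on the built plot
    b <- ggplot_build(b)
    b$data[[1]]$size <- 0.2  # on ggplot2 >= 3.4.0 this column may be named linewidth instead
    b <- ggplot_gtable(b)
    plot(b)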

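If only the three groups matter rather than the individual leaves, another option (not what I did above, just a sketch) is to draw only the top of the tree with base R's cut() method for dendrograms. The data below are simulated (an assumed 1800 x 5 matrix), and the cut height is taken midway between the two merges that separate three clusters from two.

```r
# Simulated stand-in for the real data (assumption: 1800 rows, 5 numeric columns)
set.seed(42)
x <- matrix(rnorm(1800 * 5), ncol = 5)

hc <- hclust(dist(x), method = "ward.D")
dend <- as.dendrogram(hc)

# hc$height is sorted; its last entry is the merge from 2 clusters to 1.
# A height between the merge that leaves 3 clusters and the merge that
# leaves 2 clusters separates the tree into exactly 3 branches.
n_merges <- length(hc$height)
h_cut <- mean(hc$height[c(n_merges - 2, n_merges - 1)])

parts <- cut(dend, h = h_cut)

# parts$upper is a small tree whose leaves stand for the three groups
plot(parts$upper, main = "Top of the tree (3 groups)")

# Number of observations below each of those leaves
sizes <- sapply(parts$lower, function(d) attr(d, "members"))
print(sizes)
```

Each element of parts$lower is itself a dendrogram, so a single group can still be plotted on its own when more detail is needed.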

VitaminB16