Is there a way to preserve the clustering in a heatmap but reduce the number of observations?

Question

I have a data set with 90 observations (rows) across 20 columns. Using the pheatmap package, I generated a heatmap that clusters my data into two groups. It is not entirely clean, but the two main branches of the dendrogram largely separate my samples into 2 distinct groups that match my conditions. Now I want to reduce this set of 90 to a stricter set of around 20-30 observations while preserving the same clustering order shown in the pheatmap. Is there a way to do that, or is there another package that reduces my observations to a minimal set that still preserves the clustering order I see now? The code for pheatmap is:

pheatmap(mydata[rownames(df.90), ], scale = "row", clustering_distance_cols = "correlation", show_rownames = TRUE, show_colnames = TRUE, color = col, annotation = batch.annotation, cluster_cols = TRUE, fontsize_row = 8, fontsize_col = 8, clustering_method = "ward.D2", border_color = NA)

Is there any R package I am missing that can handle this, or a pheatmap option I can use to reduce the variables and run a kind of permutation test to find the minimum set of observations that still retains my clustering?

The data have genes in rows and patients in columns; the values are expression levels.

score 3 · Answer 1 · edited Jun 20 '20 at 09:12

I would like to answer my own question and would welcome feedback. I used kmeans_k=30 in pheatmap and obtained 29 clusters that still preserve the clustering of the 90 observations I had before. From there I extracted the genes in each cluster. I selected the top 5 clusters from that heatmap on either side of the observations, since these have the highest SD and can still reproduce the heatmap I need. Because I used scale="row" throughout and kept both the row and column dendrograms on, I did not want to change those settings. When I now plot these 31 genes (observations), they actually improve my row clustering and partition them into 2 groups more cleanly, as I wanted. Code for the k-means step and the new heatmap:

With kmeans_k = 30:

obj <- pheatmap(df.90, scale = "row", clustering_distance_cols = "correlation", show_rownames = TRUE, show_colnames = TRUE, color = col, annotation = batch.annotation, cluster_cols = TRUE, fontsize_row = 6, fontsize_col = 7, clustering_method = "ward.D2", border_color = NA, cellwidth = NA, cellheight = NA, kmeans_k = 30)

Retrieve the clusters and extract the observations/genes:

obj$kmeans$cluster
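As a minimal sketch of this step, the cluster assignments can be turned into one gene list per cluster with `split()`. The matrix `mat` below is simulated stand-in data (not the data from the question), `k = 5` is chosen only to keep the output small, and `silent = TRUE` just suppresses drawing.

```r
library(pheatmap)

set.seed(1)
# Simulated stand-in for df.90: 90 genes (rows) x 20 patients (columns)
mat <- matrix(rnorm(90 * 20), nrow = 90,
              dimnames = list(paste0("gene", 1:90), paste0("patient", 1:20)))

# With kmeans_k set, pheatmap runs k-means on the row-scaled matrix
# and draws the cluster centers instead of the individual rows
obj <- pheatmap(mat, scale = "row", kmeans_k = 5, silent = TRUE)

# Named integer vector: one cluster id per gene
cl <- obj$kmeans$cluster

# One character vector of gene names per cluster
genes_by_cluster <- split(names(cl), cl)
lengths(genes_by_cluster)  # cluster sizes
```

Indexing `genes_by_cluster[["2"]]`, for example, would return the gene names assigned to cluster 2.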

Obtain the top clusters and plot them with the heatmap:

pheatmap(mydata[rownames(df.31), ], scale = "row", clustering_distance_cols = "correlation", show_rownames = TRUE, show_colnames = TRUE, color = col, annotation = batch.annotation, cluster_cols = TRUE, fontsize_row = 8, fontsize_col = 8, clustering_method = "ward.D2", border_color = NA)
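The post does not say exactly how the high-SD clusters were picked, so the following is only one possible reading: rank the k-means cluster centers by their standard deviation across patients and keep the genes from the most variable clusters. The data are simulated, and `n_top` and `kmeans_k = 10` are illustration values, not the settings used above.

```r
library(pheatmap)

set.seed(1)
# Simulated stand-in data: 90 genes x 20 patients
mat <- matrix(rnorm(90 * 20), nrow = 90,
              dimnames = list(paste0("gene", 1:90), paste0("patient", 1:20)))
# Make 30 genes differ between two patient groups so some clusters carry signal
mat[1:30, 1:10] <- mat[1:30, 1:10] + 4

obj <- pheatmap(mat, scale = "row", kmeans_k = 10, silent = TRUE)

# SD of each cluster center across patients (centers are on the row-scaled scale);
# row i of the centers matrix corresponds to cluster id i
center_sd <- apply(obj$kmeans$centers, 1, sd)

# Keep the n_top most variable clusters (arbitrary illustration value)
n_top <- 3
top_clusters <- order(center_sd, decreasing = TRUE)[seq_len(n_top)]

# Genes that belong to the retained clusters
cl <- obj$kmeans$cluster
keep_genes <- names(cl)[cl %in% top_clusters]
length(keep_genes)
```

`mat[keep_genes, ]` could then be passed to the same pheatmap call as the reduced set. One caveat of this criterion: after row scaling, a cluster containing a single gene has a center SD of exactly 1, so very small clusters rank high; checking `table(cl)` first may help.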

What do you think of this approach? It is not quite what I originally intended, but I do not think it is wrong either. I would appreciate feedback if someone can suggest a better method or thinks this one is incorrect. Thanks.
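One way to check whether a reduced gene set keeps the two-group split of the samples is to cut the column dendrogram into 2 groups for the full and the reduced set and cross-tabulate the results. The sketch below uses base R only and simulated data; `col_groups` is a hypothetical helper, and it assumes the same choices as the calls above (row scaling, correlation distance between columns, `ward.D2`).

```r
set.seed(1)
# Simulated stand-in data: 90 genes x 20 patients, 30 genes separate two patient groups
mat <- matrix(rnorm(90 * 20), nrow = 90,
              dimnames = list(paste0("gene", 1:90), paste0("patient", 1:20)))
mat[1:30, 1:10] <- mat[1:30, 1:10] + 4

# Hypothetical helper: assign each column (patient) to one of k groups
col_groups <- function(m, k = 2) {
  z <- t(scale(t(m)))        # scale each row, as scale = "row" does
  d <- as.dist(1 - cor(z))   # correlation distance between columns
  cutree(hclust(d, method = "ward.D2"), k = k)
}

full    <- col_groups(mat)          # all 90 genes
reduced <- col_groups(mat[1:30, ])  # stand-in for a reduced gene set

# Agreement between the two partitions; group labels themselves are arbitrary
table(full = full, reduced = reduced)
```

If the reduced set preserves the split, each row of the table has all of its counts in a single column.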

This is great! @vchris_ngs, what if you'd like to retrieve and extract the observations/genes from your clusters using ```cutree_rows = 3``` instead of ```kmeans_k = 3```? Would it be something like ```obj2$tree_row$...```? — Ecg, Dec 16 '20 at 22:57
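A possible way to do what this comment asks, offered as a sketch on simulated data: the `tree_row` element returned by pheatmap is an `hclust` object, so `cutree()` from base R can cut it into the same number of groups. The name `obj2` follows the comment; `mat` is stand-in data.

```r
library(pheatmap)

set.seed(1)
# Simulated stand-in data: 90 genes x 20 patients
mat <- matrix(rnorm(90 * 20), nrow = 90,
              dimnames = list(paste0("gene", 1:90), paste0("patient", 1:20)))

obj2 <- pheatmap(mat, scale = "row", clustering_method = "ward.D2",
                 cutree_rows = 3, silent = TRUE)

# Cut the row dendrogram into 3 groups; result is a named vector, one group id per gene
row_groups <- cutree(obj2$tree_row, k = 3)

# Gene names per group
genes_by_group <- split(names(row_groups), row_groups)
lengths(genes_by_group)
```

The group ids from `cutree()` are not guaranteed to follow the top-to-bottom order of the blocks in the drawn heatmap; `obj2$tree_row$order` gives the display order of the rows if that mapping is needed.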
