I have a matrix of pairwise differences between samples. I would like to label each sample with the cluster it belongs to, where clusters are defined by an absolute cutoff on the matrix values (e.g. all samples with a pairwise difference of zero from each other) and are named in order of cluster size.
Mock data:
# Load packages
import numpy as np
import pandas as pd
import seaborn as sns
## Generate fake data
# matrix
d = {'sample_A': [0,2,0,1,1,2,2,1], 'sample_B': [2,0,2,3,3,0,0,3], 'sample_C': [0,2,0,1,1,2,2,1], 'sample_D': [1,3,1,0,2,3,3,1],
'sample_E': [1,3,1,2,0,3,3,1], 'sample_F': [2,0,2,3,3,0,0,3], 'sample_G': [2,0,2,3,3,0,0,3], 'sample_H': [1,3,1,1,1,3,3,0]}
idx = ["sample_A","sample_B","sample_C","sample_D","sample_E", "sample_F", "sample_G", "sample_H"]
df = pd.DataFrame(data=d,index=idx)
df
# Visualise heatmap (this isn't directly needed for this output)
g = sns.clustermap(df, cmap="coolwarm_r")
g
# Desired output
d = {'cluster_zero': [2,1,2,"NA","NA",1,1,"NA"]}
df3 = pd.DataFrame(data=d,index=idx)
df3
So the output labels each sample as belonging to a cluster defined by zero pairwise difference in the matrix, with clusters numbered in order of size from largest to smallest. In this case, samples B, F and G all have zero differences from each other, so they are put in cluster 1. Samples A and C also have zero differences from each other, and since that group is smaller than B/F/G, they become cluster 2. No other samples have zero differences in this example, so the remaining samples don't get a cluster.
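To make this concrete, the kind of approach I imagine is treating two samples as connected whenever their pairwise difference is within the cutoff and then taking connected components, something like the sketch below (the function name label_clusters and the generic "cluster" column name are just made up for illustration, and I'm not sure this is the right or most idiomatic way):
# Minimal sketch of the idea: cluster samples whose pairwise difference is within a cutoff
from scipy.sparse.csgraph import connected_components

def label_clusters(dist_df, threshold=0):
    # Two samples are "connected" if their pairwise difference is <= threshold
    adjacency = (dist_df.values <= threshold).astype(int)
    n_components, component_ids = connected_components(adjacency, directed=False)
    # Count how many samples fall in each connected component (sorted largest first)
    sizes = pd.Series(component_ids).value_counts()
    # Only components with at least two samples get a cluster number,
    # numbered 1, 2, ... from largest to smallest
    ranked = {comp: i + 1 for i, comp in enumerate(sizes[sizes > 1].index)}
    labels = [ranked.get(comp, "NA") for comp in component_ids]
    return pd.DataFrame({"cluster": labels}, index=dist_df.index)

label_clusters(df)   # threshold=0 should reproduce the df3-style labels above
One thing I'm not sure about with this kind of approach is chaining: with a threshold above zero, two samples would share a cluster if they are linked through an intermediate sample, even if they differ from each other by more than the threshold, and equally sized clusters would be numbered in an arbitrary order.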
Ideally, I would like to be able to control the difference threshold used to define clusters, e.g. run the script again with a threshold of <1 or <2 instead of exactly zero.
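With the hypothetical sketch above, that would just mean passing a different argument (and since the differences here are integers, a strict cutoff of <1 is the same as <=0, and <2 the same as <=1):
# Hypothetical usage of the label_clusters sketch with looser cutoffs
label_clusters(df, threshold=1)   # samples differing by at most 1 share a cluster
label_clusters(df, threshold=2)   # samples differing by at most 2 share a cluster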
There are various similar questions (e.g. Extracting clusters from seaborn clustermap), but they seem to rely on calculated distance metrics rather than the absolute values already in the matrix. Another similar question is generating numerical clusters from matrix values of a minimal size, but that one counts the size of each cluster, which is different from the output I want.
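For what it's worth, I suspect the absolute values could also be fed into that kind of hierarchical approach as a precomputed distance matrix, along these lines, although it would still leave the size-ordered naming and the NA for singletons to be done afterwards (so I may be missing something):
# Possible alternative: treat the matrix itself as precomputed distances
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

condensed = squareform(df.values.astype(float))    # assumes a symmetric matrix with zero diagonal, as in the mock data
Z = linkage(condensed, method="single")            # single linkage chains samples within the cutoff
flat_ids = fcluster(Z, t=0, criterion="distance")  # cut at the absolute threshold (0 here)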
Thanks for your help.