I have a dataframe with a list of items and associated values. Which metric and method is best for performing the clustering?

  1. I want to create a seaborn clustermap (dendrogram plus heatmap) from this data, clustering on rows only. The plot itself works, as shown in the code below. How can I get the list of items in each cluster, i.e. each protein together with its cluster assignment? (This is similar to "Extract rows of clusters in hierarchical clustering using seaborn clustermap", but clustered on rows only, not on columns.)

  2. How do I determine which "method" and "metric" is best for my data?

data.csv example:

item,v1,v2,v3,v4,v5
A1,1,2,3,4,5
B1,2,4,6,8,10
C1,3,6,9,12,15
A1,2,3,4,5,6
B2,3,5,7,9,11
C2,4,7,10,13,16

My code for creating the clustermap:

import pandas as pd
import seaborn as sns

# Use the first column (item) as the row index
df = pd.read_csv('data.csv', index_col=0)

# Cluster rows only; keep the returned ClusterGrid so it can be saved and inspected later
g = sns.clustermap(df, col_cluster=False, cmap="coolwarm", method='ward', metric='euclidean', figsize=(40, 40))
g.savefig('plot.pdf', dpi=300)
dar102
  • Thank you for editing the question. – dar102 Jun 27 '20 at 20:25
  • Clustering is unsupervised: there are metrics that tell you whether the clusters are stable or explain more variance, but in the end the choice is quite subjective. It depends on your end goal, which you have to be clear about yourself. You can try different hierarchical clustering methods and pass the resulting linkage to clustermap through the `row_linkage=` option. – StupidWolf Jun 27 '20 at 23:11
  • @StupidWolf Thank you very much. Is there any way to check which method works best on our data (some kind of validation)? – dar102 Jun 30 '20 at 10:06
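
One common heuristic for the validation asked about in the comments above is the cophenetic correlation coefficient, which measures how well a dendrogram preserves the original pairwise distances. The sketch below is only an illustration, not a definitive answer to "which method is best": it assumes toy data shaped like data.csv (the row labels A2, B2, C2 are made up so that every label is unique), a Euclidean metric, and an arbitrary shortlist of linkage methods.

```python
import pandas as pd
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist

# Toy data shaped like data.csv; the labels A2/B2/C2 are assumed for illustration.
df = pd.DataFrame(
    [[1, 2, 3, 4, 5], [2, 4, 6, 8, 10], [3, 6, 9, 12, 15],
     [2, 3, 4, 5, 6], [3, 5, 7, 9, 11], [4, 7, 10, 13, 16]],
    index=["A1", "B1", "C1", "A2", "B2", "C2"],
    columns=["v1", "v2", "v3", "v4", "v5"],
)

# Condensed pairwise distance matrix between rows.
distances = pdist(df.values, metric="euclidean")

# Cophenetic correlation for each candidate linkage method.
# Values closer to 1 mean the tree distorts the original distances less.
scores = {}
for method in ["single", "complete", "average", "ward"]:
    Z = linkage(distances, method=method)
    c, _ = cophenet(Z, distances)
    scores[method] = float(c)

best = max(scores, key=scores.get)
for method, score in scores.items():
    print(f"{method:>8}: {score:.3f}")
print("highest cophenetic correlation:", best)
```

A high score only says the tree is faithful to the chosen distance metric; it does not say the clusters are meaningful for the end goal, so the result is best read alongside the heatmap itself. The linkage matrix for the preferred method can then be handed to `sns.clustermap` through `row_linkage=`.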

1 Answer

I just hacked this together. Is this what you want?

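import pandas as pd
import seaborn as sns

# Rebuild the example data from the question as a DataFrame
data = {'item': ['A1','B1','C1','A1','B1','C1'],
        'v1': [1.0,2.0,3.0,2.0,3.0,4.0],
        'v2': [2.0,4.0,6.0,3.0,5.0,7.0],
        'v3': [3.0,6.0,9.0,4.0,7.0,10.0],
        'v4': [4.0,8.0,12.0,5.0,9.0,13.0],
        'v5': [5.0,10.0,15.0,6.0,11.0,16.0]
        }

df = pd.DataFrame(data)

# Option 1: aggregate per item, then cluster.
# pivot_table averages rows that share an item name (A1, B1, C1 each appear twice).
heatmap_data = pd.pivot_table(df, values=['v1','v2','v3','v4','v5'],
                              index=['item'])
sns.clustermap(heatmap_data)

# Option 2: drop the item column and cluster all six rows,
# which are then labelled by their integer position (0-5).
df = df.drop(['item'], axis=1)
g = sns.clustermap(df)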

Also, see the links below for more information on this topic.

https://seaborn.pydata.org/generated/seaborn.clustermap.html

https://kite.com/python/docs/seaborn.clustermap

ASH
  • Thank you so much, but I'm looking for a seaborn clustermap :) – dar102 Jun 28 '20 at 06:21
  • Sorry, I saw clustering and thought you were referring to something else. I just updated my answer; I hope that helps. The only thing I couldn't really understand is the 'extracting rows' comment you made. – ASH Jun 29 '20 at 13:33
  • Thank you for your efforts, but this much is already done in my code. The clustermap based on rows only is also done there, with the option "col_cluster=False". What I want is a list showing which cluster each of my items belongs to after the rows have been clustered. – dar102 Jun 30 '20 at 10:01
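
The list asked for in the last comment can be obtained by cutting the row dendrogram into flat clusters with SciPy's `fcluster`. The following is a minimal sketch under stated assumptions, not the answerer's method: the toy data mirrors data.csv with made-up unique labels (A2, B2, C2), the number of clusters (`n_clusters = 3`) is chosen arbitrarily, and `plot_clustermap` is a hypothetical helper that imports seaborn lazily so the clustering part runs on its own.

```python
import pandas as pd
from scipy.cluster.hierarchy import fcluster, leaves_list, linkage

# Toy data shaped like data.csv; the labels A2/B2/C2 are assumed for illustration.
df = pd.DataFrame(
    [[1, 2, 3, 4, 5], [2, 4, 6, 8, 10], [3, 6, 9, 12, 15],
     [2, 3, 4, 5, 6], [3, 5, 7, 9, 11], [4, 7, 10, 13, 16]],
    index=["A1", "B1", "C1", "A2", "B2", "C2"],
    columns=["v1", "v2", "v3", "v4", "v5"],
)

# Row linkage with the same method and metric as the clustermap call in the question.
Z = linkage(df.values, method="ward", metric="euclidean")

# Cut the tree into a fixed number of flat clusters (the number is an assumption).
n_clusters = 3
labels = fcluster(Z, t=n_clusters, criterion="maxclust")

# One cluster label per item, indexed by item name.
clusters = pd.Series(labels, index=df.index, name="cluster")

# The same information grouped as cluster -> list of items.
members = {}
for item, label in clusters.items():
    members.setdefault(int(label), []).append(item)

# Items in dendrogram leaf order, i.e. the row order of the heatmap.
row_order = list(df.index[leaves_list(Z)])

print(clusters)
print(members)
print(row_order)


def plot_clustermap(data, row_linkage, path="plot.pdf"):
    """Hypothetical helper: draw the clustermap from a precomputed row linkage."""
    import seaborn as sns  # imported lazily; only needed for plotting

    g = sns.clustermap(data, row_linkage=row_linkage, col_cluster=False, cmap="coolwarm")
    g.savefig(path)
    return g
```

Passing the same `Z` to `row_linkage=` keeps the plotted dendrogram and the extracted labels consistent. If the figure has already been drawn without a precomputed linkage, the linkage seaborn computed is available as `g.dendrogram_row.linkage`, and the displayed row order as `g.dendrogram_row.reordered_ind`, so `fcluster` can be applied to those instead.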