35

I am using the seaborn clustermap to create clusters and visually it works great (this example produces very similar results).

However, I am having trouble figuring out how to extract the clusters programmatically. For instance, in the linked example, how could I find out that 1-1 rh, 1-1 lh, 5-1 rh, 5-1 lh make a good cluster? Visually it's easy. I have tried inspecting the data and the dendrograms, but with little success.

EDIT Code from example:

import pandas as pd
import seaborn as sns
sns.set(font="monospace")

df = sns.load_dataset("brain_networks", header=[0, 1, 2], index_col=0)
used_networks = [1, 5, 6, 7, 8, 11, 12, 13, 16, 17]
used_columns = (df.columns.get_level_values("network")
                          .astype(int)
                          .isin(used_networks))
df = df.loc[:, used_columns]

network_pal = sns.cubehelix_palette(len(used_networks),
                                    light=.9, dark=.1, reverse=True,
                                    start=1, rot=-2)
network_lut = dict(zip(map(str, used_networks), network_pal))

networks = df.columns.get_level_values("network")
network_colors = pd.Series(networks).map(network_lut)

cmap = sns.diverging_palette(h_neg=210, h_pos=350, s=90, l=30, as_cmap=True)

result = sns.clustermap(df.corr(), row_colors=network_colors, method="average",
               col_colors=network_colors, figsize=(13, 13), cmap=cmap)

How can I extract from result which models are in which clusters?

EDIT2 The result does carry a linkage in its dendrogram_col attribute, which I think would work with fcluster. But the threshold value needed to select the clusters is confusing me. I would assume that values in the heatmap that are higher than the threshold would get clustered together?

Marcel M
sedavidw

2 Answers

25

While using result.dendrogram_col.linkage or result.dendrogram_row.linkage will currently work, it seems to be an implementation detail. The safest route is to first compute the linkages explicitly and pass them to the clustermap function, which has row_linkage and col_linkage parameters just for that.

Replacing the last line in your example (result = ...) with the following code gives the same result as before, but you will also have row_linkage and col_linkage variables that you can use with fcluster etc.

import numpy as np
from scipy.spatial import distance
from scipy.cluster import hierarchy

correlations = df.corr()
correlations_array = np.asarray(correlations)

# Cluster the rows: pdist computes pairwise Euclidean distances between rows.
row_linkage = hierarchy.linkage(
    distance.pdist(correlations_array), method='average')

# Cluster the columns by transposing first.
col_linkage = hierarchy.linkage(
    distance.pdist(correlations_array.T), method='average')

# method= is omitted because the linkages are already computed.
result = sns.clustermap(correlations, row_linkage=row_linkage, col_linkage=col_linkage,
                        row_colors=network_colors, col_colors=network_colors,
                        figsize=(13, 13), cmap=cmap)

In this particular example, the code could be simplified more since the correlations array is symmetric and therefore row_linkage and col_linkage will be identical.

Note: A previous version of this answer included a call to distance.squareform, following what the code in seaborn does, but that is a bug.
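
To illustrate the fcluster step, here is a minimal sketch rather than part of the original example. It uses a small synthetic matrix in place of df.corr(); the names toy and labels and the choice of threshold are assumptions made for illustration. With criterion="distance", the threshold t is a height on the dendrogram (a distance between rows as computed by pdist), not a value read off the heatmap.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial import distance

# Hypothetical stand-in for df.corr(): two blocks of mutually similar rows.
labels = ["a1", "a2", "a3", "b1", "b2"]
toy = np.array([
    [1.00, 0.90, 0.80, 0.10, 0.00],
    [0.90, 1.00, 0.85, 0.05, 0.10],
    [0.80, 0.85, 1.00, 0.00, 0.05],
    [0.10, 0.05, 0.00, 1.00, 0.90],
    [0.00, 0.10, 0.05, 0.90, 1.00],
])

# Same construction as for row_linkage above.
row_linkage = hierarchy.linkage(distance.pdist(toy), method="average")

# Cut the tree at a chosen height. Half the height of the final merge is an
# arbitrary choice for this toy data, not a general recommendation.
threshold = 0.5 * row_linkage[-1, 2]
cluster_ids = hierarchy.fcluster(row_linkage, t=threshold, criterion="distance")

# Collect the labels belonging to each integer cluster id.
clusters = {}
for label, cid in zip(labels, cluster_ids):
    clusters.setdefault(int(cid), []).append(label)

for cid, members in sorted(clusters.items()):
    print(cid, members)
```

Lowering the threshold splits the tree into more, smaller clusters; raising it merges them. Plotting the dendrogram first is a practical way to pick a height.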

Marcel M
  • Hey @Marcel M, wouldn't you want to use a "dissimilarity matrix" instead of a correlation matrix? Like `1 - np.abs(correlations)` or something? – O.rka Jul 01 '16 at 16:18
  • 2
    @O.rka Passing correlations to `sns.clustermap()` comes from the seaborn example quoted in the question, which I just copied. Both versions compute distances between correlations, so in the end distances are in fact used, but I admit I don’t know how much sense it makes to do so (I don’t know why the seaborn example does so). In my own project, I use distances directly. – Marcel M Jul 01 '16 at 21:33
8

You probably want a new column in your dataframe with the cluster membership. I've managed to do this from assembled snippets of code stolen from all over the web:

from collections import defaultdict

import seaborn
from scipy.cluster import hierarchy

g = seaborn.clustermap(df, method='average')
# Use the row linkage so that its leaves correspond to df.index.
den = hierarchy.dendrogram(g.dendrogram_row.linkage,
                           labels=list(df.index),
                           color_threshold=0.60)

def get_cluster_classes(den, label='ivl'):
    """Group leaf labels by the colour of the dendrogram link above them."""
    cluster_idxs = defaultdict(list)
    for c, pi in zip(den['color_list'], den['icoord']):
        # pi[1:3] holds the x-positions of the link's two legs;
        # leaves are drawn at x = 5, 15, 25, ...
        for leg in pi[1:3]:
            i = (leg - 5.0) / 10.0
            if abs(i - int(i)) < 1e-5:
                cluster_idxs[c].append(int(i))

    cluster_classes = {}
    for c, l in cluster_idxs.items():
        cluster_classes[c] = [den[label][i] for i in l]

    return cluster_classes

clusters = get_cluster_classes(den)

# Map each row label to its cluster colour (None if it has no coloured cluster).
label_to_cluster = {lab: c for c, labs in clusters.items() for lab in labs}
df["cluster"] = [label_to_cluster.get(i) for i in df.index]

So this gives you a column with 'g' or 'r' for the green- or red-labeled clusters (the exact colour names depend on the SciPy version and palette). I determine my color_threshold by plotting the dendrogram and eyeballing the y-axis values.
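
Since dendrogram colours are only a plotting aid, a related approach is to cut the same linkage with scipy.cluster.hierarchy.fcluster, which returns one integer cluster id per row. The following is a minimal sketch under stated assumptions: toy_df is a made-up frame standing in for the real data, and the threshold of 1.0 is chosen by hand for it. With a clustermap, the linkage would come from g.dendrogram_row.linkage instead of being computed directly.

```python
import pandas as pd
from scipy.cluster import hierarchy

# Hypothetical data: two tight pairs of rows and one outlier.
toy_df = pd.DataFrame(
    [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0], [10.0, 10.0]],
    index=["r1", "r2", "r3", "r4", "r5"],
    columns=["x", "y"],
)

# Average linkage on the rows (Euclidean distance is the default metric).
linkage = hierarchy.linkage(toy_df.values, method="average")

# Rows whose merge height stays below t end up in the same cluster.
cluster_ids = hierarchy.fcluster(linkage, t=1.0, criterion="distance")

# Attach the ids as a new column, aligned on the original index.
membership = pd.Series(cluster_ids, index=toy_df.index, name="cluster")
toy_df["cluster"] = membership

print(toy_df)
```

Integer ids do not repeat the way a short colour cycle can, so this also works when there are more clusters than colours.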

sjc
  • 1
    This isn't going to work on bigger data where there are more groups than colours, since (for example) green will repeat itself and separate clusters with the same colour will be grouped together. – PvdL Sep 04 '17 at 13:03
  • 1
    For more details how this code works, one can see the "original" post here: `http://www.nxn.se/valent/extract-cluster-elements-by-color-in-python` – Dataman Jun 27 '19 at 08:03
  • @Dataman It's best that the original author gets credit, I surely had lost the original source by the time I had posted my snippet, and don't remember if I had made any significant changes to the original before posting. – sjc Jun 28 '19 at 17:06