19

I want to color my clusters with a color map that I made in the form of a dictionary (i.e. {leaf: color}).

I've tried following https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/ but the colors get messed up for some reason. The default plot looks good; I just want to assign those colors differently. I saw that there is a link_color_func parameter, but when I tried passing my color map (the D_leaf_colors dictionary) I got an error b/c it isn't a function. I created D_leaf_colors to customize the colors of the leaves associated with particular clusters. In my actual dataset the colors mean something, so I'm steering away from arbitrary color assignments.

I don't want to use color_threshold b/c in my actual data I have way more clusters and SciPy repeats the colors, hence this question.

How can I use my leaf-color dictionary to customize the color of my dendrogram clusters?

I made a GitHub issue https://github.com/scipy/scipy/issues/6346 where I further elaborated on the approach to color the leaves from "Interpreting the output of SciPy's hierarchical clustering dendrogram? (maybe found a bug...)", but I still can't figure out how to either: (i) use the dendrogram output to reconstruct my dendrogram with my specified color dictionary or (ii) reformat my D_leaf_colors dictionary for the link_color_func parameter.

# Init
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

# Load data
from sklearn.datasets import load_diabetes

# Clustering
from scipy.cluster.hierarchy import dendrogram, fcluster, leaves_list
from scipy.spatial import distance
from fastcluster import linkage # You can use SciPy one too

%matplotlib inline

# Dataset
A_data = load_diabetes().data
DF_diabetes = pd.DataFrame(A_data, columns = ["attr_%d" % j for j in range(A_data.shape[1])])

# Absolute value of correlation matrix, then subtract from 1 for dissimilarity
DF_dism = 1 - np.abs(DF_diabetes.corr())

# Compute average linkage
A_dist = distance.squareform(DF_dism.values)  # .as_matrix() was removed in pandas 1.0
Z = linkage(A_dist,method="average")

# Color mapping
D_leaf_colors = {"attr_1": "#808080", # Unclustered gray

                 "attr_4": "#B061FF", # Cluster 1 indigo
                 "attr_5": "#B061FF",
                 "attr_2": "#B061FF",
                 "attr_8": "#B061FF",
                 "attr_6": "#B061FF",
                 "attr_7": "#B061FF",

                 "attr_0": "#61ffff", # Cluster 2 cyan
                 "attr_3": "#61ffff",
                 "attr_9": "#61ffff",
                 }

# Dendrogram
# The coloring shown below was produced with `color_threshold=0.7`
D = dendrogram(Z=Z, labels=DF_dism.index, color_threshold=None, leaf_font_size=12, leaf_rotation=45, link_color_func=D_leaf_colors)
# TypeError: 'dict' object is not callable


I also tried the approach from "how do I get the subtrees of dendrogram made by scipy.cluster.hierarchy".

O.rka
  • I can't tell from your description what you want the resulting dendrogram to look like *in general* (i.e., for an arbitrary leaf color dictionary). As far as I can tell, it doesn't make sense to specify colors in terms of leaves alone, because you have no guarantee that the leaves you give the same color will be near each other in the dendrogram. The things in the dendrogram that are colored are not leaves; they are the links between clusters. Did you somehow generate your `leaf_colors` from the clusters? If so, can't you instead generate the linkage colors from the clusters? – BrenBarn Jul 05 '16 at 16:54
  • This is true but the way I made the leaf color dictionary is by using fcluster to get the actual clusters – O.rka Jul 05 '16 at 21:05
  • But can't you instead use similar logic to get the linkages and specify colors in terms of those? You can't get the colors just on the basis of `fcluster`, because `fcluster` only returns *flat* clusters and throws away the information about the lower-level clusters. You need the full linkage structure. – BrenBarn Jul 05 '16 at 21:07
  • From `fcluster` I get an array of length `n` where `n` is the amount of samples I'm clustering. Each index of that array has the cluster number. I iterate through that array and the original labels at the same time to assign the samples to clusters. – O.rka Jul 05 '16 at 21:14
  • 1
    Right, but do you see that the dendrogram includes much more information than that? The dendrogram doesn't just indicate a single flat set of clusters. It shows the complete "history" of when each cluster was merged with each other cluster. Each arch represents the joining of two clusters, so whatever coloring information you give has to provide information about pairs of clusters, not just individual "root" clusters or individual leaf nodes. If you only care about the final clusters, you may not even need to use a dendrogram at all. – BrenBarn Jul 05 '16 at 21:17
  • Yes, that's true. I wanted to pair it with the dendrogram for visual purposes. For me it's easiest to visualize in terms of leaf labels and to work backwards from the clusters generated from a distance cutoff. I understand how the algorithm works, but to use the links instead of the labels from the beginning I would have to redo a lot of my wrappers. In hindsight, I should have just started that way. I converted the leaf color dict to a link color dict with Ulrich's help below – O.rka Jul 07 '16 at 14:51
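
The point made in the comments above (flat cluster labels versus the full merge history) can be sketched without SciPy. This is only an illustration under stated assumptions: `Z_toy` is a hand-written stand-in for the matrix that `linkage()` returns, its distances are made up, and `leaves_per_cluster` is a hypothetical helper name, not part of SciPy.

```python
def leaves_per_cluster(Z):
    """Return {node id: [leaf ids under it]} for every node of the tree.

    Leaves have ids 0 .. n-1; row i of Z creates the cluster with id n + i.
    """
    n_leaves = len(Z) + 1
    members = {leaf: [leaf] for leaf in range(n_leaves)}
    for i, row in enumerate(Z):
        left, right = int(row[0]), int(row[1])
        members[n_leaves + i] = members[left] + members[right]
    return members


# Hand-written linkage matrix for 4 leaves (made-up distances).
# Columns: child 1, child 2, merge distance, number of leaves in the new cluster.
Z_toy = [[0, 1, 0.1, 2],   # leaves 0 and 1   -> cluster 4
         [2, 3, 0.2, 2],   # leaves 2 and 3   -> cluster 5
         [4, 5, 0.9, 4]]   # clusters 4 and 5 -> cluster 6 (the root)

members_toy = leaves_per_cluster(Z_toy)
for node in (4, 5, 6):
    print(node, members_toy[node])
```

A flat clustering at a cutoff of, say, 0.5 would only report clusters 4 and 5 here. The dendrogram also draws the link for cluster 6, which is why link_color_func is called with cluster ids rather than leaf labels.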

4 Answers4

15

Here is a solution that uses the return matrix Z of linkage() (described early but a little hidden in the docs) and link_color_func:

# see question for code prior to "color mapping"

# Color mapping
dflt_col = "#808080"   # Unclustered gray
D_leaf_colors = {"attr_1": dflt_col,

                 "attr_4": "#B061FF", # Cluster 1 indigo
                 "attr_5": "#B061FF",
                 "attr_2": "#B061FF",
                 "attr_8": "#B061FF",
                 "attr_6": "#B061FF",
                 "attr_7": "#B061FF",

                 "attr_0": "#61ffff", # Cluster 2 cyan
                 "attr_3": "#61ffff",
                 "attr_9": "#61ffff",
                 }

# notes:
# * rows in Z correspond to "inverted U" links that connect clusters
# * rows are ordered by increasing distance
# * if the colors of the connected clusters match, use that color for link
link_cols = {}
for i, i12 in enumerate(Z[:,:2].astype(int)):
  c1, c2 = (link_cols[x] if x > len(Z) else D_leaf_colors["attr_%d"%x]
    for x in i12)
  link_cols[i+1+len(Z)] = c1 if c1 == c2 else dflt_col

# Dendrogram
D = dendrogram(Z=Z, labels=DF_dism.index, color_threshold=None,
  leaf_font_size=12, leaf_rotation=45, link_color_func=lambda x: link_cols[x])

Here is the output: [dendrogram image]
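
The loop above looks colors up via "attr_%d", which only works when the labels follow that pattern. A possible generalization for arbitrary labels (such as the "F8_2/3/13_Pre" style mentioned in the comments) is sketched below. It assumes that `labels` is in the same order as the observations passed to linkage(). The helper name `link_colors_from_leaf_colors` and the demo data are hypothetical, and the small matrix is hand-written rather than computed.

```python
def link_colors_from_leaf_colors(Z, labels, leaf_colors, default="#808080"):
    """Map every cluster id in a linkage matrix to a color.

    Z           : linkage matrix (n-1 rows; columns 0 and 1 hold the merged ids)
    labels      : leaf labels, in the same order as the original observations
    leaf_colors : dict {label: color}
    default     : color for links whose two children have different colors
    """
    n_leaves = len(Z) + 1
    link_cols = {}
    for i, row in enumerate(Z):
        child_colors = []
        for child in (int(row[0]), int(row[1])):
            if child < n_leaves:      # original observation (leaf)
                child_colors.append(leaf_colors[labels[child]])
            else:                     # cluster created by an earlier row
                child_colors.append(link_cols[child])
        c1, c2 = child_colors
        link_cols[n_leaves + i] = c1 if c1 == c2 else default
    return link_cols


# Hand-written linkage matrix for 4 leaves (made-up distances):
# row 0 merges leaves 0 and 1 -> cluster 4
# row 1 merges leaves 2 and 3 -> cluster 5
# row 2 merges clusters 4 and 5 -> cluster 6
Z_demo = [[0, 1, 0.1, 2],
          [2, 3, 0.2, 2],
          [4, 5, 0.9, 4]]
labels_demo = ["F8_pre", "F8_post", "G2_pre", "G2_post"]
colors_demo = {"F8_pre": "#B061FF", "F8_post": "#B061FF",
               "G2_pre": "#61ffff", "G2_post": "#61ffff"}

link_cols_demo = link_colors_from_leaf_colors(Z_demo, labels_demo, colors_demo)
print(link_cols_demo)
```

With the real data, the result would be passed the same way as in the answer: link_color_func=lambda k: link_cols[k], where link_cols comes from calling the helper with Z, the label sequence (e.g. DF_dism.index) and D_leaf_colors.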

Ulrich Stern
  • Hey thanks for your answer, what is the best way to leave the `leaves` as `labels`? I was writing a backwards dictionary but the indices of `D_leaf_colors` from the `for-loop` are confusing. I have a lot of functions that depend on others so the indices throw it off a lot – O.rka Jul 05 '16 at 18:32
  • I'm going to try to work backwards from this. My actual leaf labels are like "F8_2/3/13_Pre" so using the indices to construct `link_cols` won't work with my real data. Going to mess with it and let you know. – O.rka Jul 05 '16 at 20:49
  • For `c1, c2 = (link_cols[x] if x > len(Z) else D_leaf_colors["attr_%d"%x] for x in i12)` how can `link_cols[x]` get `x` if `link_cols = {}` and isn't updated yet? I'm getting a key error – O.rka Jul 05 '16 at 21:11
  • 1
    Strange. The `x` should access only already set keys in `link_cols`. Can you `print Z[:,:2].astype(int)`? – Ulrich Stern Jul 05 '16 at 21:18
  • 1
    Is it possible the key error is from `D_leaf_colors`? – Ulrich Stern Jul 05 '16 at 21:20
  • 1
    Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/116518/discussion-between-ulrich-stern-and-o-rka). – Ulrich Stern Jul 05 '16 at 21:22
  • One more question: am I using my custom distance measure correctly? I'm reading that there is a bug if this is used in a particular way. I like computing my own distance measures. – O.rka Jul 06 '16 at 14:46
  • 1
    Do you have a link describing the bug? – Ulrich Stern Jul 06 '16 at 17:04
  • https://github.com/scipy/scipy/issues/2614 I should have used the term `issue` instead of `bug` btw – O.rka Jul 06 '16 at 18:25
  • 1
    Your code correctly passes the **condensed** ("n choose 2") distance matrix to `linkage()`. See also [this answer](http://stackoverflow.com/a/18954990/1628638). – Ulrich Stern Jul 07 '16 at 09:17
7

Two-liner for applying custom colormap to cluster branches:

import numpy as np
import matplotlib as mpl
from matplotlib.pyplot import cm
from scipy.cluster import hierarchy

cmap = cm.rainbow(np.linspace(0, 1, 10))
hierarchy.set_link_color_palette([mpl.colors.rgb2hex(rgb[:3]) for rgb in cmap])

You can then replace rainbow with any colormap and change 10 to the number of clusters you want.
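
If the goal is simply to have as many distinct colors as clusters so that SciPy does not repeat them, the palette can also be built without a matplotlib colormap. The sketch below swaps in the standard-library colorsys module and spaces the hues evenly. `distinct_hex_colors` is a hypothetical helper name, and the saturation and value defaults are arbitrary choices.

```python
import colorsys


def distinct_hex_colors(n_colors, saturation=0.65, value=0.9):
    """Return n_colors hex strings whose hues are evenly spaced."""
    palette = []
    for k in range(n_colors):
        r, g, b = colorsys.hsv_to_rgb(k / n_colors, saturation, value)
        palette.append("#%02x%02x%02x" % (round(r * 255), round(g * 255), round(b * 255)))
    return palette


palette_demo = distinct_hex_colors(10)
print(palette_demo)
```

The resulting list can be handed to hierarchy.set_link_color_palette(palette_demo) exactly like the hex list above; calling hierarchy.set_link_color_palette(None) afterwards restores SciPy's default palette.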

alelouis
  • It seems O.rka has trouble getting the colors he defined in a dictionary into the plot. Maybe you could adapt your minimum working example to show him how this is done. – NOhs Sep 11 '17 at 13:37
  • Oh right, was searching myself how to apply a custom colormap and could not find any easy solution. Hope it helps people searching for that special thing ;) – alelouis Sep 11 '17 at 17:39
0

This answer helped but wasn't trivial to translate to a more general case - here is a function running scipy's agglomerative clustering and plotting the respective dendrogram, with custom-provided colors, for a given distance threshold:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy

def rgb_hex(color):
    '''Converts an (r,g,b) color (either 0-1 or 0-255) to its hex representation.
    For ambiguous inputs made only of 0s and 1s, e.g. (0,0,1), the 0-1 range is assumed.'''
    message='color must be an iterable of length 3.'
    assert hasattr(color, '__iter__'), message
    assert len(color)==3, message
    if all(0 <= c <= 1 for c in color): color=[int(round(c*255)) for c in color] # in case provided rgb is 0-1
    color=tuple(color)
    return '#%02x%02x%02x' % color

def get_cluster_colors(n_clusters, my_set_of_20_rgb_colors, alpha=0.8, alpha_outliers=0.05):
    # append the alpha channel to every [r, g, b] entry of the palette
    cluster_colors = [list(c) + [alpha] for c in my_set_of_20_rgb_colors]
    outlier_color = [0,0,0,alpha_outliers]
    # cycle through the palette when there are more clusters than colors
    return [cluster_colors[i % len(cluster_colors)] for i in range(n_clusters)] + [outlier_color]

def cluster_and_plot_dendrogram(X, threshold, my_set_of_20_rgb_colors, method='ward', metric='euclidean', default_color='black'):
    # my_set_of_20_rgb_colors: list of [r, g, b] colors, forwarded to get_cluster_colors

    # perform hierarchical clustering
    Z              = hierarchy.linkage(X, method=method, metric=metric)

    # get cluster labels
    labels         = hierarchy.fcluster(Z, threshold, criterion='distance') - 1
    labels_str     = [f"cluster #{l}: n={c}\n" for (l,c) in zip(*np.unique(labels, return_counts=True))]
    n_clusters     = len(labels_str)

    cluster_colors = [rgb_hex(c[:-1]) for c in get_cluster_colors(n_clusters, my_set_of_20_rgb_colors, alpha=0.8, alpha_outliers=0.05)]
    cluster_colors_array = [cluster_colors[l] for l in labels]
    link_cols = {}
    for i, i12 in enumerate(Z[:,:2].astype(int)):
        c1, c2 = (link_cols[x] if x > len(Z) else cluster_colors_array[x] for x in i12)
        link_cols[i+1+len(Z)] = c1 if c1 == c2 else 'k'

    # plot dendrogram with colored clusters
    fig = plt.figure(figsize=(12, 5))
    plt.title('Hierarchical Clustering Dendrogram')
    plt.xlabel('Data points')
    plt.ylabel('Distance')

    # plot dendrogram based on clustering results
    hierarchy.dendrogram(
        Z,

        labels = labels,

        color_threshold=threshold,

        truncate_mode = 'level',
        p = 5,
        show_leaf_counts = True,
        leaf_rotation=90,
        leaf_font_size=10,
        show_contracted=False,

        link_color_func=lambda x: link_cols[x],
        above_threshold_color=default_color,
        distance_sort='descending',
        ax=plt.gca()
    )
    plt.axhline(threshold, color='k')
    for i, s in enumerate(labels_str):
        plt.text(0.8, 0.95-i*0.04, s,
                transform=plt.gca().transAxes,
                va='top', color=cluster_colors[i])
    
    fig.patch.set_facecolor('white')

    return labels # 0 indexed

This returns the cluster labels and plots the dendrogram with the custom cluster colors.

Hope this helps someone in the future.

Maxime Beau
  • This is nice, but for some reason it doesn't keep the distinction between singletons and (number of singletons in a given cluster). It outputs the same index across multiple clusters where the sample size is 1. – TunaFishLies Aug 14 '23 at 19:43
-1

I found a hackish solution. It does require the color threshold (I need it to reproduce the original coloring; otherwise the colors differ from those presented in the OP), but it could lead you to a solution. However, you may not have enough information to know how to order the color palette.

# Init
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

# Load data
from sklearn.datasets import load_diabetes

# Clustering
from scipy.cluster.hierarchy import dendrogram, fcluster, leaves_list, set_link_color_palette
from scipy.spatial import distance
from fastcluster import linkage # You can use SciPy one too

%matplotlib inline
# Dataset
A_data = load_diabetes().data
DF_diabetes = pd.DataFrame(A_data, columns = ["attr_%d" % j for j in range(A_data.shape[1])])

# Absolute value of correlation matrix, then subtract from 1 for dissimilarity
DF_dism = 1 - np.abs(DF_diabetes.corr())

# Compute average linkage
A_dist = distance.squareform(DF_dism.values)  # .as_matrix() was removed in pandas 1.0
Z = linkage(A_dist,method="average")

# The color mapping dict is not used in this approach
# Dendrogram
# `color_threshold=0.7` reproduces the coloring shown in the question
# Change the color palette; grey is left out because it is passed as above_threshold_color
set_link_color_palette(["#B061FF", "#61ffff"])
D = dendrogram(Z=Z, labels=DF_dism.index, color_threshold=.7, leaf_font_size=12, leaf_rotation=45, 
               above_threshold_color="grey")

The result: [dendrogram image]

rll
  • Did you run this answer? `link_color_func=getcolor` throws a `KeyError`. – wflynny Jul 05 '16 at 15:08
  • Yes, I was just figuring it out. It is corrected. The indices go from 10 to 18. Possibly they correspond to attributes 1 to 9; the mapping of the attributes is not correct, but it is the solution... – rll Jul 05 '16 at 15:16
  • That's not correct. The indices `n + 1` to `n + n` (10 to 18 here) correspond to the clusters in the linkage matrix `Z`. – wflynny Jul 05 '16 at 15:26
  • 1
    At least I understand better the problem now :) – rll Jul 05 '16 at 15:40