
I have the following code to perform hierarchical clustering on data:

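import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# `data` is the observation matrix to cluster
Z = linkage(data, method='weighted')
plt.subplot(2, 1, 1)
dendro = dendrogram(Z)
leaves = dendro['leaves']
print(leaves)
plt.show()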

However, in the dendrogram all the clusters have the same color (blue). Is there a way to use different colors according to the similarity between clusters?

curious

2 Answers


Look at the documentation. It looks like you can pass the link_color_func keyword or the color_threshold keyword to get different colors.
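For link_color_func, here is a minimal sketch of what the callback receives. The toy data, the palette, and the names colors, seen and link_color are my own assumptions for illustration. The point is that the function is called with the integer id of each merged cluster (n + i for row i of the linkage matrix), not with leaf labels, so a dict keyed by leaf labels would raise a KeyError.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
data = rng.random((6, 3))              # 6 observations, so leaf ids are 0..5
Z = linkage(data, method="weighted")
n = Z.shape[0] + 1                     # number of leaves

# Arbitrary palette (an assumption): one color per link, keyed by cluster id.
palette = ["red", "green", "purple", "orange", "black"]
colors = {n + i: palette[i] for i in range(n - 1)}

seen = []                              # ids the callback was actually called with

def link_color(k):
    seen.append(k)
    return colors[k]

# no_plot=True computes the coloring without needing matplotlib
d = dendrogram(Z, link_color_func=link_color, no_plot=True)
print(sorted(seen))
print(d["color_list"])
```

To color by leaf label instead, you would first have to decide how each merged cluster inherits a color from its children and build the id-keyed dict from that.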

Edit:

The default behavior of the dendrogram coloring scheme is, given a color_threshold = 0.7*max(Z[:,2]) to color all the descendent links below a cluster node k the same color if k is the first node below the cut threshold; otherwise, all links connecting nodes with distances greater than or equal to the threshold are colored blue [from the docs].

What the hell does this mean? Well, if you look at a dendrogram, you see different clusters linked together. The "distance" between two clusters is the height of the link between them. The color_threshold is the height below which new clusters will get different colors. If all your clusters are blue, then you need to raise your color_threshold. For example,

In [48]: mat = np.random.rand(10, 10)
In [49]: z = linkage(mat, method="weighted")
In [52]: d = dendrogram(z)
In [53]: d['color_list']
Out[53]: ['g', 'g', 'b', 'r', 'c', 'c', 'c', 'b', 'b']
In [54]: plt.show()

[Figure: dendrogram with the default color_threshold]

I can check what the default color_threshold is by

In [56]: 0.7*np.max(z[:,2])
Out[56]: 1.0278719020096947

If I lower the color_threshold, I get more blue because more links have distances greater than the new color_threshold. You can see this visually because all the links above 0.9 are now blue:

In [64]: d = dendrogram(z, color_threshold=.9)
In [65]: d['color_list']
Out[65]: ['g', 'b', 'b', 'r', 'b', 'b', 'b', 'b', 'b']
In [66]: plt.show()

[Figure: dendrogram with color_threshold=0.9]

If I increase the color_threshold to 1.2, the links below 1.2 will no longer be blue. Additionally, the cyan and red links will merge into a single color because their parent link is below 1.2:

[Figure: dendrogram with color_threshold=1.2]
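The same experiment as a standalone script, in case you want to try it without plotting. This is only a sketch: it uses its own random data (so the heights differ from the session above), passes no_plot=True, and sets above_threshold_color="grey" explicitly so that links at or above the threshold are easy to count whatever the default color of your SciPy version is. The helper n_above is a name I made up.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
z = linkage(rng.random((10, 10)), method="weighted")
default_thr = 0.7 * z[:, 2].max()      # what dendrogram uses when color_threshold is not given

def n_above(threshold):
    """Count the links drawn in the 'above threshold' color for a given cut height."""
    d = dendrogram(z, color_threshold=threshold,
                   above_threshold_color="grey", no_plot=True)
    return d["color_list"].count("grey")

# Raising the threshold leaves fewer links in the above-threshold color.
for factor in (0.5, 1.0, 1.3):
    thr = factor * default_thr
    print(f"color_threshold={thr:.3f}: {n_above(thr)} of {len(z)} links above threshold")
```

If every link in your own plot has the above-threshold color, printing Z[:, 2] next to 0.7*max(Z[:, 2]) shows quickly whether all merge heights sit above the default cut.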

wflynny
  • Can you post an example? I am trying to use them but I don't know what to pass as arguments to both – curious Jul 05 '13 at 16:32
  • Maybe it has to do with my distance matrix. All pairwise distances are around 0.8, so maybe that's why I am getting the same color – curious Jul 05 '13 at 23:07
  • how do you use `link_color_func` if you have a dictionary that maps the leaves/nodes to their colors? – O.rka Jul 01 '16 at 23:45
  • Have you tried `dendrogram(Z, link_color_func=lambda k: colors[k])` where `colors` is a dict that maps ids of the links (upside-down Us) to colors? – wflynny Jul 01 '16 at 23:55
  • Hey thanks for getting back to me. I posted my example right here http://stackoverflow.com/questions/38153829/color-leaves-of-scipy-dendrogram-in-python-link-color-func . I used it in that same way but i'm getting a `key error` b/c my keys are the `leaf labels` and the key it wants is an `int`? – O.rka Jul 03 '16 at 01:18

The following code produces a dendrogram in which each leaf is colored according to the cluster it belongs to. If, while merging, it encounters two clusters with different colors and neither is a leaf, it falls back to the default color, dflt_col = "tab:blue".

Note: the link_matrix function is a plain copy of the one from the AgglomerativeClustering example in scikit-learn.

Explaining everything it does would be really time-consuming, so just print any step that is unclear.

import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform, pdist

from matplotlib.pyplot import cm

from sklearn.cluster import AgglomerativeClustering
import matplotlib.colors as clrs

def link_matrix(model, **kwargs):
    # Create the linkage matrix, as in the standard scikit-learn documentation example
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count
    
    Z = np.column_stack(
        [model.children_, model.distances_, counts]
    ).astype(float)

    return Z


def assign_link_colors(model):
    n_clusters = len(model.Z)  # number of merges (n_samples - 1); ids above this are merged clusters
    scl_map_to_hex = mpl.cm.ScalarMappable(cmap = "jet").to_rgba(np.unique(model.labels_), norm = True) #colors.to_hex()
    col = [clrs.to_hex(rgb) for rgb in scl_map_to_hex]

    dic_labels = {s:[c, idx] for s, c, idx in zip(np.arange(len(model.feature_names_in_), dtype = int), model.feature_names_in_, model.labels_, )}
    model.dict_idx_name_cl = {k: v for k, v in sorted(dic_labels.items(), key=lambda item: item[1][1])}

    

    dflt_col = "tab:blue"   # Unclustered blue
    model.dict_colors = {x:col[model.dict_idx_name_cl[x][1]] for x in model.dict_idx_name_cl}
        
    link_cols = {}
    for i, i_cl in enumerate(model.Z[:,:2].astype(int)): # select only the first two columns (child ids)
        c1, c2 = (link_cols[x] if x > n_clusters else model.dict_colors[x] for x in i_cl)

        # Choice of coloring assignment: if same color --> ok; if no leaf, dft ("undefined") color 
        if c1 == c2:
            tmp_cl = c1 
        elif min(i_cl) <= n_clusters: # select the leaf color
            tmp_cl = model.dict_colors[min(i_cl)]
        else: 
            tmp_cl = dflt_col
        link_cols[i+1+n_clusters] = tmp_cl
        #print(f'-link_cols: {link_cols}',)
    
    return link_cols

def mod_2_dendrogram(model, **kwargs):

    plt.style.use('seaborn-v0_8-whitegrid')  # 'seaborn-whitegrid' was renamed in matplotlib 3.6
    plt.figure(figsize=(int(.5 * len(model.feature_names_in_)), 7))

    print(f'-0.7*max(Z[:,2]): {0.7*max(model.Z[:,2])}',)

    # Plot the corresponding dendrogram
    ddata = dendrogram(model.Z, #count_sort = "descending", 
                        **kwargs)

    # Plot distances on the dendrogram
    # plot cluster points & distance labels
    y_lim = model.dist_thr  # use the threshold stored on the model, not a global
    for i, d, c in zip(ddata['icoord'], ddata['dcoord'], ddata['color_list']):
        x = sum(i[1:3])/2
        y = d[1]
        if y > y_lim:
            plt.plot(x, y, 'o', c=c, markeredgewidth=0)
            plt.annotate(np.round(y,2), (x, y), xytext=(0, -5),
                        textcoords='offset points',
                        va='top', ha='center', fontsize=9)

    plt.axhline(y=model.dist_thr, color='orange', alpha = 0.7, linestyle='--', label = f"threshold: {int(model.dist_thr)}")
    plt.title(f'Agglomerative Dendrogram with n_clust: {model.n_clusters_}')
    plt.xlabel('Clusters')
    plt.ylabel('Distance')
    plt.legend()

    return ddata

Now, the running example:

import string
import pandas as pd
np.random.seed(0)
dist = np.random.randint(1e4, size = (10,10))
np.fill_diagonal(dist, 0)
dist = pd.DataFrame(dist, columns = list(string.ascii_lowercase)[:dist.shape[0]])

dist_thr = 1.5e3
model = AgglomerativeClustering(distance_threshold = dist_thr, n_clusters=None, linkage = "single", metric = "precomputed",)
model.dist_thr = dist_thr

model = model.fit(dist)
model.Z = link_matrix(model)

link_cols = assign_link_colors(model)

_ = mod_2_dendrogram(model, labels = list(dist.columns),  # a plain list avoids issues with a pandas Index
                     link_color_func = lambda x: link_cols[x])

Here is the output
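If you do not have scikit-learn at hand, the same idea can be sketched with SciPy alone. This is a simplified variant, not the code above: it gets flat cluster labels from fcluster, uses a made-up palette and cut height, and applies a stricter rule (a link keeps a color only if both children already share it, otherwise it gets the default color).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(0)
X = rng.random((8, 2))                       # toy observations (assumption)
Z = linkage(X, method="single")
n = Z.shape[0] + 1                           # number of leaves

threshold = 0.7 * Z[:, 2].max()              # hypothetical cut height
labels = fcluster(Z, t=threshold, criterion="distance")  # 1-based flat cluster per leaf

palette = ["tab:red", "tab:green", "tab:purple", "tab:orange",
           "tab:brown", "tab:pink", "tab:olive", "tab:cyan"]
default = "tab:blue"                         # color for links joining different clusters

# Start from the leaf colors, then propagate upward along the merges.
node_color = {leaf: palette[(labels[leaf] - 1) % len(palette)] for leaf in range(n)}
link_cols = {}
for i, (a, b) in enumerate(Z[:, :2].astype(int)):
    ca, cb = node_color[a], node_color[b]
    node_color[n + i] = ca if ca == cb else default
    link_cols[n + i] = node_color[n + i]     # row i of Z creates cluster id n + i

d = dendrogram(Z, link_color_func=lambda k: link_cols[k], no_plot=True)
print(d["color_list"])
```

Drop no_plot=True (and call plt.show()) to draw it. Links below the cut take their cluster's color, and everything above the cut comes out in the default blue.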

tdy
mrbigcorse