I'm trying to label the nodes in a dendrogram produced by scipy.cluster.hierarchy.dendrogram.

I'm working with the augmented dendrogram suggested here, trying to replace the inter-cluster distance labels (1.01, 1.57) in that example with strings such as 'a+c' and 'a+b+c'.

An example linkage matrix is below:

Z = array([[ 2,  7,  0,  2],
           [ 0,  9,  0,  2],
           [ 1,  6,  0,  2],
           [ 5, 10,  0,  3],
           [11, 12,  0,  4],
           [ 4,  8,  0,  2],
           [14, 15,  0,  6],
           [13, 16,  0,  9],
           [ 3, 17,  1, 10]])

For this example I created temporary labels as follows:

labels = [str(Z[ind,0].astype(int))+'+'+str(Z[ind,1].astype(int)) for ind in range(len(Z))]
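
Since the eventual goal is strings such as 'a+c', here is a minimal sketch of how such names could be derived from the same linkage matrix by following the merges row by row. The helper `member_labels` and the leaf names 'a' to 'j' are hypothetical and only for illustration:

```python
import numpy as np

def member_labels(Z, names):
    """Build one 'x+y+...' label per row of Z from the leaf names.

    Hypothetical helper: row k of Z creates cluster n + k, so the list of
    members grows by one entry per merge.
    """
    members = [[name] for name in names]      # clusters 0..n-1 are the leaves
    out = []
    for a, b in Z[:, :2].astype(int):
        merged = members[a] + members[b]      # leaves contained in the new cluster
        members.append(merged)
        out.append('+'.join(merged))
    return out

Z = np.array([[ 2,  7,  0,  2],
              [ 0,  9,  0,  2],
              [ 1,  6,  0,  2],
              [ 5, 10,  0,  3],
              [11, 12,  0,  4],
              [ 4,  8,  0,  2],
              [14, 15,  0,  6],
              [13, 16,  0,  9],
              [ 3, 17,  1, 10]])

names = list('abcdefghij')                    # assumed names for the 10 observations
merged_labels = member_labels(Z, names)
print(merged_labels[0], merged_labels[3])     # c+h f+c+h
```

These labels are in the row order of Z, so they still have to be matched to the plotted nodes.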

And modified the augmented_dendrogram to:

import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as cl

def augmented_dendrogram(labels, *args, **kwargs):
    ddata = cl.dendrogram(*args, **kwargs)
    if not kwargs.get('no_plot', False):
        for ind, (i, d) in enumerate(zip(ddata['icoord'], ddata['dcoord'])):
            # Each link is drawn as a "U"; its top bar spans i[1]..i[2] at height d[1].
            x = 0.5 * sum(i[1:3])
            y = d[1]
            plt.plot(x, y, 'ro')
            plt.annotate(labels[ind], (x, y), xytext=(10,15),
                         textcoords='offset points',
                         va='top', ha='center')
    return ddata

However, the resulting labels are not aligned with the nodes in the dendrogram:

[Image: dendrogram with misaligned node labels]

How can I align the labels to the correct node?

user666

2 Answers

Here is a solution I found while working on a similar problem. Note that the linkage matrix given in the OP lacks meaningful distances (the third column is almost entirely zero). I insert artificial, strictly increasing distances and then use them to identify which row of the linkage matrix each node in the dendrogram corresponds to. This relies on the rows of the linkage matrix being ordered by distance, which SciPy's linkage function already produces (at least when using ward as the linkage method).

Here is my code:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import dendrogram

    Z = np.array([[ 2,  7,  0,  2],
           [ 0,  9,  0,  2],
           [ 1,  6,  0,  2],
           [ 5, 10,  0,  3],
           [11, 12,  0,  4],
           [ 4,  8,  0,  2],
           [14, 15,  0,  6],
           [13, 16,  0,  9],
           [ 3, 17,  1, 10]], dtype=float)

    Z[:, 2] = np.arange(1., len(Z)+1)
    labels = [str(len(Z)+1+ind)+'='+str(Z[ind,0].astype(int))+'+'+str(Z[ind,1].astype(int)) for ind in range(len(Z))]

    fig, ax = plt.subplots(1, 1, figsize=(10, 10))
    dn = dendrogram(Z, ax=ax)
    ii = np.argsort(np.array(dn['dcoord'])[:, 1])
    for j, (icoord, dcoord) in enumerate(zip(dn['icoord'], dn['dcoord'])):
        x = 0.5 * sum(icoord[1:3])
        y = dcoord[1]
        ind = np.nonzero(ii == j)[0][0]
        ax.annotate(labels[ind], (x,y), va='top', ha='center')
    plt.tight_layout()
    plt.savefig('./tmp.png')
    plt.close(fig)

The result is: [Image: the resulting dendrogram with labeled nodes]
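
The index lookup in the loop above can also be written as a single rank computation and checked without plotting. This is only a sketch under the same assumption as above (merge heights in the third column are distinct and sorted in ascending order); the helper name `link_rows` is hypothetical:

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram

def link_rows(ddata):
    """For each drawn link, return the index of the matching row of Z.

    Assumes the merge heights are distinct and that Z is sorted by height,
    so the rank of a link's height equals its row index in Z.
    """
    heights = np.array(ddata['dcoord'])[:, 1]   # height of each link's top bar
    return np.argsort(np.argsort(heights))      # rank of each height

Z = np.array([[ 2,  7,  0,  2],
              [ 0,  9,  0,  2],
              [ 1,  6,  0,  2],
              [ 5, 10,  0,  3],
              [11, 12,  0,  4],
              [ 4,  8,  0,  2],
              [14, 15,  0,  6],
              [13, 16,  0,  9],
              [ 3, 17,  1, 10]], dtype=float)
Z[:, 2] = np.arange(1., len(Z) + 1)             # artificial, strictly increasing distances

ddata = dendrogram(Z, no_plot=True)             # compute the layout only
rows = link_rows(ddata)

# Sanity check: the height stored in the matched row equals the drawn height.
drawn_heights = np.array(ddata['dcoord'])[:, 1]
print(np.allclose(Z[rows, 2], drawn_heights))
```

With tied heights the ranks are ambiguous, so this approach (like the loop above) should not be relied on in that case.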

Roger Vadim

If I understand your question correctly, then you're looking for the field 'leaves' within the dictionary returned by scipy's dendrogram function. As per scipy's documentation:

For each i, H[i] == j, cluster node j appears in position i in the left-to-right traversal of the leaves, where j < 2n-1 and i < n. If j is less than n, the i-th leaf node corresponds to an original observation. Otherwise, it corresponds to a non-singleton cluster.

In plain English, that means you can use this field to sort your labels into the correct order, e.g. by changing the corresponding line to:

plt.annotate(labels[ddata['leaves'][ind]], (x, y), xytext=(10,15), textcoords='offset points', va='top', ha='center')
alowet
  • As far as I can judge, this solution doesn't work, since `ddata['leaves']` includes only the leaves (i.e., only the terminal, but not the inner nodes). – Roger Vadim May 17 '21 at 14:23
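
The point raised in the comment can be checked with a short sketch: for a full (non-truncated) dendrogram, `ddata['leaves']` has one entry per original observation, while `ddata['icoord']` has one entry per merge. The distances used here are the artificial ones from the first answer:

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram

Z = np.array([[ 2,  7,  0,  2],
              [ 0,  9,  0,  2],
              [ 1,  6,  0,  2],
              [ 5, 10,  0,  3],
              [11, 12,  0,  4],
              [ 4,  8,  0,  2],
              [14, 15,  0,  6],
              [13, 16,  0,  9],
              [ 3, 17,  1, 10]], dtype=float)
Z[:, 2] = np.arange(1., len(Z) + 1)    # artificial distances, as in the first answer

ddata = dendrogram(Z, no_plot=True)

n_leaves = len(ddata['leaves'])        # one entry per original observation
n_links = len(ddata['icoord'])         # one entry per merge (inner node)
print(n_leaves, n_links)               # 10 9

# Every entry of 'leaves' is an observation index, not an inner-node index.
print(all(j < len(Z) + 1 for j in ddata['leaves']))
```

So indexing the inner-node labels with `ddata['leaves'][ind]` mixes two different index spaces.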