Node indexing in hierarchical clustering dendrograms

Question

I'm looking to annotate a hierarchical clustering dendrogram, but I have some trouble associating the node indices produced by scipy.cluster.hierarchy.dendrogram when plotting, to the node indices in the original linkage matrix (e.g. produced with scipy.cluster.hierarchy.linkage).

For instance, say we have the following example (adapted from this SO question),

import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from matplotlib import pyplot as plt
%matplotlib inline

# generate two clusters: a with 10 points, b with 5:
np.random.seed(1)
a = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[10,])
b = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[5,])
X = np.concatenate((a, b),)
Z = linkage(X, 'ward')
# make distances between pairs of children uniform 
# (re-scales the horizontal (distance) axis when plotting)
Z[:,2] = np.arange(Z.shape[0])+1 

def plot_dendrogram(linkage_matrix, **kwargs):

    ddata = dendrogram(linkage_matrix, **kwargs)
    idx = 0
    for i, d, c in zip(ddata['icoord'], ddata['dcoord'], 
                       ddata['color_list']):       
        x = 0.5 * sum(i[1:3])
        y = d[1]
        plt.plot(y, x, 'o', c=c)
        plt.annotate("%.3g" % idx, (y, x), xytext=(15, 5),
                             textcoords='offset points',
                             va='top', ha='center')
        idx += 1
plot_dendrogram(Z, labels=np.arange(X.shape[0]),
                truncate_mode='level', show_leaf_counts=False, 
               orientation='left')

which produces the following dendrogram:

The original X matrix has (X.shape[0] == 15) samples, and the tick labels on the vertical axis corresponds to the sample id for each tree leaf. The number at each node is the id of that node as returned by the dendrogram function. Now if we look at the original linkage matrix, the 1st two columns give the children of each tree node,

print(Z[:,:2].astype('int'))
  [[ 0  3]
   [ 1  8]
   [ 6 16]
   [ 2  5]
   ...
   [22 24]
   [23 25]
   [26 27]]

For instance, the node 0 in the linkage matrix has for children leaves [0, 3], but on the dendrogram above it is labeled as number 9. Similarly the node 1, is labeled as number 4, etc.

I was wondering what would be the simplest way of finding the correspondence between these 2 indices? I looked at the dendrogram function but didn't see any simple way of doing that (particularly if we truncate the dendrogram to some level (with e.g. truncate_mode='level', p=2)...

Note: I'm actually using a linkage matrix given by sklearn.cluster.AgglomerativeClustering but that doesn't really matter for this question (as illustrated in this github issue).

Note2: alternatively if there is a way to compute the list of leaves for every dendrogram node that would also solve my problem.

Prune · Accepted Answer · 2016-11-01T16:48:10.717

What I can deduce from the posted material are things I expect you've already noticed:

The linkage matrix nodes are numbered in order of the clustering. The original nodes are 0-14. [0 3] becomes node 15, [1 8] is node 16, [6 [1 8]] is node 17, etc.
The dendogram nodes are numbered from the root, depth-first, upper (right) branch before left (lower), with labels N-1 down to 0.
Right and left are determined by the columns in the linkage matrix.

The linkage matrix has 2N+1 rows: N+1 original nodes and N clusterings. To reconstruct the dendogram labels, start at the last row of the matrix, [26 27]. This gets label N-1, or 13. Recur on the right node (column 1), then the left (column 0). Decrement the label value each time you assign it.

Is that clear enough to follow?

Recursion depth is a problem for you? The tree depth shouldn't be much more than log2(N) for N nodes, or about 22-25 for a million nodes. Your run-time stack can't handle that? Mine defaults to one thousand.

In any case, it's not that hard to convert to iteration. Create a stack of unresolved nodes, starting with the root node.

stack = [root]
while stack is not empty:
    node = stack.pop()
    stack.push (node.right)
    stack.push (node.left)
    process node  # assign node number, etc.

This makes it even easier to maintain the node counter, as it's now a local variable. Also note that this is a basic graph-search algorithm: "While we have nodes, grab the next one on the list, push its neighbors onto the list (but check to see whether each one is already handled), and process the current node."

For depth-first, use a stack; for breadth-first, use a queue.

This is great, thanks, I failed to see the pattern. Here is an implementation using recursion following your suggestions that works http://pastebin.com/zgkb3Amk . However, because of the limitations on the number of recursions in python, this would not scale to a tree with e.g. 1e6 nodes (and I would rather not mess with `sys.setrecursionlimit`). Would you have any suggestions on how to convert this into a non recursive (i.e. using an iterator or a generator) algorithm? Thanks! P.S: writing it in typed Cython to avoid the recursion limit is also not an option for me. — rth, Nov 01 '16 at 12:11
Here is the same code using chained generators http://pastebin.com/PU7YutyY though I'm not sure if it's actually better performance wise for large tree sizes (seems to me it's still a recursion just using generators)? — rth, Nov 01 '16 at 12:24
You are right, the recursion depth is not N, as I originally though but much smaller log2(N), so it should not be an issue. Also thanks for the iterative code, it is interesting ! — rth, Nov 01 '16 at 17:45
Glad to help. Don't worry about the slip; I've seen people here actually implement O(N) recursion! The iterative code is hardly new: it's a summary of Dijkstra's algorithm for graph traversal. — Prune, Nov 01 '16 at 18:13

score 0 · Answer 2 · edited Apr 26 '22 at 19:32

0

Another solution I have found is to compute all the children for a given node from the dendogram figure (code available in the _DendrogramChildren class here¹) using the geometric coordinates on the figure. Then because each node has a unique list of children, we can associate the node index on the figure with that of the linkage matrix.

However this solution is not satisfactory (uses coordinates of the nodes on the figure), does not scale well and would not work with a truncated dendogram. The solution proposed by @Prune above is more appropriate.

¹ not re-posting it here as it's too long

edited Apr 26 '22 at 19:32

joanis

10,635
14
30
40

answered Nov 01 '16 at 11:09

rth

10,680
7
53
77

It's not too long, and certainly an interesting approach! Please include it, as well as defining the **Z** structure so that the code executes cleanly. – Prune Nov 01 '16 at 16:33
1

Yes, the generator is still recursive. There's nothing wrong with that; my algorithm is also recursive. – Prune Nov 01 '16 at 16:36

Node indexing in hierarchical clustering dendrograms

2 Answers2

Linked