I'm looking to annotate a hierarchical clustering dendrogram, but I have some trouble associating the node indices produced by scipy.cluster.hierarchy.dendrogram when plotting with the node indices in the original linkage matrix (e.g. produced with scipy.cluster.hierarchy.linkage).
For instance, say we have the following example (adapted from this SO question),
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from matplotlib import pyplot as plt
%matplotlib inline
# generate two clusters: a with 10 points, b with 5:
np.random.seed(1)
a = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[10,])
b = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[5,])
X = np.concatenate((a, b),)
Z = linkage(X, 'ward')
# make distances between pairs of children uniform
# (re-scales the horizontal (distance) axis when plotting)
Z[:,2] = np.arange(Z.shape[0])+1
def plot_dendrogram(linkage_matrix, **kwargs):
    ddata = dendrogram(linkage_matrix, **kwargs)
    idx = 0
    for i, d, c in zip(ddata['icoord'], ddata['dcoord'],
                       ddata['color_list']):
        x = 0.5 * sum(i[1:3])
        y = d[1]
        plt.plot(y, x, 'o', c=c)
        plt.annotate("%.3g" % idx, (y, x), xytext=(15, 5),
                     textcoords='offset points',
                     va='top', ha='center')
        idx += 1

plot_dendrogram(Z, labels=np.arange(X.shape[0]),
                truncate_mode='level', show_leaf_counts=False,
                orientation='left')
which produces the following dendrogram:
The original X matrix has 15 samples (X.shape[0] == 15), and the tick labels on the vertical axis correspond to the sample id for each tree leaf. The number at each node is the id of that node as returned by the dendrogram function. Now if we look at the original linkage matrix, the first two columns give the children of each tree node:
print(Z[:,:2].astype('int'))
[[ 0 3]
[ 1 8]
[ 6 16]
[ 2 5]
...
[22 24]
[23 25]
[26 27]]
For instance, node 0 in the linkage matrix has the leaves [0, 3] as children, but on the dendrogram above it is labeled as number 9. Similarly, node 1 is labeled as number 4, etc.
What would be the simplest way of finding the correspondence between these two sets of indices? I looked at the dendrogram function but didn't see any simple way of doing that, particularly if we truncate the dendrogram to some level (e.g. with truncate_mode='level', p=2).
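One idea I have considered, as a rough sketch only: each entry of ddata['dcoord'] has the form [left child height, merge height, merge height, right child height], and the merge height is the value from the third column of the linkage matrix. Since the distances in Z were made unique above, matching on that height might be enough to recover the linkage row of each plotted link. The helper name rows_for_links and the tolerance are my own assumptions, and the approach breaks down when two merges share the same height.

```python
def rows_for_links(dcoord, heights, tol=1e-9):
    """Map each plotted dendrogram link to a row of the linkage matrix.

    dcoord  : ddata['dcoord'] as returned by scipy's dendrogram()
    heights : the distance column of the linkage matrix, e.g. Z[:, 2]

    Assumes every merge height is unique (true here because Z[:, 2]
    was overwritten with 1, 2, 3, ...). Raises if a height is ambiguous.
    """
    rows = []
    for d in dcoord:
        merge_height = d[1]  # d[1] == d[2] is the height of this merge
        matches = [i for i, h in enumerate(heights)
                   if abs(h - merge_height) <= tol]
        if len(matches) != 1:
            raise ValueError("height %r matches %d linkage rows"
                             % (merge_height, len(matches)))
        rows.append(matches[0])
    return rows
```

With the example above this would be called as rows_for_links(ddata['dcoord'], Z[:, 2]); adding X.shape[0] to a returned row gives the cluster id used in the first two columns of Z.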
Note: I'm actually using a linkage matrix given by sklearn.cluster.AgglomerativeClustering, but that doesn't really matter for this question (as illustrated in this github issue).
Note2: alternatively if there is a way to compute the list of leaves for every dendrogram node that would also solve my problem.
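For the linkage matrix itself (as opposed to the plotted nodes), the leaves under each merge can be collected in a single pass. This is a minimal sketch that relies on the SciPy convention that ids below the number of samples are original observations and that row i of the linkage matrix creates the cluster with id n_samples + i; the function name leaves_per_node is hypothetical.

```python
def leaves_per_node(children, n_samples):
    """Return {cluster_id: sorted leaf ids} for every merge.

    children  : iterable of (left, right) id pairs,
                e.g. Z[:, :2].astype(int)
    n_samples : number of original observations, e.g. X.shape[0]

    Assumes the SciPy linkage convention: ids < n_samples are leaves,
    and row i creates the cluster with id n_samples + i.
    """
    # Start with every leaf containing only itself.
    leaves = {i: [i] for i in range(n_samples)}
    for row, (left, right) in enumerate(children):
        # Children always have smaller ids, so they are already known.
        merged = leaves[int(left)] + leaves[int(right)]
        leaves[n_samples + row] = sorted(merged)
    # Keep only the internal nodes (the merges).
    return {k: v for k, v in leaves.items() if k >= n_samples}
```

With the example above, leaves_per_node(Z[:, :2].astype(int), X.shape[0]) would have one entry per row of Z, keyed by cluster id.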