5

I'm looking for a way to get the coordinates of a cluster point in the dendrogram plot based on its ClusterNode return by to_tree.

Using scipy to build a dendogram from data such as:

X = data
Y = pdist(X)
Z = linkage(Y)
dend = dendrogram(Z)
rootnode, nodesList = to_tree(Z, rd=True)

What I would like to do is build a function get_coords(somClusterNode) that would return the tuple (x, y) specifying the position of the node in the plot.

Thanks to this answer, I managed to figure out how to get the position from the dendrogram return values, such as:

i, d = list(zip(dend['icoord'], dend['dcoord']))[-1]
x = 0.5 * sum(i[1:3])
y = d[1]
plt.plot(x, y, 'ro')

But I can figure out a relation between the nodesList ordering and the icoord/dcoord ordering in order to map one to the other.

Do you have any idea where I could look for ?

Thanks for your help !

Community
  • 1
  • 1
sereizam
  • 2,048
  • 3
  • 20
  • 29
  • What version of scipy are you using? When I try to run your code I get an error: `ValueError: Valid methods when the raw observations are omitted are 'single', 'complete', 'weighted', and 'average'.` Are you sure that line 3 should not be `Z = linkage(X, method="ward")`? – Paul Brodersen Apr 20 '17 at 10:01
  • I use SciPy v.0.19.0 with python v.3.5.2 – sereizam Apr 20 '17 at 10:23
  • It seems that both are compatible : "The input y may be either a 1d compressed distance matrix or a 2d array of observation vectors." in https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html#scipy.cluster.hierarchy.linkage – sereizam Apr 20 '17 at 10:27
  • Yeah, but if you specify method="ward", my version will only work when providing the original observations. Upgrading my scipy installation to see if the issue persists... – Paul Brodersen Apr 20 '17 at 10:32
  • ok. Then any method will do fine. I don't think my question depend on the linkage method I've chosen. I edited the post accordingly. – sereizam Apr 20 '17 at 10:34
  • Yeah, but I couldn't get your code to run. Was just a version issue. Your code is fine. – Paul Brodersen Apr 20 '17 at 10:38

2 Answers2

3

Each dendrogram maps to only one tree of ClusterNodes, but any tree of ClusterNodes could map to an infinite number of dendrograms. Hence your mapping from node ID to (x,y) positions should probably just be another field in your dendrogram data structure instead of being a function of a ClusterNode. Instead of defining a function get_coords, I hence appends a dictionary to dend that maps node IDs to (x,y) coordinates. You can access the positions with

x,y = dend['node_id_to_coord'][node_id] # node_id is an integer as returned by ClusterNode.id

Code:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, to_tree
from scipy.spatial.distance import pdist

# create some random data
X = np.random.rand(10, 3)

# get dendrogram
Z = linkage(pdist(X), method="ward")
dend = dendrogram(Z)

# ----------------------------------------
# get leave coordinates, which are at y == 0

def flatten(l):
    return [item for sublist in l for item in sublist]
X = flatten(dend['icoord'])
Y = flatten(dend['dcoord'])
leave_coords = [(x,y) for x,y in zip(X,Y) if y==0]

# in the dendogram data structure,
# leave ids are listed in ascending order according to their x-coordinate
order = np.argsort([x for x,y in leave_coords])
id_to_coord = dict(zip(dend['leaves'], [leave_coords[idx] for idx in order])) # <- main data structure

# ----------------------------------------
# get coordinates of other nodes

# this should work but doesn't:

# # traverse tree from leaves upwards and populate mapping ID -> (x,y);
# # use linkage matrix to traverse the tree optimally
# # (each row in the linkage matrix corresponds to a row in dend['icoord'] and dend['dcoord'])
# root_node, node_list = to_tree(Z, rd=True)
# for ii, (X, Y) in enumerate(zip(dend['icoord'], dend['dcoord'])):
#     x = (X[1] + X[2]) / 2
#     y = Y[1] # or Y[2]
#     node_id = ii + len(dend['leaves'])
#     id_to_coord[node_id] = (x, y)

# so we need to do it the hard way:

# map endpoint of each link to coordinates of parent node
children_to_parent_coords = dict()
for i, d in zip(dend['icoord'], dend['dcoord']):
    x = (i[1] + i[2]) / 2
    y = d[1] # or d[2]
    parent_coord = (x, y)
    left_coord = (i[0], d[0])
    right_coord = (i[-1], d[-1])
    children_to_parent_coords[(left_coord, right_coord)] = parent_coord

# traverse tree from leaves upwards and populate mapping ID -> (x,y)
root_node, node_list = to_tree(Z, rd=True)
ids_left = range(len(dend['leaves']), len(node_list))

while len(ids_left) > 0:

    for ii, node_id in enumerate(ids_left):
        node = node_list[node_id]
        if (node.left.id in id_to_coord) and (node.right.id in id_to_coord):
            left_coord = id_to_coord[node.left.id]
            right_coord = id_to_coord[node.right.id]
            id_to_coord[node_id] = children_to_parent_coords[(left_coord, right_coord)]

    ids_left = [node_id for node_id in range(len(node_list)) if not node_id in id_to_coord]

# plot result on top of dendrogram
ax = plt.gca()
for node_id, (x, y) in id_to_coord.iteritems():
    if not node_list[node_id].is_leaf():
        ax.plot(x, y, 'ro')
        ax.annotate(str(node_id), (x, y), xytext=(0, -8),
                    textcoords='offset points',
                    va='top', ha='center')

dend['node_id_to_coord'] = id_to_coord

enter image description here

Paul Brodersen
  • 11,221
  • 21
  • 38
  • Thank you so much ! It was days I was trying to solve this ! Ty, ty, ty ! – sereizam Apr 20 '17 at 12:44
  • I tried to edit to change `.iteritems()` to `.items()` to make it compatible with python3 however, stack doesn't allow edit inferior to 6 characters. – sereizam Apr 20 '17 at 12:45
  • 1
    My pleasure. I will make the edit in a second. Also, I just figured out that I could use the linkage matrix to traverse the tree more effectively. Edit incoming. – Paul Brodersen Apr 20 '17 at 12:49
  • 1
    Ok, this version is much simpler now. I somehow hadn't realised that each link in the linkage matrix corresponds to the `dendrogram['icoord']` and `dendrogram['dcoords']`. – Paul Brodersen Apr 20 '17 at 13:06
  • I just figured out that the new version place the point in the wrong place (I can't figure out why...). Rolling back to the first version works perfectly. – sereizam Apr 20 '17 at 15:54
  • 1
    Yeah, you are right, the node IDs sometimes end up in all the wrong places. I will roll back, too. – Paul Brodersen Apr 20 '17 at 16:04
  • `while len(ids_left) > 0:` `ids_left = [node_id for node_id in range(len(node_list)) if not node_id in id_to_coord]` This two line of code created trouble for me when I tried running them on my truncated dendogram. By omitting this two line, this code served my purpose. – avijit May 12 '19 at 12:43
0

There is another way to do it:

The ids of the dendogram appear to be generated by a reversed right to left traversal of the tree. This allows us to construct a translation from node.id to the index of icoord and dcoord as follows:

def rl_traversal(node):
    # skipping leaves
    if not node.is_leaf():
        yield node.id
        yield from rl_traversal(node.right)
        yield from rl_traversal(node.left)

id_map = dict(zip( rl_traversal(root), reversed(range(root.get_count()-1))) ))
# id_map[node_id] = dendogram_id

The coordinates of a node can then be obtained via dendo['icoord'][id_map[node_id]]

Hyperplane
  • 1,422
  • 1
  • 14
  • 28