4

I have five points and I need to create dendrogram from these. The function 'dendrogram' can be used to find the ordering of these points as shown below. However, I do not want to use dendrogram as it is slow and result in error for large number of points (I asked this question here Python alternate way to find dendrogram). Can someone points me how to convert the 'linkage' output (Z) to the "dendrogram(Z)['ivl']" value.

>>> from hcluster import pdist, linkage, dendrogram
>>> import numpy
>>> from numpy.random import rand
>>> x = rand(5,3)
>>> Y = pdist(x)
>>> Z = linkage(Y)
>>> Z
array([[ 1.        ,  3.        ,  0.11443378,  2.        ],
       [ 0.        ,  4.        ,  0.47941843,  2.        ],
       [ 5.        ,  6.        ,  0.67596472,  4.        ],
       [ 2.        ,  7.        ,  0.79993986,  5.        ]])
>>> 



>>> dendrogram(Z)['ivl']
['2', '1', '3', '0', '4']
>>> 
Community
  • 1
  • 1
Curious
  • 3,507
  • 8
  • 28
  • 30

2 Answers2

10

There is a dedicated function for calculating linearized leaf orders in scipy. Here it is. scipy.cluster.hierarchy.leaves_list.

Sadık Yıldız
  • 101
  • 1
  • 4
  • While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. – Dijkgraaf May 26 '15 at 23:27
  • @Dijkgraaf There is no point in describing entire clustering/sorting algorithm here. That's not the question at all. This is an essential part of the package, and it is specifically implemented for this purpose, so it's very unlikely to expect a major change. – Sadık Yıldız Jun 23 '15 at 19:08
  • You can add an example of how to use the function, what parameters are expected, or similar. As it stands, your answer is ***borderline*** link only, and therefore risks deletion without you editing it to add more detail. – Matt Jun 28 '15 at 05:34
3

Why is it slow? Sure, the naive way of computing linkage clustering is O(n^3) but for n=5 this is as cheap as it gets...

For the format of the scipy linkage matrix, see this question: scipy linkage format

Note that you may still need to sort the data optimally. The linkage matrix encoding above gives

  • Element 1 and Cluster 3 join at height 0.1144 (into a 2 element cluster, #5)
  • Element 0 and Cluster 4 join at height 0.7999 (into a 2 element cluster, #6)
  • Cluster 5 and Cluster 6 join at height 0.6759 (into a 4 element cluster, #7)
  • Element 2 and Cluster 7 join at height 0.7999 (into a 5 element cluster, #8)

but it might be ordered by linking distance, and not in a 1d ordering for visualization (because not everbody using linkage clustering will want to run dendrogram viusalization afterwards). But in any way, computing the dendrogram should be on the order of O(n log n) if you do need to sort, fairly cheap compared to the actual clustering.

Something along these lines should do the trick:

n = len(Z) + 1
cache = dict()
for k in range(len(Z)):
  c1, c2 = int(Z[k][0]), int(Z[k][1])
  c1 = [c1] if c1 < n else cache.pop(c1)
  c2 = [c2] if c2 < n else cache.pop(c2)
  cache[n+k] = c1 + c2
print cache[2*len(Z)]

This may appear to be linear, but the expected size of the arrays is log n, so depending on your list types it may still be O(n log n), while with linked lists it should indeed be doable in O(n).

But in the end, you might want to avoid hierarchical clustering. It is a popular introductory example to cluster analysis, because it is really easy to understand conceptually. There are some quite tricky algorithms (SLINK) to get it down to O(n^2) complexity. But there are more modern and powerful clustering algorithms that have lower complexity. Actually, OPTICS (Wikipedia) computes something quite similar (when you set minPts=2), and when you have a good index structure it will run in O(n log n). Plus you can increase minPts to get more meaningful clusters. (But do not use OPTICS in Weka, or that python version that is floating around - afaict they are both incomplete or buggy!)

Community
  • 1
  • 1
Has QUIT--Anony-Mousse
  • 76,138
  • 12
  • 138
  • 194
  • I'd also like to know, **how the ordering is created** _without scipy dendrogram_ function. No answer was given on this, neither here nor at the two linked questions. There is a paper, which triggered my interest. It should be even possible then, to create a dendrogram by OPTICS, isn't it? – embert Dec 13 '13 at 13:01
  • Yes, there are some papers that discuss the relationship of OPTICS plots to regular dendrograms. – Has QUIT--Anony-Mousse Dec 13 '13 at 13:14
  • I meant specifically [this](http://www.cse.msstate.edu/~niu/papers/PAKDD03.pdf). However, whats the priciple, the leaves are ordered by in the dendrogram? Is it done by walking through the cophenetic matrix somehow? – embert Dec 13 '13 at 14:10
  • Both dendrograms and OPTICS cluster order are essentially *serialized minimum spanning trees* (with different measures than point-to-point distances). – Has QUIT--Anony-Mousse Dec 13 '13 at 15:33