
So I am using fastcluster with SciPy to do agglomerative clustering. I can call dendrogram(Z) to plot the dendrogram for the clustering, and fcluster(Z, sqrt(D.max()), 'distance') gives a pretty good clustering for my data. But what if I want to manually inspect a region of the dendrogram where, say, k = 3 clusters, and then inspect k = 6 clusters? How do I get the clustering at a specific level of the dendrogram?

I see all these functions that take tolerances, but I don't understand how to convert from a tolerance to a number of clusters. I can build the clustering manually on a small data set by walking through the linkage matrix (Z) and piecing the clusters together step by step, but this is not practical for large data sets.

demongolem

3 Answers


If you want to cut the tree at a specific level, then use:

fl = fcluster(cl,numclust,criterion='maxclust')

where cl is the output of your linkage call and numclust is the number of clusters you want to get.
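As a rough end-to-end sketch of that call (the data set and the choice of ward linkage here are made up for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical data set, just for illustration.
np.random.seed(0)
X = np.random.randn(30, 2)

cl = linkage(X, method='ward')   # the output of your linkage method

# Cut the tree so that at most 3 clusters remain, then at most 6.
fl3 = fcluster(cl, 3, criterion='maxclust')
fl6 = fcluster(cl, 6, criterion='maxclust')
```

fl3 and fl6 are flat label arrays, one label per data point; with distinct merge heights (the usual case for continuous data) you get exactly 3 and 6 clusters.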

dkar
  • The thing that throws me in the description of fcluster is "and no more than t flat clusters are formed". So are there cases where you get fewer than `numclust`, and if so, what would they be? I know that my convoluted way won't give me fewer than the number of clusters I desire. – demongolem Jul 15 '13 at 18:10
  • @demongolem: It is always possible even for your algorithm to return fewer clusters than you have asked for, for example you have 2 data points and you ask for 3 clusters. I have used extensively `fcluster` and I'm not aware of cases where the routine returns fewer clusters under normal conditions. – dkar Jul 15 '13 at 18:48
  • True, not enough points would prevent your request from being fulfilled no matter what. I will just accept it as the way SciPy does business – demongolem Jul 15 '13 at 19:47
  • I suspect that this answer is a bit confusing. maxclust "cuts the tree at a specific level" so that we get at most t clusters. It does NOT cut the tree at a specific height (giving whatever clusters are formed below that height). For that, the criterion to use instead is 'distance'. – Tal Galili Mar 24 '19 at 12:29

Hierarchical clustering allows you to zoom in and out to get fine- or coarse-grained views of the clustering, so it might not be clear in advance which level of the dendrogram to cut. A simple solution is to get the cluster membership at every level. It is also possible to select the desired number of clusters directly.

import numpy as np
from scipy import cluster
np.random.seed(23)  # make the example reproducible
X = np.random.randn(20, 4)  # 20 points in 4 dimensions
Z = cluster.hierarchy.ward(X)
cutree_all = cluster.hierarchy.cut_tree(Z)  # membership at every level, one column per cut
cutree1 = cluster.hierarchy.cut_tree(Z, n_clusters=[5, 10])  # membership for 5 and 10 clusters
print("membership at all levels \n", cutree_all)
print("membership for 5 and 10 clusters \n", cutree1)
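To turn one of those results into a flat label vector, index into the array cut_tree returns: each column corresponds to one entry of n_clusters. A short sketch reusing the same example data:

```python
import numpy as np
from scipy import cluster

np.random.seed(23)
X = np.random.randn(20, 4)
Z = cluster.hierarchy.ward(X)

# One column per requested level: column 0 is the labeling for
# 5 clusters, column 1 for 10 clusters.
cutree1 = cluster.hierarchy.cut_tree(Z, n_clusters=[5, 10])
labels_k5 = cutree1[:, 0]
labels_k10 = cutree1[:, 1]
```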

Ok, so let me propose one way. I don't think it is the right or best way, but at least it is a start.

  1. Choose the k we are interested in.
  2. Note that the linkage matrix Z has N-1 rows, where N is the number of data points. After the merge recorded in row m (0-based), N-m-1 clusters remain, so the row corresponding to k clusters has index m = N-k-1.
  3. Grab the distance value, which is the 3rd column of that row.
  4. Call fcluster with that particular distance as the tolerance (or perhaps the distance plus some really small delta).

The only problem with this is ties, but really that is not a problem if you can detect that a tie has taken place.
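The steps above can be sketched as follows (a rough illustration on made-up data, assuming no tied merge heights):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical data, for illustration only.
np.random.seed(1)
X = np.random.randn(25, 3)
Z = linkage(X, method='ward')

k = 4           # step 1: the number of clusters we want
N = len(X)
# Step 2: row N-k-1 (0-based) of Z records the merge that leaves k clusters.
# Step 3: its distance sits in column 2.
height = Z[N - k - 1, 2]
# Step 4: fcluster's 'distance' criterion is inclusive (<= t), so the height
# itself works here; with tied heights you may need the small delta.
labels = fcluster(Z, height, criterion='distance')
```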

demongolem
  • Hey @demongolem, is there any way you could help out with this question that's kind of similar: http://stackoverflow.com/questions/36523789/cut-dendrogram-from-hier-clustering-at-distance-height-in-scipy-and-get-cluster – O.rka Apr 10 '16 at 07:25