
I'm using sklearn.cluster.AgglomerativeClustering. It begins with one cluster per data point and iteratively merges together the two "closest" clusters, thus forming a binary tree. What constitutes distance between clusters depends on a linkage parameter.
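
For concreteness, a minimal sketch of the setup in question (the data and parameter choices here are illustrative):

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Illustrative toy data; linkage='ward' merges the pair of clusters
# that least increases the total within-cluster variance.
X, _ = make_blobs(n_samples=20, centers=3, random_state=0)
model = AgglomerativeClustering(n_clusters=3, linkage='ward').fit(X)
print(model.labels_)  # flat cluster label for each data point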

It would be useful to know the distance between the merged clusters at each step. We could then stop when the next to be merged clusters get too far apart. Alas, that does not seem to be available in AgglomerativeClustering.

Am I missing something? Is there a way to recover the distances?

Eduardo

Can you be more specific about what you mean by `distance` - i.e. do you simply mean the distance between the centroids of the corresponding clusters, or something different? – tttthomasssss Sep 04 '17 at 04:37

I don't want to be specific. It could be the distance between centroids, as you suggest, or the smallest distance between two points in separate clusters, as in single-link, or the resulting cluster diameter, or the increase in variance. The point is that any agglomerative clustering method merges the two "closest" clusters in each iteration. That "closeness" measure can be computed in different ways, but it has a definite, increasing value at each merge. It would be useful to know those values. – Eduardo Sep 04 '17 at 11:06

2 Answers


You might want to take a look at scipy.cluster.hierarchy, which offers somewhat more options than sklearn.cluster.AgglomerativeClustering.

The clustering is done with the linkage function, which returns a linkage matrix: each row records the two clusters merged at that step, the distance between them, and the size of the resulting cluster. The result can be visualised with a dendrogram:

from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Toy data: three well-separated Gaussian blobs.
X, cl = make_blobs(n_samples=20, n_features=2, centers=3, cluster_std=0.5, random_state=0)

# Agglomerative clustering with Ward linkage; Z is the linkage matrix.
Z = linkage(X, method='ward')

plt.figure()
dendrogram(Z)  # the height of each merge is its linkage distance
plt.show()

[dendrogram.png: dendrogram of the clustering; the height of each merge is its linkage distance]
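
The merge distances asked about in the question are directly available in this matrix: each row of Z has the form [idx_i, idx_j, distance, new_cluster_size], so the third column is the sequence of merge distances.

# Distance at which each successive pair of clusters was merged;
# non-decreasing for Ward linkage.
merge_distances = Z[:, 2]
print(merge_distances)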

One can form flat clusters from the linkage matrix based on various criteria, e.g. a maximum cophenetic distance between observations:

clusters = fcluster(Z, 5, criterion='distance')  # observations in each flat cluster are within cophenetic distance 5
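
Other criteria are available as well; for instance, to ask directly for a fixed number of flat clusters:

# Cut the tree so that at most 3 flat clusters are formed.
clusters = fcluster(Z, 3, criterion='maxclust')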

Scipy's hierarchical clustering is discussed in much more detail here.

σηγ

When this question was originally asked, and when the other answer was posted, sklearn did not expose the distances. It now does, however, as demonstrated in this example and this answer to a similar question.
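
A minimal sketch of the newer API (assuming a scikit-learn version recent enough to have these options; the distances_ attribute is populated when distance_threshold is set, or when compute_distances=True):

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=20, n_features=2, centers=3, cluster_std=0.5, random_state=0)

# distance_threshold=0 with n_clusters=None builds the full merge tree
# and makes the estimator record the merge distances.
model = AgglomerativeClustering(distance_threshold=0, n_clusters=None).fit(X)
print(model.distances_)  # distances_[i] is the distance at which children_[i] were merged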

erobertc