25

I'd like to cluster points given to a custom distance and strangely, it seems that neither scipy nor sklearn clustering methods allow the specification of a distance function.

For instance, in sklearn.cluster.AgglomerativeClustering, the only thing I may do is enter an affinity matrix (which will be very memory-heavy). In order to build this very matrix, it is recommended to use sklearn.neighbors.kneighbors_graph, but I don't understand how I can specify a distance function either between two points. Could someone enlighten me?

ali_m
  • 71,714
  • 23
  • 223
  • 298
Mark Morrisson
  • 2,543
  • 4
  • 19
  • 25

3 Answers3

27

All of the scipy hierarchical clustering routines will accept a custom distance function that accepts two 1D vectors specifying a pair of points and returns a scalar. For example, using fclusterdata:

import numpy as np
from scipy.cluster.hierarchy import fclusterdata

# a custom function that just computes Euclidean distance
def mydist(p1, p2):
    diff = p1 - p2
    return np.vdot(diff, diff) ** 0.5

X = np.random.randn(100, 2)

fclust1 = fclusterdata(X, 1.0, metric=mydist)
fclust2 = fclusterdata(X, 1.0, metric='euclidean')

print(np.allclose(fclust1, fclust2))
# True

Valid inputs for the metric= kwarg are the same as for scipy.spatial.distance.pdist.

ali_m
  • 71,714
  • 23
  • 223
  • 298
  • 1
    Thanks for the tip. I've tried fclusterdata but it failed since it starts by converting the input array to doubles while my array is structured (it contains strings). How could I deal with it? – Mark Morrisson Nov 16 '15 at 13:08
  • Could you post some example data? – ali_m Nov 16 '15 at 13:23
  • 1
    Sure: [(b'FOO', b'67482', 13167), ..., (b'BAR', b'32798', 1369)]. But I was thinking, one way to get around the issue would be to run the clustering method on the indices (i.e. 1,...,n) and use these indices within the distance function to fetch the right data in the initial array. By the way, I don't need the clustering to be hierachical, so I may use a k-means method instead of fclusterdata. – Mark Morrisson Nov 16 '15 at 13:37
  • 1
    I followed my idea and it's working now! The function fclusterdata was adequate because it returns a flat cluster, exactly what I needed. Thank you for your help. – Mark Morrisson Nov 16 '15 at 18:22
  • The example seems to have issue. I tried to run it but `fclust1` and `fclust2` are all ones. Need to set proper threshold. – stanleyli Jun 23 '16 at 04:04
  • @stanleyli That is the expected result, but you're right that setting the threshold lower would give a more realistic example. – ali_m Jun 23 '16 at 08:57
  • @Mark Morrisson, I run into the same situations. Could you show how exactly you made it? Thanks in advance – Diansheng Jul 27 '18 at 03:37
5

sklearn has DBSCAN which allows for precomputed distance matrices (using a triangular matrix where M_ij is the distance between i and j). But this may not be the type of clustering you are looking for.

Additionally, as someone else mentioned, scipy.cluster.hierarchy.fclusterdata also allows precomputed distance metrics. There is a snippet of code given in this reply that gives a bit of code to turn a NxN matrix of distances into a format that fclusterdata can easily read:

import scipy.spatial.distance as ssd
# convert the redundant n*n square matrix form into a condensed nC2 array
    distArray = ssd.squareform(distMatrix) # distArray[{n choose 2}-{n-i choose 2} + (j-i-1)] is the distance between points i and j
Community
  • 1
  • 1
samus
  • 147
  • 2
  • 3
2

For hierarchical clustering, scipy.cluster.hierarchy.fclusterdata allows you to use any of the distance metrics included in the list here via the metric= keyword argument, provided it works with the linkage method you want.

Adam Acosta
  • 603
  • 3
  • 6
  • No, I must define my own distance function (actually by calling geopy). – Mark Morrisson Nov 15 '15 at 16:52
  • Oh, misunderstood. You can do that by building the kneighbors_graph like it's telling you above, but specifying a user-defined metric using `metric=DistanceMetric.get_metric('pyfunc', name_of_my_distance_function)`, by importing the `DistanceMetric` class and writing your own function that computes a distance, provided it's a valid metric. – Adam Acosta Nov 15 '15 at 17:29
  • Thanks for your tip about how to use the distance function but I confess I really don't understand what kneighbors_graph does. I don't even understand the output of the example given in the documentation. I'm not familiar with the use of graphs in that context. For me, it's a matrix of distances between all points that should be provided to a clustering algorithm (or, better, a way to give the algorithm itself a distance function). – Mark Morrisson Nov 15 '15 at 22:31