I have a very large sparse matrix (a few million rows, 500 columns). I have already computed a 5000 x 5000 distance matrix, and I need to use scipy.cluster.hierarchy.linkage to cluster according to this matrix. I know that linkage accepts a custom metric function, but computing the distance matrix again is very time-consuming.
How can I tell scipy to use the distances from the matrix I already have? I tried:

dist = my_dist(X)  # numpy array, ndim = 2
linkage(X, metric=lambda x, y: dist[x, y])

but the x and y passed to the function are the row values, not the row indices.

DeanLa

1 Answer

You can pass the distance matrix to linkage if you represent it as a "condensed" distance matrix, that is, a 1-D array holding only the upper triangle of the square matrix. You can use scipy.spatial.distance.squareform to convert dist to the condensed representation.

Something like this:

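from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

dist = my_dist(X)                  # square, symmetric, zero diagonal
condensed_dist = squareform(dist)  # 1-D condensed form
linkresult = linkage(condensed_dist)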
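
To make the shapes concrete, here is a small self-contained sketch. The toy data, the use of pdist as a stand-in for my_dist, the "average" method, and the fcluster call are all assumptions for illustration and are not part of the original question:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform

# Toy stand-in for the real data: 6 observations, 3 features.
rng = np.random.default_rng(0)
X = rng.random((6, 3))

# Stand-in for the precomputed square distance matrix (assumption:
# in the question this would come from my_dist(X) instead).
dist = squareform(pdist(X))        # shape (6, 6)

# Square -> condensed: keeps only the n*(n-1)/2 upper-triangle entries.
condensed = squareform(dist)       # shape (15,)

# linkage treats a 1-D input as a condensed distance matrix.
Z = linkage(condensed, method="average")

# Optional: cut the tree into at most 2 flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
```

Two things to keep in mind. By default squareform checks that the square matrix is symmetric with a zero diagonal, so a matrix with floating-point asymmetry may need to be symmetrized first, for example with (dist + dist.T) / 2. Also, per the SciPy documentation the "centroid", "median" and "ward" methods are only correctly defined for Euclidean distances, so with an arbitrary precomputed metric it is safer to stay with "single", "complete", "average" or "weighted".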
Warren Weckesser