
I need to calculate the distances between two sets of vectors, source_matrix and target_matrix.

I call the following line, where both source_matrix and target_matrix are of type scipy.sparse.csr.csr_matrix:

distances = sp.spatial.distance.cdist(source_matrix, target_matrix)

It raises an exception with the following (partial) traceback:

  File "/usr/local/lib/python2.7/site-packages/scipy/spatial/distance.py", line 2060, in cdist
    [XA] = _copy_arrays_if_base_present([_convert_to_double(XA)])
  File "/usr/local/lib/python2.7/site-packages/scipy/spatial/distance.py", line 146, in _convert_to_double
    X = X.astype(np.double)
ValueError: setting an array element with a sequence.

This seems to indicate that the sparse matrices are being treated as dense numpy matrices, which both fails and defeats the purpose of using sparse matrices.

Any advice?

NirIzr
  • `cdist` expects its arguments to be numpy arrays. It does not handle scipy's sparse matrices. – Warren Weckesser Oct 04 '16 at 03:17
  • @WarrenWeckesser Is there a sparse-friendly alternative to `cdist` then? – NirIzr Oct 04 '16 at 03:24
  • @NirIzr Could you please include a portion of your `source` and `target` matrix? – kmario23 Oct 04 '16 at 04:21
  • Check out http://stackoverflow.com/questions/36557472/calculate-the-euclidean-distance-in-scipy-csr-matrix - it talks about sparse, and distance.cdist. – hpaulj Oct 04 '16 at 04:42
  • @NirIzr: See [sklearn.metrics.pairwise.pairwise_distances](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances.html) – Warren Weckesser Oct 04 '16 at 04:50

1 Answer


I appreciate this post is quite old, but as one of the comments suggested, you could use the sklearn implementation, which accepts sparse vectors and matrices.

Take two random vectors, for example:

import scipy.sparse
import sklearn.metrics

# Two random sparse row vectors with 100 columns each
a = scipy.sparse.rand(m=1, n=100, density=0.2, format='csr')
b = scipy.sparse.rand(m=1, n=100, density=0.2, format='csr')
sklearn.metrics.pairwise.pairwise_distances(X=a, Y=b, metric='euclidean')
# example output: array([[ 3.14837228]])

This also works when a is a matrix and b is a single row vector:

a = scipy.sparse.rand(m=500,n=100,density=0.2,format='csr')
b = scipy.sparse.rand(m=1,n=100,density=0.2,format='csr')
sklearn.metrics.pairwise.pairwise_distances(X=a, Y=b, metric='euclidean')
# example output:
# array([[ 2.9864606 ],
#        [ 3.33862248],
#        [ 3.45803465],
#        [ 3.15453179],
#        ...
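As a sanity check, the sparse result can be compared against `cdist` run on densified copies of the same data. This is only a sketch: the seeded `RandomState` and the names `sparse_result` and `dense_result` are my own choices for illustration, and converting with `.toarray()` is only practical when the matrices are small enough to fit in memory.

```python
import numpy as np
import scipy.sparse
from scipy.spatial.distance import cdist
from sklearn.metrics import pairwise_distances

# Seeded generator so the two matrices are reproducible (illustrative choice)
rng = np.random.RandomState(0)
a = scipy.sparse.random(500, 100, density=0.2, format='csr', random_state=rng)
b = scipy.sparse.random(1, 100, density=0.2, format='csr', random_state=rng)

# sklearn accepts the sparse inputs directly
sparse_result = pairwise_distances(X=a, Y=b, metric='euclidean')

# cdist needs dense arrays, so densify first (small inputs only)
dense_result = cdist(a.toarray(), b.toarray(), metric='euclidean')

print(sparse_result.shape)  # one row per row of a, one column per row of b
print(np.allclose(sparse_result, dense_result))
```

Note that the returned distance matrix is itself a dense array, so its size grows with the number of rows in both inputs even when the inputs are sparse.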

scipy.spatial.distance does not support sparse matrices, so sklearn is the better choice here. You can also pass the n_jobs argument to sklearn.metrics.pairwise.pairwise_distances, which splits the computation across parallel jobs and can help if your matrices are very large.
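A minimal sketch of the n_jobs usage follows. The matrix sizes, seeds, and `n_jobs=2` are arbitrary values picked for illustration, and whether parallelism actually speeds things up depends on your data and machine, so it is worth timing both variants.

```python
import numpy as np
import scipy.sparse
from sklearn.metrics import pairwise_distances

# Illustrative sparse inputs; the shapes and densities here are arbitrary
source_matrix = scipy.sparse.random(200, 300, density=0.05, format='csr', random_state=0)
target_matrix = scipy.sparse.random(150, 300, density=0.05, format='csr', random_state=1)

# Single-job computation
serial = pairwise_distances(source_matrix, target_matrix, metric='euclidean')

# Same computation split across 2 parallel jobs
parallel = pairwise_distances(source_matrix, target_matrix, metric='euclidean', n_jobs=2)

print(serial.shape)
print(np.allclose(serial, parallel))
```

Both calls return the same (n_source, n_target) array; n_jobs only changes how the work is divided.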

Hope that helps

PyRsquared
  • I recall using pairwise_distances didn't turn out too well for me, but can't really say why. Therefore, I'm accepting but not upvoting this answer in hopes it'll be upvoted by users finding it helpful. – NirIzr Oct 26 '17 at 15:18
  • Checked using `timeit`, `cdist` is over twice as fast as `pairwise_distances`. – EZLearner Aug 02 '19 at 19:43