0

for a project, I need an efficient function in python that solves to following task:

Given a very large List X of long sparse Vectors (=> big sparse Matrix) and another Matrix Y that contains a single Vector y, I want a List of "distances", that y has to every Element of X. Hereby the "distance" is defined like this:

Compare each Element of the two Vectors, always take the lower one and sum them up.

Example:

X = [[0,0,2],   
     [1,0,0],
     [3,1,0]]

Y = [[1,0,2]]

The function should return dist = [2,1,1]

In my project, both X and Y contain a lot of zeros and come in as an instance of:

<class 'scipy.sparse.csr.csr_matrix'>

So far so good and I managed to write a functions that solves this task, but is very slow and horrible inefficient. I need some tips on how to efficienty process/iterate the sparse Matrices. This is my function:

def get_distances(X, Y):
   Ret=[]
   rows, cols = X.shape  

   for i in range(0,rows):
       dist = 0                
       sample = X.getrow(i).todense()
       test = Y.getrow(0).todense()    
       rows_s, cols_s = sample.shape     
       rows_t, cols_t = test.shape 

       for s,t in zip(range(0, cols_s), range(0, cols_t)):
           dist += min(sample[0,s], test[0,t])

       X_ret.append([dist])    

   return ret

To do my Operations, I convert the sparse matrices to dense matrices which is of course horrible, but I did not know how to do it better. Do you know how to improve my code and make the function faster?

Thank you a lot!

  • Not only are you converting the sparse to dense, you are iterating on the dense matrices. Can you step back and solve this with dense arrays, using array operations? Once that's working it should be easier to translate the action to sparse matrices. – hpaulj May 05 '16 at 19:15
  • When I run your code, corrected for `ret`, I get `[[2], [0], [0]]`; also your function does not use `Y` at all. – hpaulj May 05 '16 at 19:40
  • Sorry, I mistyped at the declaration of the varibale "test". I corrected it –  May 05 '16 at 20:24

2 Answers2

0

I revised your function and ran it in

import numpy as np
from scipy import sparse

def get_distances(X, Y):
   ret=[]
   for row in X:            
       sample = row.A
       test = Y.getrow(0).A   
       dist = np.minimum(sample[0,:], test[0,:]).sum()
       ret.append(dist)    
   return ret

X = [[0,0,2],   
     [1,0,0],
     [3,1,0]]

Y = [[1,0,2]]

XM = sparse.csr_matrix(X)
YM = sparse.csr_matrix(Y)

print( get_distances(XM,YM))

print (np.minimum(XM.A, YM.A).sum(axis=1))

producing

1255:~/mypy$ python3 stack37056258.py 
[2, 1, 1]
[2 1 1]

np.minimum takes element wise minimum of two arrays (may be 2d), so I don't need to iterate on columns. I also don't need to use indexing.

minimum is also implemented for sparse matrices, but I get a segmenation error when I try to apply it to your X (with 3 rows) and Y (with 1). If they are the same size this works:

Ys = sparse.vstack((YM,YM,YM))
print(Ys.shape)
print (XM.minimum(Ys).sum(axis=1))

Converting the single row matrix to an array also gets around the error - because it ends up using the dense version, np.minimum(XM.todense(), YM.A).

print (XM.minimum(YM.A).sum(axis=1))

When I try other element by element operations on these 2 matrices I get ValueError: inconsistent shapes, e.g. XM+YM, or XM<YM. Looks like sparse does not implement broadcasting as numpy arrays does.

=======================

Comparison of ways of replicating a 1 row sparse matrix many times

In [271]: A=sparse.csr_matrix([0,1,0,0,1])

In [272]: timeit sparse.vstack([A]*3000).A
10 loops, best of 3: 32.3 ms per loop

In [273]: timeit sparse.kron(A,np.ones((3000,1),int)).A
1000 loops, best of 3: 1.27 ms per loop

For many times, kron is better than vstack.

=======================

There's an overlap in issues with Scipy sparse matrix alternative for getrow()

Community
  • 1
  • 1
hpaulj
  • 221,503
  • 14
  • 230
  • 353
  • Thank you. This helped me a lot. Do you think that stacking the vector and using the minimum-function on sparse matrices is more efficient than the first solution you posted? –  May 06 '16 at 19:08
  • I don't know. It needs to be timed on realistic size matrices. Have you tried the `XM.minimum(YM)` calculation? Do you get a segmentation error like I do? I'm thinking of filing a bug report. – hpaulj May 06 '16 at 19:34
0

Try below code for sparse matrix:

from scipy.sparse import csr_matrix, vstack
X = csr_matrix([[0,0,2],[1,0,0],[3,1,0]])
Y = csr_matrix([[1,0,2]])
def matrix_dist(x,y):
    y=vstack([y]*x.shape[1])
    return (((x+y)-(x-y).multiply((x-y).sign())).sum(1)/2).A.ravel()
Robin Hoo
  • 1
  • 1
  • 1