Python - Efficient Function with scipy sparse Matrices

Question

for a project, I need an efficient function in python that solves to following task:

Given a very large List X of long sparse Vectors (=> big sparse Matrix) and another Matrix Y that contains a single Vector y, I want a List of "distances", that y has to every Element of X. Hereby the "distance" is defined like this:

Compare each Element of the two Vectors, always take the lower one and sum them up.

Example:

X = [[0,0,2],   
     [1,0,0],
     [3,1,0]]

Y = [[1,0,2]]

The function should return dist = [2,1,1]

In my project, both X and Y contain a lot of zeros and come in as an instance of:

<class 'scipy.sparse.csr.csr_matrix'>

So far so good and I managed to write a functions that solves this task, but is very slow and horrible inefficient. I need some tips on how to efficienty process/iterate the sparse Matrices. This is my function:

def get_distances(X, Y):
   Ret=[]
   rows, cols = X.shape  

   for i in range(0,rows):
       dist = 0                
       sample = X.getrow(i).todense()
       test = Y.getrow(0).todense()    
       rows_s, cols_s = sample.shape     
       rows_t, cols_t = test.shape 

       for s,t in zip(range(0, cols_s), range(0, cols_t)):
           dist += min(sample[0,s], test[0,t])

       X_ret.append([dist])    

   return ret

To do my Operations, I convert the sparse matrices to dense matrices which is of course horrible, but I did not know how to do it better. Do you know how to improve my code and make the function faster?

Thank you a lot!

Not only are you converting the sparse to dense, you are iterating on the dense matrices. Can you step back and solve this with dense arrays, using array operations? Once that's working it should be easier to translate the action to sparse matrices. — hpaulj, May 05 '16 at 19:15
When I run your code, corrected for `ret`, I get `[[2], [0], [0]]`; also your function does not use `Y` at all. — hpaulj, May 05 '16 at 19:40
Sorry, I mistyped at the declaration of the varibale "test". I corrected it — , May 05 '16 at 20:24

score 0 · Accepted Answer · edited May 23 '17 at 12:16

I revised your function and ran it in

import numpy as np
from scipy import sparse

def get_distances(X, Y):
   ret=[]
   for row in X:            
       sample = row.A
       test = Y.getrow(0).A   
       dist = np.minimum(sample[0,:], test[0,:]).sum()
       ret.append(dist)    
   return ret

X = [[0,0,2],   
     [1,0,0],
     [3,1,0]]

Y = [[1,0,2]]

XM = sparse.csr_matrix(X)
YM = sparse.csr_matrix(Y)

print( get_distances(XM,YM))

print (np.minimum(XM.A, YM.A).sum(axis=1))

producing

1255:~/mypy$ python3 stack37056258.py 
[2, 1, 1]
[2 1 1]

np.minimum takes element wise minimum of two arrays (may be 2d), so I don't need to iterate on columns. I also don't need to use indexing.

minimum is also implemented for sparse matrices, but I get a segmenation error when I try to apply it to your X (with 3 rows) and Y (with 1). If they are the same size this works:

Ys = sparse.vstack((YM,YM,YM))
print(Ys.shape)
print (XM.minimum(Ys).sum(axis=1))

Converting the single row matrix to an array also gets around the error - because it ends up using the dense version, np.minimum(XM.todense(), YM.A).

print (XM.minimum(YM.A).sum(axis=1))

When I try other element by element operations on these 2 matrices I get ValueError: inconsistent shapes, e.g. XM+YM, or XM<YM. Looks like sparse does not implement broadcasting as numpy arrays does.

=======================

Comparison of ways of replicating a 1 row sparse matrix many times

In [271]: A=sparse.csr_matrix([0,1,0,0,1])

In [272]: timeit sparse.vstack([A]*3000).A
10 loops, best of 3: 32.3 ms per loop

In [273]: timeit sparse.kron(A,np.ones((3000,1),int)).A
1000 loops, best of 3: 1.27 ms per loop

For many times, kron is better than vstack.

=======================

There's an overlap in issues with Scipy sparse matrix alternative for getrow()

Thank you. This helped me a lot. Do you think that stacking the vector and using the minimum-function on sparse matrices is more efficient than the first solution you posted? — , May 06 '16 at 19:08
I don't know. It needs to be timed on realistic size matrices. Have you tried the `XM.minimum(YM)` calculation? Do you get a segmentation error like I do? I'm thinking of filing a bug report. — hpaulj, May 06 '16 at 19:34

Robin Hoo · Answer 2 · 2018-01-12T12:14:47.283

0

Try below code for sparse matrix:

from scipy.sparse import csr_matrix, vstack
X = csr_matrix([[0,0,2],[1,0,0],[3,1,0]])
Y = csr_matrix([[1,0,2]])
def matrix_dist(x,y):
    y=vstack([y]*x.shape[1])
    return (((x+y)-(x-y).multiply((x-y).sign())).sum(1)/2).A.ravel()

edited Jan 12 '18 at 12:14

answered Jan 12 '18 at 07:24

Robin Hoo

1
1
1

vstack is from scipy.sparse not numpy – Robin Hoo Jan 12 '18 at 07:26

Python - Efficient Function with scipy sparse Matrices

2 Answers2

Linked