
I have a big csr_matrix (1M x 1K) and I want to add over rows and obtain a new csr_matrix with the same number of columns but a reduced number of rows. Actually my problem is exactly the same as in Sum over rows in scipy.sparse.csr_matrix. The only thing is that I find the accepted solution to be too slow for my purpose. Let me state what I have:

map_fn = np.random.randint(0, 10000, 1000000)

map_fn here tells me how my input rows (1M) are mapped into my output rows (10K): the i-th input row gets added into output row map_fn[i]. I tried the two approaches mentioned in the above question, namely forming a sparse matrix and using a sparse sum. Although the sparse-matrix approach looks way better than the sparse-sum approach, I still find it too slow for my purpose.
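To make the mapping concrete, here is a tiny made-up sketch (X_small and map_small are hypothetical toy inputs, not my real data):

import numpy as np
import scipy.sparse

X_small = scipy.sparse.csr_matrix(np.array([[1, 0],
                                            [0, 2],
                                            [3, 0],
                                            [0, 4]]))
map_small = np.array([0, 1, 0, 1])  # rows 0, 2 -> output row 0; rows 1, 3 -> output row 1
# Summing the input rows by map_small should give [[4, 0], [0, 6]]

And here is the code comparing the two approaches: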

import scipy.sparse
import numpy as np 
import time

print "Setting up input"
s=10000
n=1000000
d=1000
density=1.0/500

X=scipy.sparse.rand(n,d,density=density,format="csr")
map_fn=np.random.randint(0, s, n)

# Approach 1
start_time=time.time()
col = np.arange(n)
val = np.ones(n)
S = scipy.sparse.csr_matrix( (val, (map_fn, col)), shape = (s,n))
print "Approach 1 Creation time : ",time.time()-start_time
SX = S.dot(X)
print "Approach 1 Total time : ",time.time()-start_time

#Approach 2
start_time=time.time()
SX = np.zeros((s,X.shape[1]))
for i in range(SX.shape[0]):
    SX[i,:] = X[np.where(map_fn==i)[0],:].sum(axis=0)

print "Approach 2 Total time : ",time.time()-start_time

which gives the following numbers:

Approach 1 Creation time :  0.187678098679
Approach 1 Total time :  0.286989927292
Approach 2 Total time :  10.208632946

So my question is: is there a better way of doing this? I find forming the sparse matrix to be overkill, as it takes more than half of the total time. Are there any better alternatives? Any suggestions are greatly appreciated. Thanks

user1131274
  • Did you try Sparse-sum : https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.sparse.csr_matrix.sum.html? – Divakar Sep 28 '17 at 18:23
  • *Gasp! A whole 433ms to run through only ten billion items!* --- You might want to write your own subroutines (in a lower-level language?) for something like this. – SIGSTACKFAULT Sep 28 '17 at 18:26
  • @Divakar I tried it. Please see the edit. Thanks – user1131274 Sep 28 '17 at 20:15
  • It would be easier to test alternatives if you gave a full working example. I started to set things up, but gave up after the dimensions got too confusing. – hpaulj Sep 29 '17 at 03:12
  • @hpaulj Sorry for not being clear. I have added working example. Let me know if you have any questions. Thanks a lot. – user1131274 Sep 29 '17 at 13:06
  • That sparse `X` slicing is actually done with a matrix product. I describe that process in https://stackoverflow.com/questions/39500649/sparse-matrix-slicing-using-list-of-int. So your iteration is replacing one `S.dot` with `s` smaller ones. And the sparse `sum` is also a matrix product. – hpaulj Sep 29 '17 at 18:34
  • @hpaulj I see. Clearly this means approach 2 cannot be improved over approach 1. But does this mean that there is no better alternative as well? Because as I said, it's taking more than half of the time just to create S from map_fn. Can we use map_fn directly somehow? (One possibility is sketched below.) Thanks a lot. – user1131274 Sep 29 '17 at 19:11
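A minimal sketch of using map_fn more directly (an assumption, not a benchmarked result: it builds the selection matrix S straight in CSC format, skipping the COO-to-CSR conversion of Approach 1; it reuses s, n, X, and map_fn from the setup above):

import numpy as np
import scipy.sparse

# Column j of S has a single 1 at row map_fn[j], so the CSC buffers can
# be written down directly: indices = map_fn, indptr = [0, 1, ..., n].
val = np.ones(n)
S = scipy.sparse.csc_matrix((val, map_fn, np.arange(n + 1)), shape=(s, n))
SX = S.dot(X)  # same values as Approach 1; the result format may differ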

1 Answer


Starting approach

Adapting the sparse solution from this post -

def sparse_matrix_mult_sparseX_mod1(X, rows):
    nrows = rows.max()+1
    ncols = X.shape[1]
    nelem = nrows * ncols

    a,b = X.nonzero()
    # Encode (mapped row, column) pairs as column-major linear indices,
    # then let bincount do the grouped summation in a single pass.
    ids = rows[a] + b*nrows
    sums = np.bincount(ids, X[a,b].A1, minlength=nelem)
    # Undo the column-major encoding to get a (nrows, ncols) dense array.
    out = sums.reshape(ncols,-1).T
    return out

Benchmarking

Original approach #1 -

def app1(X, map_fn):
    n = X.shape[0]
    s = map_fn.max()+1
    col = np.arange(n)
    val = np.ones(n)
    # Selection matrix with S[map_fn[i], i] = 1, so S.dot(X) sums the rows.
    S = scipy.sparse.csr_matrix((val, (map_fn, col)), shape=(s, n))
    SX = S.dot(X)
    return SX

Timings and verification -

In [209]: # Inputs setup
     ...: s=10000
     ...: n=1000000
     ...: d=1000
     ...: density=1.0/500
     ...: 
     ...: X=scipy.sparse.rand(n,d,density=density,format="csr")
     ...: map_fn=np.random.randint(0, s, n)
     ...: 

In [210]: out1 = app1(X, map_fn)
     ...: out2 = sparse_matrix_mult_sparseX_mod1(X, map_fn)
     ...: print np.allclose(out1.toarray(), out2)
     ...: 
True

In [211]: %timeit app1(X, map_fn)
1 loop, best of 3: 517 ms per loop

In [212]: %timeit sparse_matrix_mult_sparseX_mod1(X, map_fn)
10 loops, best of 3: 147 ms per loop

To be fair, we should time the final dense array version from app1 -

In [214]: %timeit app1(X, map_fn).toarray()
1 loop, best of 3: 584 ms per loop

Porting to Numba

We could translate the binned-counting step to Numba, which might be beneficial for denser input matrices. One way to do so would be -

from numba import njit

@njit
def bincount_mod2(out, rows, r, C, V):
    # Scatter-add every stored value into its mapped output row.
    N = len(V)
    for i in range(N):
        out[rows[r[i]], C[i]] += V[i]
    return out

def sparse_matrix_mult_sparseX_mod2(X, rows):
    nrows = rows.max()+1
    ncols = X.shape[1]
    r,C = X.nonzero()

    V = X[r,C].A1
    out = np.zeros((nrows, ncols))
    return bincount_mod2(out, rows, r, C, V)
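
A further variant (a sketch, not part of the original answer, and not timed against the runs below; it assumes X is a CSR matrix in canonical form) would skip X.nonzero() and the fancy indexing X[r,C].A1 and read the CSR buffers directly:

def sparse_matrix_mult_sparseX_csrbuf(X, rows):
    nrows = rows.max()+1
    ncols = X.shape[1]
    # Recover the row index of every stored element from indptr,
    # then reuse the same linear-index + bincount trick as mod1.
    r = np.repeat(np.arange(X.shape[0]), np.diff(X.indptr))
    ids = rows[r] + X.indices*nrows
    sums = np.bincount(ids, X.data, minlength=nrows*ncols)
    return sums.reshape(ncols,-1).T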

Timings -

In [373]: # Inputs setup
     ...: s=10000
     ...: n=1000000
     ...: d=1000
     ...: density=1.0/100 # Denser now!
     ...: 
     ...: X=scipy.sparse.rand(n,d,density=density,format="csr")
     ...: map_fn=np.random.randint(0, s, n)
     ...: 

In [374]: %timeit app1(X, map_fn)
1 loop, best of 3: 787 ms per loop

In [375]: %timeit sparse_matrix_mult_sparseX_mod1(X, map_fn)
1 loop, best of 3: 906 ms per loop

In [376]: %timeit sparse_matrix_mult_sparseX_mod2(X, map_fn)
1 loop, best of 3: 705 ms per loop

With the dense output from app1 -

In [379]: %timeit app1(X, map_fn).toarray()
1 loop, best of 3: 910 ms per loop
Divakar