3

I have an extremely sparse structured matrix. My matrix has exactly one non zero entry per column. But its huge(10k*1M) and given in following form(uisng random values for example)

rows = np.random.randint(0, 10000, 1000000)
values = np.random.randint(0,10,1000000)

where rows gives us the row number for nonzero entry in each column. I want fast matrix multiplication with S and I am doing following right now - I convert this form to a sparse matrix(S) and do S.dot(X) for multiplication with matrix X(which can be sparse or dense).

S=scipy.sparse.csr_matrix( (values, (rows, scipy.arange(1000000))), shape = (10000,1000000))

For X of size 1M * 2500 and nnz(X)=8M this takes 178ms to create S and 255 ms to apply it. So my question is this what is the best way of doing SX (where X could be sparse or dense) given my S is as described. Since creating S is itself very time consuming I was thinking of something adhoc. I did try creating something using loops but its not even close.
Simple looping procedure looks something like this

SX = np.zeros((rows.size,X.shape[1])) for i in range(X.shape[0]): SX[rows[i],:]+=values[i]*X[i,:] return SX
Can we make this efficient?

Any suggestions are greatly appreciated. Thanks

user1131274
  • 493
  • 6
  • 20
  • Those timings are very good considering how large your matrix is. I don't think you can squeeze any more performance other than resorting to using a completely different framework. – rayryeng Sep 27 '17 at 20:36
  • @rayryeng I kind of find it slow because for example doing X^TX in the example I gave is taking 450ms which is equal to time taken in doing SA. Given that doing SA is O(mn) and doing X^TX is O(mn^2) when X is m*n, I find doing SA not fast enough. Also I know I have not taken sparsity of X into account but I have similar number for dense X. Thanks – user1131274 Sep 27 '17 at 20:46
  • With only one value per column, you might do better by using `rows` to index `X`, and then a dense `dot` with `values`. – hpaulj Sep 27 '17 at 20:49

3 Answers3

3

Approach #1

Given that there's exactly one entry per column in the first input, we could use np.bincount using inputs - rows, values and X and thus also avoids creating sparse matrix S -

def sparse_matrix_mult(rows, values, X):
    nrows = rows.max()+1
    ncols = X.shape[1]
    nelem = nrows * ncols

    ids = rows + nrows*np.arange(ncols)[:,None]
    sums = np.bincount(ids.ravel(), (X.T*values).ravel(), minlength=nelem)
    out = sums.reshape(ncols,-1).T
    return out

Sample run -

In [746]: import numpy as np
     ...: from scipy.sparse import csr_matrix
     ...: import scipy as sp
     ...: 

In [747]: np.random.seed(1234)
     ...: m,n = 3,4
     ...: rows = np.random.randint(0, m, n)
     ...: values = np.random.randint(2,10,n)
     ...: X = np.random.randint(2, 10, (n,5))
     ...: 

In [748]: S = csr_matrix( (values, (rows, sp.arange(n))), shape = (m,n))

In [749]: S.dot(X)
Out[749]: 
array([[42, 27, 45, 78, 87],
       [24, 18, 18, 12, 24],
       [18,  6,  8, 16, 10]])

In [750]: sparse_matrix_mult(rows, values, X)
Out[750]: 
array([[ 42.,  27.,  45.,  78.,  87.],
       [ 24.,  18.,  18.,  12.,  24.],
       [ 18.,   6.,   8.,  16.,  10.]])

Approach #2

Using np.add.reduceat to replace np.bincount -

def sparse_matrix_mult_v2(rows, values, X):
    nrows = rows.max()+1
    ncols = X.shape[1]

    scaled_ar = X*values[:,None]
    sidx = rows.argsort()
    rows_s = rows[sidx]
    cut_idx = np.concatenate(([0],np.flatnonzero(rows_s[1:] != rows_s[:-1])+1))
    sums = np.add.reduceat(scaled_ar[sidx],cut_idx,axis=0)

    out = np.empty((nrows, ncols),dtype=sums.dtype)
    row_idx = rows_s[cut_idx]
    out[row_idx] = sums
    return out

Runtime test

I could not run it on the sizes mentioned in the question, as those were too big for me to handle. So, on reduced datasets, here's what I am getting -

In [149]: m,n = 1000, 100000
     ...: rows = np.random.randint(0, m, n)
     ...: values = np.random.randint(2,10,n)
     ...: X = np.random.randint(2, 10, (n,2500))
     ...: 

In [150]: S = csr_matrix( (values, (rows, sp.arange(n))), shape = (m,n))

In [151]: %timeit csr_matrix( (values, (rows, sp.arange(n))), shape = (m,n))
100 loops, best of 3: 16.1 ms per loop

In [152]: %timeit S.dot(X)
1 loop, best of 3: 193 ms per loop

In [153]: %timeit sparse_matrix_mult(rows, values, X)
1 loop, best of 3: 4.4 s per loop

In [154]: %timeit sparse_matrix_mult_v2(rows, values, X)
1 loop, best of 3: 2.81 s per loop

So, the proposed methods don't seem to over-power numpy.dot on performance, but they should be good on memory efficiency.


For sparse X

For sparse X, we need some modifications, as listed in the modified method listed below -

from scipy.sparse import find
def sparse_matrix_mult_sparseX(rows, values, Xs): # Xs is sparse    
    nrows = rows.max()+1
    ncols = Xs.shape[1]
    nelem = nrows * ncols

    scaled_vals = Xs.multiply(values[:,None])
    r,c,v = find(scaled_vals)
    ids = rows[r] + c*nrows
    sums = np.bincount(ids, v, minlength=nelem)
    out = sums.reshape(ncols,-1).T
    return out
Divakar
  • 218,885
  • 19
  • 262
  • 358
  • Thanks. I will try them out. Just one thing X.T*values seems to be doing dot product instead of multiply when X is sparse. – user1131274 Sep 27 '17 at 22:15
  • @user1131274 The proposed methods assume `X` as dense. If you are working with sparse, one way would be to convert to dense and then multiply with `values`. So, `X.toarray().T*values` and so on. – Divakar Sep 27 '17 at 22:21
  • X is very large to make it dense. Also whats your XtX time. The reason I am asking this is because my SX time is same as XtX which I find strange since XtX is O(nd^2) and SX is O(nd) in theory where d=2500 in our case. Thats why I feel there should be better way. Thanks for your effort. – user1131274 Sep 27 '17 at 22:32
  • @user1131274 Also, what's `XtX`? – Divakar Sep 28 '17 at 04:52
  • XtX is (X^T)*X. Also I have edited, adding what I had in mind. I think we can parallelize my approach on output row index. Because if we look at it all we have to do is partition the rows of X into rows of output matrix. Can you please look at my approach and tell me if something could be done there. I don't know python parallelization libraries. Thanks – user1131274 Sep 28 '17 at 16:54
  • @user1131274 I am sort of lost here. Why are we doing `(X^T)*X` again? I thought we were interested in `S.dot(X)` equivalent result. – Divakar Sep 28 '17 at 17:52
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/155543/discussion-between-user1131274-and-divakar). – user1131274 Sep 28 '17 at 17:55
1

Inspired by this post Fastest way to sum over rows of sparse matrix, I have found the best way to do this is to write loops and port things to numba. Here is the

`

@njit
def sparse_mul(SX,row,col,data,values,row_map):
    N = len(data)
    for idx in range(N):
        SX[row_map[row[idx]],col[idx]]+=data[idx]*values[row[idx]]
    return SX
X_coo=X.tocoo()
s=row_map.max()+1
SX = np.zeros((s,X.shape[1]))
sparse_mul(SX,X_coo.row,X_coo.col,X_coo.data,values,row_map)`

Here row_map is the rows in the question. On X of size (1M* 1K), 1% sparsity and with s=10K this performs twice as good as forming sparse matrix from row_map and doing S.dot(A).

user1131274
  • 493
  • 6
  • 20
0

As I recall, Knuth TAOP talks about representing a sparse matrix instead as a linked list of (for your app) non-zero values. Maybe something like that? Then traverse the linked list rather than the entire array by each dimension.

Mark Diaz
  • 193
  • 1
  • 10
  • 1
    That probably will not help. Sparse matrix representation and sparse matrix multiplication in `scipy` is already very well optimized so using an unoptimized representation of a sparse matrix to do the multiplication will probably take longer than what is currently being benchmarked. (FYI, I did not downvote). – rayryeng Sep 27 '17 at 20:34
  • And no worries about any down votes - if that is what the answer deserves. No one who is a successful programmer thinks ego is more important than reality. – Mark Diaz Sep 27 '17 at 23:27