Efficient incremental sparse matrix in python / scipy / numpy

Question

Is there a way in Python to have an efficient incremental update of sparse matrix?

 H = lil_matrix((n,m))
 for (i,j) in zip(A,B):
   h(i,j) += compute_something

It seems that such a way to build a sparse matrix is quite slow (lil_matrix is the fastest sparse matrix type for that).

Is there a way (like using dict of dict or other kind of approaches) to efficiently build the sparse matrix H?

The size of the sparse matrix is about 100000 x 100000. So dim(A) can be close to 10^6. — francois rousseau, Mar 03 '16 at 13:40

score 4 · Answer 1 · edited May 23 '17 at 12:16

In https://stackoverflow.com/a/27771335/901925 I explore incremental matrix assignment.

lol and dok are the recommended formats if you want to change values. csr will give you an efficiency warning, and coo does not allow indexing.

But I also found that dok indexing is slow compared to regular dictionary indexing. So for many changes it is better to build a plain dictionary (with the same tuple indexing), and build the dok matrix from that.

But if you can calculate the H data values with a fast numpy vector operation, as opposed to iteration, it is best to do so, and construct the sparse matrix from that (e.g. coo format). In fact even with iteration this would be faster:

 h = np.zeros(A.shape)
 for k, (i,j) in enumerate(zip(A,B)):
    h[k] = compute_something 
 H = sparse.coo_matrix((h, (A, B)), shape=(n,m))

e.g.

In [780]: A=np.array([0,1,1,2]); B=np.array([0,2,2,1])
In [781]: h=np.zeros(A.shape)
In [782]: for k, (i,j) in enumerate(zip(A,B)):
    h[k] = i+j+k
   .....:     
In [783]: h
Out[783]: array([ 0.,  4.,  5.,  6.])
In [784]: M=sparse.coo_matrix((h,(A,B)),shape=(4,4))
In [785]: M
Out[785]: 
<4x4 sparse matrix of type '<class 'numpy.float64'>'
    with 4 stored elements in COOrdinate format>
In [786]: M.A
Out[786]: 
array([[ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  9.,  0.],
       [ 0.,  6.,  0.,  0.],
       [ 0.,  0.,  0.,  0.]])

Note that the (1,2) value is the sum 4+5. That's part of the coo to csr conversion.

In this case I could have calculated h with:

In [791]: A+B+np.arange(A.shape[0])
Out[791]: array([0, 4, 5, 6])

so there's no need for iteration.

score 1 · Answer 2 · answered Mar 03 '16 at 14:06

1

Nope, do not use csr_matrix or csc_matrix, as they are going to be even more slower than lil_matrix, if you construct them incrementally. The Dictionary of Key based sparse matrix is exactly what you are looking for

from scipy.sparse import dok_matrix
S = dok_matrix((5, 5), dtype=np.float32)
for i in range(5):
    for j in range(5):
        S[i,j] = i+j    # Update elements

answered Mar 03 '16 at 14:06

romeric

2,325
3
19
35

I found in other SO questions that indexing a plain dictionary (same keys) is faster than indexing a dok dictionary. – hpaulj Mar 03 '16 at 14:32
I've tested lil, csc, csr and dok matrices. lil_matrix was the fastest by far. – francois rousseau Mar 03 '16 at 16:47
1

I believe that. If you look at the implementation of scipy's dok format, it is essentially a wrapper over dict. Moreover, the __getitem__ and __setitem__ methods of dok are written in pure Python so indexing them will introduce more overhead. However, compared to other sparse formats, dok is relatively cheaper to iterate over as item access is algorithmically still O(1)/constant. Honestly, I think sparse matrices are not really meant to be iterated over. – romeric Mar 03 '16 at 21:53

score 0 · Answer 3 · answered Mar 03 '16 at 14:06

0

A faster way would be:

H_ij = compute_something_vectorized()
H = coo_matrix((H_ij, (A, B))).tocsr()

The data for duplicate coordinates are then summed, see the docs for coo_matrix.

answered Mar 03 '16 at 14:06

This should be best IF the vectorized function can be made. – hpaulj Mar 03 '16 at 14:36
I'd like to vectorize this function but I'm not sure I can ! – francois rousseau Mar 03 '16 at 16:47

Efficient incremental sparse matrix in python / scipy / numpy

3 Answers3

Linked