
I'm wondering what the best way is to iterate nonzero entries of sparse matrices with scipy.sparse. For example, if I do the following:

from scipy.sparse import lil_matrix

x = lil_matrix( (20,1) )
x[13,0] = 1
x[15,0] = 2

c = 0
for i in x:
  print c, i
  c = c+1

the output is

0 
1 
2 
3 
4 
5 
6 
7 
8 
9 
10 
11 
12 
13   (0, 0) 1.0
14 
15   (0, 0) 2.0
16 
17 
18 
19  

so it appears the iterator is touching every element, not just the nonzero entries. I've had a look at the API

http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.lil_matrix.html

and searched around a bit, but I can't seem to find a solution that works.

Alex Riley
RandomGuy

6 Answers


Edit: bbtrb's method (using coo_matrix) is much faster than my original suggestion, using nonzero. Sven Marnach's suggestion to use itertools.izip also improves the speed. Current fastest is using_tocoo_izip:

import scipy.sparse
import random
import itertools

def using_nonzero(x):
    rows,cols = x.nonzero()
    for row,col in zip(rows,cols):
        ((row,col), x[row,col])

def using_coo(x):
    cx = scipy.sparse.coo_matrix(x)    
    for i,j,v in zip(cx.row, cx.col, cx.data):
        (i,j,v)

def using_tocoo(x):
    cx = x.tocoo()    
    for i,j,v in zip(cx.row, cx.col, cx.data):
        (i,j,v)

def using_tocoo_izip(x):
    cx = x.tocoo()    
    for i,j,v in itertools.izip(cx.row, cx.col, cx.data):
        (i,j,v)

N=200
x = scipy.sparse.lil_matrix( (N,N) )
for _ in xrange(N):
    x[random.randint(0,N-1),random.randint(0,N-1)]=random.randint(1,100)

yields these timeit results:

% python -mtimeit -s'import test' 'test.using_tocoo_izip(test.x)'
1000 loops, best of 3: 670 usec per loop
% python -mtimeit -s'import test' 'test.using_tocoo(test.x)'
1000 loops, best of 3: 706 usec per loop
% python -mtimeit -s'import test' 'test.using_coo(test.x)'
1000 loops, best of 3: 802 usec per loop
% python -mtimeit -s'import test' 'test.using_nonzero(test.x)'
100 loops, best of 3: 5.25 msec per loop
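For Python 3, where `itertools.izip` and `xrange` no longer exist, the same approach might look like the sketch below (the built-in `zip` is already lazy in Python 3; the helper name `nonzero_entries` is mine, not scipy's):

```python
import scipy.sparse

def nonzero_entries(x):
    """Return a list of (row, col, value) triples for the stored entries of x."""
    cx = x.tocoo()
    # Python 3's zip is lazy, so itertools.izip is no longer needed.
    return list(zip(cx.row, cx.col, cx.data))

x = scipy.sparse.lil_matrix((20, 1))
x[13, 0] = 1
x[15, 0] = 2
print(nonzero_entries(x))  # only the two stored entries
```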
unutbu
  • How about using `izip()` instead of `zip()`? Should be faster for big matrices. – Sven Marnach Dec 01 '10 at 12:47
  • nice, didn't know about `izip()`. Actually I'm a bit surprised that `tocoo()` is faster than the `coo_matrix()` constructor... – bbtrb Dec 01 '10 at 14:43
  • The main point here is to use tocoo, but I strongly recommend not to iterate! All of x.row, x.col and x.data are numpy arrays which can be used as-is in most cases. Simple example, setting each matrix element to the product of its indices: x.data[:] = x.col*x.row – Juh_ Mar 13 '13 at 12:14
  • For Python 3 you can just use the built-in `zip` function: [python2to3](http://www.diveintopython3.net/porting-code-to-python-3-with-2to3.html) – Featherlegs Apr 18 '17 at 15:42
  • `x.nonzero()` first does `x.tocoo()`, and then masks with `A.data!=0` to ensure that there aren't any 'stray' 0s in the matrix. That's why it's slower. – hpaulj Jul 02 '17 at 17:40
  • @hpaulj: Thanks very much for the explanation. For others who are curious, here is the definition of [`nonzero`](https://github.com/scipy/scipy/blob/master/scipy/sparse/base.py#L686) for sparse matrices. – unutbu Jul 02 '17 at 18:30
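Juh_'s point about avoiding Python-level iteration entirely can be sketched like this (a minimal example; the matrix and its values are arbitrary):

```python
import scipy.sparse

x = scipy.sparse.lil_matrix((4, 4))
x[1, 2] = 7
x[3, 1] = 9

cx = x.tocoo()
# Vectorised update: set every stored value to the product of its
# indices, with no Python-level loop at all.
cx.data[:] = cx.row * cx.col
```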

The fastest way should be by converting to a coo_matrix:

cx = scipy.sparse.coo_matrix(x)

for i,j,v in zip(cx.row, cx.col, cx.data):
    print "(%d, %d), %s" % (i,j,v)
bbtrb
  • Is it faster to convert and then iterate, or is this assuming that I can change to work with coo_matrix? – RandomGuy Dec 01 '10 at 16:16
  • @scandido: this depends on what you are going to achieve. `coo_matrix` is a very simple format and very fast to construct and access but might be ill-suited for other tasks. Here's an overview over the different matrix formats http://www.scipy.org/SciPyPackages/Sparse and especially section "constructing from scratch faster, with coo_matrix" – bbtrb Dec 01 '10 at 16:46
    A warning: `cx.row` may be `int32` (see [this question](https://stackoverflow.com/questions/45245745/what-is-the-default-indexing-type-of-scipy-sparse-csr-matrix)), so a subprogram that expects `int64` can misbehave. For example, `pyglpk` gives "TypeError: row/col indices must be ints or strings" -- meaning int64. The fix is simple: `zip( cx.row.astype(int), cx.col.astype(int), cx.data )` – denis Jun 27 '19 at 14:12

To loop over the nonzero entries of a variety of sparse matrix formats from scipy.sparse, I would use this small wrapper function (note that for Python 2 you are encouraged to use xrange and izip for better performance on large matrices):

from scipy.sparse import *
def iter_spmatrix(matrix):
    """ Iterator for iterating the elements in a ``scipy.sparse.*_matrix`` 

    This will always return:
    >>> (row, column, matrix-element)

    Currently this can iterate `coo`, `csc`, `lil` and `csr`, others may easily be added.

    Parameters
    ----------
    matrix : ``scipy.sparse.spmatrix``
      the sparse matrix to iterate non-zero elements
    """
    if isspmatrix_coo(matrix):
        for r, c, m in zip(matrix.row, matrix.col, matrix.data):
            yield r, c, m

    elif isspmatrix_csc(matrix):
        for c in range(matrix.shape[1]):
            for ind in range(matrix.indptr[c], matrix.indptr[c+1]):
                yield matrix.indices[ind], c, matrix.data[ind]

    elif isspmatrix_csr(matrix):
        for r in range(matrix.shape[0]):
            for ind in range(matrix.indptr[r], matrix.indptr[r+1]):
                yield r, matrix.indices[ind], matrix.data[ind]

    elif isspmatrix_lil(matrix):
        for r in range(matrix.shape[0]):
            for c, d in zip(matrix.rows[r], matrix.data[r]):
                yield r, c, d

    else:
        raise NotImplementedError("The iterator for this sparse matrix has not been implemented")
nickpapior

tocoo() materializes the entire matrix into a different structure, and that extra copy can be a problem for very large matrices. You can also consider this iterator, which works directly on a csr_matrix and is especially useful when the matrix is large.

from itertools import chain, repeat

def iter_csr(matrix):
    # indptr[i+1] - indptr[i] is the number of stored elements in row i,
    # so row index i is repeated that many times: 0 0 ... 0 1 1 ... 1 ...
    row_lengths = matrix.indptr[1:] - matrix.indptr[:-1]
    rows = chain(*(repeat(i, r) for (i, r) in enumerate(row_lengths)))
    for (row, col, val) in zip(rows, matrix.indices, matrix.data):
        yield (row, col, val)

I have to admit that I'm using a lot of Python constructs that could probably be replaced by numpy constructs (especially enumerate).
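For instance, the `chain`/`repeat`/`enumerate` part can be replaced by `np.repeat` over the per-row counts (a sketch; the function name `iter_csr_numpy` is mine):

```python
import numpy as np
import scipy.sparse

def iter_csr_numpy(matrix):
    """Yield (row, col, value) for each stored element of a CSR matrix."""
    # np.diff(indptr) gives the number of stored elements per row;
    # np.repeat expands that into the row-index sequence 0 0 ... 0 1 1 ... 1 ...
    rows = np.repeat(np.arange(matrix.shape[0]), np.diff(matrix.indptr))
    yield from zip(rows, matrix.indices, matrix.data)
```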

NB:

In [43]: t=time.time(); sum(1 for x in rather_dense_sparse_matrix.data); print(time.time()-t)
52.48686504364014
In [44]: t=time.time(); sum(1 for x in enumerate(rather_dense_sparse_matrix.data)); print(time.time()-t)
70.19013023376465
In [45]: rather_dense_sparse_matrix
<99829x99829 sparse matrix of type '<class 'numpy.float16'>'
with 757622819 stored elements in Compressed Sparse Row format>

So yes, enumerate is somewhat slow(ish)

For the iterator:

In [47]: it = iter_csr(rather_dense_sparse_matrix)
In [48]: t=time.time(); sum(1 for x in it); print(time.time()-t)
113.something something

So you decide whether this overhead is acceptable; in my case, tocoo caused MemoryErrors.

IMHO: such an iterator should be part of the csr_matrix interface, similar to items() in a dict() :)

Herbert
  • That's supposed to be `matrix.indptr` rather than `comparisons.indptr`, isn't it? – zwol Aug 12 '16 at 14:12
  • No, `comparisons.indptr[1:] - comparisons.indptr[:-1]` is a vector with the lengths of the rows :) The chain / repeat part is just to iterate through the row indices, which are 0 0 0 ... 0 1 1 ... 1 2 2 ... 2 .... n n .... n (that is how csr works), so I simply repeat the row as much times as the number of elements in the row, and chain these all together. – Herbert Aug 15 '16 at 07:47
  • Uh, there's no variable 'comparisons' in scope! – zwol Aug 15 '16 at 13:46

I had the same problem. If your concern is only speed, the fastest way (more than an order of magnitude faster) is to convert the sparse matrix to a dense one (x.todense()) and iterate over the nonzero elements in the dense matrix. (Though, of course, this approach requires a lot more memory.)
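A sketch of that approach (the helper name `iter_nonzero_dense` is mine; it trades memory for speed):

```python
import numpy as np
import scipy.sparse

def iter_nonzero_dense(x):
    """Convert to a dense array, then yield (row, col, value) for nonzero entries."""
    d = np.asarray(x.todense())  # uses far more memory than staying sparse
    for r, c in zip(*np.nonzero(d)):
        yield r, c, d[r, c]
```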

Davide C
  • Are you proposing using dense matrices all the time, or converting to dense and then iterating? I can't imagine the latter would be faster. But, of course using a dense matrix will be much, much faster if you have enough memory. – RandomGuy Dec 29 '10 at 16:58
  • I guess it depends on the scenario and the kind of data. I've been doing some profiling on a script that iterates over matrices containing at least 50-100M boolean elements. When iterating, converting to dense and then iterating requires far less time than iterating using the 'best solution' from unutbu's answer. But of course the memory usage increases a lot. – Davide C Dec 29 '10 at 23:56

Try `filter(lambda x: x, x)` instead of `x`.

Conner
Kabie