
I would like to create a large sparse matrix whose source data can't be fully loaded into memory. Think of a very big file on disk that can't be read all at once.

I have thought about it, but I couldn't find a way to create a sparse matrix from a generator.

import random
from scipy.sparse import coo_matrix
matrix1 = coo_matrix(range(10))  # works: a 1x10 sparse matrix with 9 stored (nonzero) elements
data = ((0, 1, random.randint(0, 5)) for i in range(10))  # generator example
matrix2 = coo_matrix(data)  # does not work

Any idea?

Edit: I found this, haven't tried it yet but it looks helpful.

Baskaya
  • What are you trying to do with this data and matrix? – hpaulj Oct 08 '14 at 13:29
  • The sparse matrix itself can't be a generator. The key data structures for a `coo_matrix` are 3 `numpy` arrays. If the source data can't fit in memory, the sparse matrix can't fit either. – hpaulj Oct 08 '14 at 15:31
  • `pytables` might work - http://stackoverflow.com/questions/11129429 – hpaulj Oct 08 '14 at 15:41
  • Thanks. Hmm, isn't it a bit strange to say that if the source data can't fit in memory, the sparse matrix can't either? I don't mean that the sparse matrix should contain a generator. I would like to create a sparse matrix efficiently, row by row. That way, I *may* end up with sparse data that fits in memory even though its dense representation would not. What do you think? – Baskaya Oct 08 '14 at 18:52
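
To make the memory point from the comments concrete, here is a minimal sketch that compares the size of the three arrays behind a `coo_matrix` with the size of the equivalent dense array. The shape and the three entries are made up for illustration, and the code assumes Python 3.

```python
import numpy as np
from scipy import sparse

# Hypothetical data: three nonzero entries in a 1000 x 1000 matrix.
rows = np.array([0, 2, 2])
cols = np.array([1, 0, 3])
vals = np.array([1.5, 2.5, 3.5])

S = sparse.coo_matrix((vals, (rows, cols)), shape=(1000, 1000))

# A coo_matrix stores three arrays: row indices, column indices, values.
sparse_bytes = S.row.nbytes + S.col.nbytes + S.data.nbytes

# A dense array of the same shape and dtype stores every element.
dense_bytes = S.shape[0] * S.shape[1] * S.data.itemsize

print(sparse_bytes, dense_bytes)
```

The sparse form grows with the number of nonzero entries, not with the shape, so it can fit in memory when the dense form would not. If the source is a list of triplets, though, the sparse matrix is roughly as large as that list.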

1 Answer


Here's an example of using a generator to populate a sparse matrix. I use the generator to fill a structured array, and create the sparse matrix from its fields.

import numpy as np
from scipy import sparse

N, M = 3, 4

def foo(N, M):
    # just a simple dense matrix of random data
    cnt = 0
    for i in range(N):
        for j in range(M):
            yield cnt, (i, j, np.random.random())
            cnt += 1

# one record per element: row index, column index, value
dt = np.dtype([('i', int), ('j', int), ('data', float)])
X = np.empty((N*M,), dtype=dt)
for cnt, tup in foo(N, M):
    X[cnt] = tup

print(X.shape)
print(X['i'])
print(X['j'])
print(X['data'])
S = sparse.coo_matrix((X['data'], (X['i'], X['j'])), shape=(N, M))
print(S.shape)
print(S.toarray())

producing something like:

(12,)
[0 0 0 0 1 1 1 1 2 2 2 2]
[0 1 2 3 0 1 2 3 0 1 2 3]
[ 0.99268494  0.89277993  0.32847213  0.56583702  0.63482291  0.52278063
  0.62564791  0.15356269  0.1554067   0.16644956  0.41444479  0.75105334]
(3, 4)
[[ 0.99268494  0.89277993  0.32847213  0.56583702]
 [ 0.63482291  0.52278063  0.62564791  0.15356269]
 [ 0.1554067   0.16644956  0.41444479  0.75105334]]

All of the nonzero data points will exist in memory in two forms: the fields of `X`, and the `row`, `col` and `data` arrays of the sparse matrix.
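
When the number of nonzero entries is not known in advance, a preallocated array like `X` is not an option. One possible variation, sketched here for Python 3, reads the generator in fixed-size chunks and concatenates the per-chunk arrays at the end. The names `triplets`, `coo_from_generator` and `chunk_size` are hypothetical helpers for this illustration, not part of SciPy.

```python
from itertools import islice

import numpy as np
from scipy import sparse

def triplets(n_rows, n_cols):
    # Hypothetical stand-in for a large source: yields (row, col, value)
    # only for the entries that are nonzero.
    for i in range(n_rows):
        for j in range(n_cols):
            if (i + j) % 3 == 0:
                yield i, j, float(i * n_cols + j + 1)

def coo_from_generator(gen, shape, chunk_size=1000):
    # Consume the generator chunk by chunk, so that only one chunk of
    # Python tuples is alive at a time; the rest is held as compact arrays.
    rows, cols, vals = [], [], []
    while True:
        chunk = list(islice(gen, chunk_size))
        if not chunk:
            break
        i, j, v = zip(*chunk)
        rows.append(np.asarray(i, dtype=np.intp))
        cols.append(np.asarray(j, dtype=np.intp))
        vals.append(np.asarray(v, dtype=float))
    if not rows:
        return sparse.coo_matrix(shape)  # empty matrix of the given shape
    return sparse.coo_matrix(
        (np.concatenate(vals), (np.concatenate(rows), np.concatenate(cols))),
        shape=shape,
    )

S = coo_from_generator(triplets(4, 5), shape=(4, 5), chunk_size=3)
print(S.toarray())
```

The final `np.concatenate` calls still need room for all nonzero entries at once, so this only helps when the sparse result itself fits in memory.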

A structured array like X could also be loaded from the columns of a csv file.
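
As a rough sketch of the CSV route, assuming Python 3 and a file whose three columns are row index, column index and value: `np.genfromtxt` can read straight into a structured dtype. The CSV content below is made up, and an in-memory `io.StringIO` stands in for a real file so that the example is self-contained.

```python
import io

import numpy as np
from scipy import sparse

# Hypothetical CSV content with columns: row index, column index, value.
csv_text = "0,1,2.5\n1,0,1.0\n2,3,4.0\n"

dt = np.dtype([('i', int), ('j', int), ('data', float)])
# Each CSV column is loaded into the matching field of the structured array.
X = np.genfromtxt(io.StringIO(csv_text), delimiter=',', dtype=dt)

S = sparse.coo_matrix((X['data'], (X['i'], X['j'])), shape=(3, 4))
print(S.toarray())
```

With a real file, the `io.StringIO` object would be replaced by the file's path.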

A couple of the sparse matrix formats let you set data elements, e.g.

S = sparse.lil_matrix((N, M))
for cnt, tup in foo(N, M):
    i, j, value = tup
    S[i, j] = value  # set one element at a time
print(S.toarray())

The efficiency warnings from `scipy.sparse` tell me that `lil` is the least expensive format for this type of element-by-element assignment.
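
A common follow-up, sketched here for Python 3, is to fill a `lil_matrix` incrementally and then convert it to `csr` once before doing arithmetic. The `entries` generator and its values are made up for illustration.

```python
import numpy as np
from scipy import sparse

def entries():
    # Hypothetical generator of (row, col, value) triplets.
    yield 0, 1, 2.0
    yield 2, 3, 5.0
    yield 1, 0, -1.0

# Fill element by element in lil format.
L = sparse.lil_matrix((3, 4))
for i, j, value in entries():
    L[i, j] = value

# Convert once, after the fill, to a format suited to arithmetic.
C = L.tocsr()
y = C.dot(np.ones(4))  # row sums, computed as a matrix-vector product
print(y)
```

Doing the conversion a single time after the fill keeps the element-by-element assignments in the format that handles them cheaply.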

hpaulj