Using pytables, which is more efficient: scipy.sparse or numpy dense matrix?

Question

When using pytables, there's no support (as far as I can tell) for the scipy.sparse matrix formats, so to store a matrix I have to do some conversion, e.g.

from scipy import sparse

def store_sparse_matrix(self):
    # M is the scipy.sparse matrix to store; convert to CSR once and reuse it
    csr = M.tocsr()
    h5 = self.getFileHandle()
    grp1 = h5.createGroup(self.getGroup(), 'M')
    h5.createArray(grp1, 'data', csr.data)
    h5.createArray(grp1, 'indptr', csr.indptr)
    h5.createArray(grp1, 'indices', csr.indices)

def get_sparse_matrix(self):
    # read() loads each array fully into memory before the CSR matrix is rebuilt
    grp1 = self.getGroup().M
    return sparse.csr_matrix((grp1.data.read(), grp1.indices.read(), grp1.indptr.read()))

The trouble is that get_sparse_matrix takes some time (it reads from disk) and, if I understand correctly, also requires the data to fit into memory.

The only other option seems to be to convert the matrix to dense format (a NumPy array) and then use pytables normally. However, this seems rather inefficient, although perhaps pytables will deal with the compression itself?

Using a regular NumPy array will certainly require the whole matrix, including zeros, to fit in memory. You also won't be able to exploit sparsity in your algorithms. But what exactly is your question? — Fred Foo, Jan 17 '12 at 13:15
@larsmans but when using a NumPy array in combination with pytables, my understanding is that it only loads from disk when necessary, and therefore the whole matrix does not have to fit in memory. However it does require that the whole matrix, including zeros, is stored on disk (and therefore read/written to disk when necessary too). It seems like this would create unnecessary overhead, when in reality only a few values might need to be read. However without native support for scipy.sparse I can't see how to avoid this? — tdc, Jan 17 '12 at 13:26
So the question is, "will this demand-load the CSR matrix?" I can't answer that since I don't know PyTables. I do know that you can build a CSR matrix backed by mmap'd arrays... — Fred Foo, Jan 17 '12 at 13:37
Did you ever find a good solution to this? I'm encountering the same problem and am almost resigned to just converting to dense format in order to use pytables normally. — Jesse Sherlock, Aug 16 '12 at 01:22
@JesseSherlock Unfortunately not really. In the end we've decided to ditch pytables, as it's really not suited to sparse matrices at all, and it was more important for us to be able to use scipy. The solution we have is to serialize rows of the matrix to disk (actually mongodb) independently in sparse format (column indices and data). That way if a row of the matrix changes, that row is all that needs to be flushed. However this is somewhat inflexible to changes in the columns, which require a complete flush (or some headache-making code ...), but that's a rare occurrence in our application. — tdc, Aug 16 '12 at 10:12
You are dealing with a single sparse matrix which does not fit into ram? That's rather hardcore. The fact that you are even thinking about treating the same matrix as dense in any sort of way, means that one of the two of us is missing something here. — Eelco Hoogendoorn, Dec 27 '13 at 22:12
As for incrementally updating the matrix on disk; depending on the structure of your matrix, the chunked-storage of pytables might suit your needs just fine. Only chunks with nonzero values actually get stored to disk. — Eelco Hoogendoorn, Dec 27 '13 at 22:17
Another possibility that just occurred to me, which should address all your needs, if I understand them correctly; you should be able to store the matrix in a coordinate format (i,j,data) in pytables. Then if you index both the i and j columns, updates to both rows and columns using table.where() should be very efficient, no matter how large your matrix. — Eelco Hoogendoorn, Dec 31 '13 at 09:41
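
The coordinate-format idea from the last comment can be sketched roughly as follows. This is only an illustration under assumptions, not code from the thread: the `Entry` description, the helper names `write_coo` and `read_row`, and the `n_rows`/`n_cols` attributes are hypothetical, and the current snake_case PyTables API (`open_file`, `create_table`) is used rather than the older camelCase calls in the question.

```python
import os
import tempfile

import numpy as np
import tables
from scipy import sparse


class Entry(tables.IsDescription):
    # One nonzero matrix element per table row (hypothetical layout).
    i = tables.Int64Col(pos=0)
    j = tables.Int64Col(pos=1)
    data = tables.Float64Col(pos=2)


def write_coo(path, matrix):
    """Store the nonzeros of `matrix` as (i, j, data) rows and index i and j."""
    coo = matrix.tocoo()
    with tables.open_file(path, mode="w") as h5:
        table = h5.create_table("/", "M", Entry, expectedrows=max(coo.nnz, 1))
        # The triplets alone do not determine the shape, so record it too.
        table.attrs.n_rows = int(coo.shape[0])
        table.attrs.n_cols = int(coo.shape[1])
        rows = np.empty(coo.nnz, dtype=table.dtype)
        rows["i"], rows["j"], rows["data"] = coo.row, coo.col, coo.data
        table.append(rows)
        table.cols.i.create_index()
        table.cols.j.create_index()


def read_row(path, row):
    """Load only the nonzeros of one matrix row as a 1 x n_cols CSR matrix."""
    with tables.open_file(path, mode="r") as h5:
        table = h5.root.M
        n_cols = int(table.attrs.n_cols)
        hits = table.read_where("i == %d" % int(row))
    cols = hits["j"]
    zeros = np.zeros(len(cols), dtype=np.int64)
    return sparse.csr_matrix((hits["data"], (zeros, cols)), shape=(1, n_cols))
```

With this layout, only the entries matching the query are read from disk, so a single row (or, querying on `j`, a single column) can be fetched without loading the rest of the matrix. Whether the indexed queries are fast enough for a given matrix would need to be measured.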

score 2 · Answer 1 · edited May 23 '17 at 11:46

Borrowing from Storing numpy sparse matrix in HDF5 (PyTables), you can marshal a scipy.sparse matrix into a pytables format using its data, indices, and indptr attributes, which are three regular numpy.ndarray objects.

edited May 23 '17 at 11:46

Community

answered Jan 23 '14 at 20:54

IanSR

The trouble with this approach is that you'd have to deserialize the whole thing to use the scipy matrix operations on it, or write your own equivalents that work on the data items in pytables. – tdc Jan 24 '14 at 09:57
Also see [my answer](http://stackoverflow.com/a/22589030/2858145) on the question you link... you have to store the `shape` attribute too. – Pietro Battiston Apr 03 '14 at 15:26
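
A minimal round-trip sketch of the approach in this answer, including the `shape` attribute that the comment above says must also be stored. The function names `store_csr` and `load_csr`, the group name, and the choice to store the shape as a fourth small array are assumptions made for illustration, and the current snake_case PyTables API is assumed.

```python
import os
import tempfile

import numpy as np
import tables
from scipy import sparse

CSR_PARTS = ("data", "indices", "indptr", "shape")


def store_csr(path, name, matrix):
    """Write the CSR component arrays and the shape under the group /<name>."""
    csr = matrix.tocsr()
    with tables.open_file(path, mode="a") as h5:
        group = h5.create_group("/", name)
        for part in CSR_PARTS:
            # Assumes at least one nonzero, so none of the arrays is empty.
            h5.create_array(group, part, np.asarray(getattr(csr, part)))


def load_csr(path, name):
    """Read all four arrays into memory and rebuild the scipy.sparse matrix."""
    with tables.open_file(path, mode="r") as h5:
        group = h5.get_node("/", name)
        parts = {part: h5.get_node(group, part).read() for part in CSR_PARTS}
    return sparse.csr_matrix(
        (parts["data"], parts["indices"], parts["indptr"]),
        shape=tuple(int(n) for n in parts["shape"]),
    )
```

Passing the stored shape explicitly matters when the matrix has trailing all-zero columns, since otherwise the column count is inferred from the largest column index present. As noted in the comments, this still loads the whole matrix into memory on read.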
