Populate a Pandas SparseDataFrame from a SciPy Sparse Matrix

Question

I noticed Pandas now has support for Sparse Matrices and Arrays. Currently, I create DataFrame()s like this:

return DataFrame(matrix.toarray(), columns=features, index=observations)

Is there a way to create a SparseDataFrame() with a scipy.sparse.csc_matrix() or csr_matrix()? Converting to dense format kills RAM badly. Thanks!

There is now an experimental API: http://pandas-docs.github.io/pandas-docs-travis/sparse.html#interaction-with-scipy-sparse — K3---rnc, Mar 10 '16 at 20:24

score 30 · Accepted Answer · edited Apr 28 '15 at 21:45


A direct conversion is not supported at the moment. Contributions are welcome!

Try this. It should be fine on memory, since a SparseSeries is much like a csc_matrix (for one column) and is quite space-efficient:

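In [35]: import numpy as np; import pandas as pd; from scipy.sparse import csc_matrix

In [36]: row = np.array([0,2,2,0,1,2])

In [37]: col = np.array([0,0,1,2,2,2])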

In [38]: data = np.array([1,2,3,4,5,6],dtype='float64')

In [39]: m = csc_matrix( (data,(row,col)), shape=(3,3) )

In [40]: m
Out[40]: 
<3x3 sparse matrix of type '<type 'numpy.float64'>'
        with 6 stored elements in Compressed Sparse Column format>

In [46]: pd.SparseDataFrame([ pd.SparseSeries(m[i].toarray().ravel()) 
                              for i in np.arange(m.shape[0]) ])
Out[46]: 
   0  1  2
0  1  0  4
1  0  0  5
2  2  3  6

In [47]: df = pd.SparseDataFrame([ pd.SparseSeries(m[i].toarray().ravel()) 
                                   for i in np.arange(m.shape[0]) ])

In [48]: type(df)
Out[48]: pandas.sparse.frame.SparseDataFrame

edited Apr 28 '15 at 21:45 · El Developer
answered Jul 23 '13 at 19:32 · Jeff

Awesome, thanks! Just thinking aloud here, but since the SciPy sparse formats are really just an array of data and two arrays of indices, could we somehow just populate the `SparseDataFrame` with that? – Will Jul 23 '13 at 23:33

It's best (in the current implementation) to populate per series (column), which then creates an internal index (an int index, or a block index, somewhat like bsr/csr) to locate the values. What kinds of operations are you thinking of doing? – Jeff Jul 23 '13 at 23:35
Would this be different for a csr matrix or is this still the recommended way? – Sid Nov 03 '15 at 17:27

Jeff, using this method doesn't save memory in my case: `df.memory_usage().sum()` gives the same result as when I create the DataFrame with `pd.DataFrame(mtx.todense())`. If, however, I add `to_sparse`, as in `pd.DataFrame(mtx.todense()).to_sparse(fill_value=0)`, and call `df.memory_usage().sum()` again, the result is lower. Maybe this is easy to answer, but I'm kind of stuck. – Timothy Dalton Mar 22 '16 at 13:12

Not really sure what you are doing; this was on a quite old version. Try with a newer pandas, and if that doesn't help, please open an issue/SO question. – Jeff Mar 22 '16 at 13:22
It is worth noting that the matrix in your code is CSC, and extracting a row from it is slow. It is much faster either to convert the whole matrix to CSR format or to use getcol(i) to get a column instead of a row. If you take columns, keep in mind that the resulting df is the transpose of the matrix. – eSadr Apr 09 '19 at 23:48
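To make the column-wise approach from the comments concrete, here is a minimal sketch for recent pandas versions, where `SparseSeries` and `SparseDataFrame` no longer exist. It assumes `pd.arrays.SparseArray` as the per-column container, which is not what the answer used, and the `row` indices are inferred from the matrix printed in the answer. Each column is densified on its own, so the full dense matrix is never built.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csc_matrix

# The 3x3 matrix from the answer; `row` is inferred from its printed output.
row = np.array([0, 2, 2, 0, 1, 2])
col = np.array([0, 0, 1, 2, 2, 2])
data = np.array([1, 2, 3, 4, 5, 6], dtype="float64")
m = csc_matrix((data, (row, col)), shape=(3, 3))

# Slice one column at a time (cheap for CSC), densify only that column,
# and wrap it in a SparseArray that stores just the non-zero entries.
columns = {
    j: pd.arrays.SparseArray(m[:, j].toarray().ravel(), fill_value=0.0)
    for j in range(m.shape[1])
}

# Keying the dict by column index keeps the original orientation,
# so no transpose is needed afterwards.
df = pd.DataFrame(columns)
print(df.dtypes)
```

Because the columns are passed as a dict rather than as a list of row-like series, the transposition issue mentioned in the last comment does not arise here.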

score 19 · Answer 2 · answered Jun 07 '17 at 21:43

As of pandas v0.20.0, you can pass a SciPy sparse matrix directly to the SparseDataFrame constructor.

An example from the pandas docs:

import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

arr = np.random.random(size=(1000, 5))
arr[arr < .9] = 0
sp_arr = csr_matrix(arr)
sdf = pd.SparseDataFrame(sp_arr)
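
Note that `SparseDataFrame` was deprecated in pandas 0.25 and removed in 1.0, so the snippet above only runs on older releases. A minimal sketch of the equivalent on newer versions, assuming pandas 0.25 or later, goes through the `DataFrame.sparse` accessor. The `features` labels are hypothetical placeholders standing in for the column names from the question.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Roughly 10% of the entries are non-zero, as in the docs example.
rng = np.random.default_rng(0)
arr = rng.random(size=(1000, 5))
arr[arr < 0.9] = 0
sp_arr = csr_matrix(arr)

# Hypothetical column labels, for illustration only.
features = [f"feature_{j}" for j in range(sp_arr.shape[1])]

# Build a regular DataFrame whose columns have a sparse dtype,
# without converting the matrix to a dense array first.
df = pd.DataFrame.sparse.from_spmatrix(sp_arr, columns=features)

# Round trip back to SciPy (COO format) when needed.
back = df.sparse.to_coo()

# Compare column memory against a dense DataFrame of the same data.
sparse_mem = df.memory_usage(index=False).sum()
dense_mem = pd.DataFrame(arr).memory_usage(index=False).sum()
print(df.sparse.density, sparse_mem, dense_mem)
```

The result is an ordinary `DataFrame`, so the sparse-specific operations live under `df.sparse` rather than on a separate class.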

score -8 · Answer 3 · answered Nov 04 '15 at 06:47


A much shorter version:

df = pd.DataFrame(m.toarray())

answered Nov 04 '15 at 06:47 · Boris Gorelik


Unfortunately, `toarray()` converts a sparse matrix into a dense matrix, and uses ridiculous amounts of memory. – Will Nov 05 '15 at 04:24
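
To put rough numbers on that cost, the sketch below compares the bytes held by a CSR matrix with the bytes a dense copy would need, without ever allocating the dense array. The matrix size and density are made-up values chosen for illustration, not figures from this thread.

```python
from scipy.sparse import random as sparse_random

# Illustrative matrix: 10,000 x 1,000 with about 1% non-zero float64 entries.
m = sparse_random(10_000, 1_000, density=0.01, format="csr")

# Bytes actually held by the CSR structure: values plus the two index arrays.
sparse_bytes = m.data.nbytes + m.indices.nbytes + m.indptr.nbytes

# Bytes a dense copy (what toarray() returns) would need, computed arithmetically.
dense_bytes = m.shape[0] * m.shape[1] * m.dtype.itemsize

print(f"sparse: {sparse_bytes / 1e6:.1f} MB, dense: {dense_bytes / 1e6:.1f} MB")
```

For small matrices the gap may be an acceptable tradeoff, but it grows with the number of zero entries.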

It's simple and short code and for my relatively small dataset the memory consumption was an acceptable tradeoff. – DaReal Aug 27 '19 at 09:46
