35

I noticed Pandas now has support for Sparse Matrices and Arrays. Currently, I create DataFrame()s like this:

return DataFrame(matrix.toarray(), columns=features, index=observations)

Is there a way to create a SparseDataFrame() with a scipy.sparse.csc_matrix() or csr_matrix()? Converting to dense format kills RAM badly. Thanks!

Will
  • 1
    There is now an experimental API: http://pandas-docs.github.io/pandas-docs-travis/sparse.html#interaction-with-scipy-sparse – K3---rnc Mar 10 '16 at 20:24

3 Answers

30

A direct conversion is not supported ATM. Contributions are welcome!

Try this; it should be fine on memory, since a SparseSeries is much like a csc_matrix (for one column) and is pretty space-efficient:

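In [34]: import numpy as np; import pandas as pd

In [35]: from scipy.sparse import csc_matrix

In [36]: row = np.array([0,2,2,0,1,2])

In [37]: col = np.array([0,0,1,2,2,2])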

In [38]: data = np.array([1,2,3,4,5,6],dtype='float64')

In [39]: m = csc_matrix( (data,(row,col)), shape=(3,3) )

In [40]: m
Out[40]: 
<3x3 sparse matrix of type '<type 'numpy.float64'>'
        with 6 stored elements in Compressed Sparse Column format>

In [46]: pd.SparseDataFrame([ pd.SparseSeries(m[i].toarray().ravel()) 
                              for i in np.arange(m.shape[0]) ])
Out[46]: 
   0  1  2
0  1  0  4
1  0  0  5
2  2  3  6

In [47]: df = pd.SparseDataFrame([ pd.SparseSeries(m[i].toarray().ravel()) 
                                   for i in np.arange(m.shape[0]) ])

In [48]: type(df)
Out[48]: pandas.sparse.frame.SparseDataFrame
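
On pandas releases where SparseSeries and SparseDataFrame are no longer available (both were removed in pandas 1.0), the same per-column idea can be sketched with pd.arrays.SparseArray. This is a minimal sketch and not the code from the session above: it assumes pandas 1.0 or later, it slices columns rather than rows because m is stored in CSC format, and the `row` values are chosen to reproduce the 3x3 matrix printed above.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csc_matrix

# Same 3x3 matrix as in the session above; `row` is chosen to match the printed output.
row = np.array([0, 2, 2, 0, 1, 2])
col = np.array([0, 0, 1, 2, 2, 2])
data = np.array([1, 2, 3, 4, 5, 6], dtype="float64")
m = csc_matrix((data, (row, col)), shape=(3, 3))

# Densify one column at a time (cheap for CSC) and wrap it in a SparseArray,
# so the full dense matrix is never materialised at once.
columns = {
    j: pd.arrays.SparseArray(m[:, [j]].toarray().ravel(), fill_value=0.0)
    for j in range(m.shape[1])
}
df = pd.DataFrame(columns)

print(df.dtypes)          # every column has a Sparse dtype
print(df.sparse.density)  # fraction of stored (non-fill) values
```

Because each column is taken from the matrix directly, the resulting frame has the same orientation as m; no transpose is needed.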
El Developer
Jeff
  • Awesome, thanks! Just thinking aloud here, but since the SciPy sparse formats are really just an array of data and two arrays of indices, could we somehow just populate the `SparseDataFrame` with that? – Will Jul 23 '13 at 23:33
  • 4
    It's best (in the current implementation) to populate per series (column), which then basically creates an internal index (called an int index) or a block index (sort of like bsr/csr) to locate the values. What kinds of operations are you thinking of doing? – Jeff Jul 23 '13 at 23:35
  • Would this be different for a csr matrix or is this still the recommended way? – Sid Nov 03 '15 at 17:27
  • 1
    Jeff, using this method doesn't save memory in my case: `df.memory_usage().sum()` is the same as if I had just created the DataFrame with `pd.DataFrame(mtx.todense())`. However, if I add the `to_sparse` method, as in `pd.DataFrame(mtx.todense()).to_sparse(fill_value=0)`, and call `df.memory_usage().sum()` again, it is lower. Maybe this is easy to answer, but I'm kind of stuck. – Timothy Dalton Mar 22 '16 at 13:12
  • 1
    Not really sure what you are doing; this was on quite an old version. Try with a newer pandas, and if that doesn't help, please open an issue or an SO question. – Jeff Mar 22 '16 at 13:22
  • It is worth noting that the matrix in your code is CSC, and getting a row from a CSC matrix is slow. It is much faster either to convert the whole matrix to CSR format, or to use getcol(i) to get a column instead of a row. If you take columns, keep in mind that the resulting df is the transposed matrix. – eSadr Apr 09 '19 at 23:48
19

As of pandas v0.20.0 you can pass a scipy.sparse matrix directly to the SparseDataFrame constructor.

An example from the pandas docs:

import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

arr = np.random.random(size=(1000, 5))
arr[arr < .9] = 0
sp_arr = csr_matrix(arr)
sdf = pd.SparseDataFrame(sp_arr)
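
SparseDataFrame was later deprecated (pandas 0.25) and removed (pandas 1.0). On those releases, a rough equivalent is `pd.DataFrame.sparse.from_spmatrix`, which also accepts the `index` and `columns` labels from the question. The following is a minimal sketch assuming pandas 1.0 or later; the `features` and `observations` values are made-up placeholders, and the random generator is seeded only to make the run repeatable.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
arr = rng.random(size=(1000, 5))
arr[arr < 0.9] = 0  # keep roughly 10% of the entries
sp_arr = csr_matrix(arr)

# Hypothetical labels standing in for `features` and `observations` in the question.
features = [f"feature_{j}" for j in range(sp_arr.shape[1])]
observations = [f"obs_{i}" for i in range(sp_arr.shape[0])]

# Build a DataFrame with sparse columns without calling toarray() on the matrix.
sdf = pd.DataFrame.sparse.from_spmatrix(sp_arr, index=observations, columns=features)

# Compare against the dense construction from the question.
dense = pd.DataFrame(arr, index=observations, columns=features)
sparse_bytes = sdf.memory_usage().sum()
dense_bytes = dense.memory_usage().sum()
print(sparse_bytes, dense_bytes)
```

Comparing `memory_usage().sum()` for the two frames is one way to check that the sparse columns really are smaller, and `sdf.sparse.to_coo()` converts back to a SciPy sparse matrix if needed.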
Alex
-8

A much shorter version:

df = pd.DataFrame(m.toarray())
Boris Gorelik
  • 10
    Unfortunately, `toarray()` converts a sparse matrix into a dense matrix, and uses ridiculous amounts of memory. – Will Nov 05 '15 at 04:24
  • 1
    It's simple and short code and for my relatively small dataset the memory consumption was an acceptable tradeoff. – DaReal Aug 27 '19 at 09:46