
I'm trying to create a list of SparseSeries from a sparse numpy matrix. Creating the lil_matrix is fast and does not consume a lot of memory (in reality my dimensions are more on the order of millions, i.e. 15 million samples and 4 million features). I have read a previous topic on this, but that solution too seems to eat up all my memory, freezing my computer. On the surface it looks like the pandas SparseSeries is not really sparse, or am I doing something wrong? The ultimate goal is to create a SparseDataFrame from this (just like in the other topic I referred to).

from scipy.sparse import lil_matrix
from numpy import random
import pandas as pd

nsamples = 10**5
nfeatures = 10**4

# Random binary matrix: 4 nonzero features per sample
rm = lil_matrix((nsamples, nfeatures))
for i in xrange(nsamples):
    index = random.randint(0, nfeatures, size=4)
    rm[i, index] = 1

# Convert each row to dense, then to a SparseSeries with fill_value 0
l = []
for i in xrange(nsamples):
    l.append(pd.Series(rm[i, :].toarray().ravel()).to_sparse(fill_value=0))
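
One way to check how sparse a single converted series really ends up (a small sketch; density on a SparseSeries reports the fraction of explicitly stored points, which here should be about 4/nfeatures):

s = pd.Series(rm[0, :].toarray().ravel()).to_sparse(fill_value=0)
print(s.density)  # ~0.0004 for 4 nonzeros out of 10**4 columns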

1 Answer


Since your goal is a sparse dataframe, I skipped the Series stage and went straight to a dataframe. I only had the patience to do this on a smaller array size:

nsamples = 10**3 
nfeatures = 10**2

Creation of rm is the same, but instead of loading into a list, I build the dataframe directly:

# Seed the frame with row 0 as the first column, already sparse
df = pd.DataFrame(rm[0, :].toarray().ravel()).to_sparse(fill_value=0)
# Add the remaining rows one at a time; each new column is compressed on
# assignment. Note the result is rm transposed: samples become columns.
for i in xrange(1, nsamples):
    df[i] = rm[i, :].toarray().ravel()

This is unfortunately much slower to run than what you have, but the result is a dataframe rather than a list. I played around with this a little, and as best I can tell there is no fast way to build a large, sparse dataframe column by column rather than all at once (which is not going to be memory efficient). All of the examples in the documentation that I could find start with a dense structure and then convert to sparse in one step, as sketched below.
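
For reference, that one-step pattern looks like this; it is simple and fast, but it materializes the full dense array in memory before compressing, which is exactly what you are trying to avoid at your real dimensions:

# One-step dense-to-sparse conversion from the docs: fast, but the whole
# nsamples x nfeatures array exists dense in memory first
df_dense = pd.DataFrame(rm.toarray())
df_sparse = df_dense.to_sparse(fill_value=0)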

In any event, this way should be fairly memory efficient, since it compresses one column at a time and you never hold the full array/dataframe uncompressed at once. The resulting dataframe is definitely sparse:

In [39]: type(df)
Out[39]: pandas.sparse.frame.SparseDataFrame

and definitely saves space (about 25x compression):

In [40]: df.memory_usage().sum()
Out[40]: 31528

In [41]: df.to_dense().memory_usage().sum()
Out[41]: 800000
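
As a final note, newer pandas (0.25 and later) deprecated SparseSeries/SparseDataFrame in favor of columns backed by a SparseDtype, and those versions can build the frame straight from the scipy matrix with no dense intermediate; a minimal sketch of that API:

# pandas >= 0.25 only: build a SparseDtype-backed frame directly from scipy
df2 = pd.DataFrame.sparse.from_spmatrix(rm.tocsr())
print(df2.sparse.density)  # fraction of explicitly stored values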