32

The pandas read_csv function doesn't seem to have a sparse option. I have CSV data with a ton of zeros in it (it compresses very well, and stripping out the 0 values reduces the file to almost half its original size).

I've tried loading it into a dense matrix first with read_csv and then calling to_sparse, but it takes a long time and chokes on the text fields, even though most of the data is floating point. If I call pandas.get_dummies(df) first to convert the categorical columns to ones and zeros, and then call to_sparse(fill_value=0), it takes an absurd amount of time, much longer than I would expect for a mostly numeric table with 12 million entries, most of them zero. This happens even if I strip the zeros out of the original file and call to_sparse() (so that the fill value is NaN), and regardless of whether I pass kind='block' or kind='integer'.
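
Roughly, the slow path I'm describing looks like this (a sketch of the shape of it, not my exact code; the filename is just a placeholder):

import pandas as pd

df = pd.read_csv('data.csv')        # dense load; the text columns are what to_sparse chokes on
df = pd.get_dummies(df)             # expand categorical columns into 0/1 indicator columns
sdf = df.to_sparse(fill_value=0)    # extremely slow on a mostly-zero table of ~12 million entries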

Other than building the sparse dataframe by hand, is there a good, smooth way to load a sparse csv directly without eating up gobs of unnecessary memory?


Here is some code to create a sample dataset that has 3 columns of floating point data and one column of text data. Approximately 85% of the float values are zero, and the total size of the CSV is approximately 300 MB, but you will probably want to make this larger to really test the memory constraints.

import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame(np.random.randn(10000000, 3), columns=list('xyz'))
df[df < 1.0] = 0.0      # zeroes out ~85% of the values (standard-normal draws below 1.0)
df['txt'] = np.random.choice(list('abcdefghij'), size=len(df))
df.to_csv('test.csv', index=False)

And here is a simple way to read it, but hopefully there is a better, more efficient way:

sdf = pd.read_csv('test.csv', dtype={'txt': 'category'}).to_sparse(fill_value=0.0)

Edit to Add (from JohnE): If possible, please provide some relative performance stats on reading large CSVs in your answer, including info on how you measured memory efficiency (especially as memory efficiency is harder to measure than clock time). In particular, note that a slower (clock time) answer could be the best answer here, if it is more memory efficient.

JohnE
  • Have you tried passing a dictionary to the `read_csv` kwarg `dtype`? My hunch is that if you declare the column dtypes as `int`, this will substantially speed up the `to_sparse` method. `Ctrl+f` 'dtype' [here](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) – user2734178 Aug 08 '15 at 05:09
  • @user2734178 I have. Unfortunately it didn't help. – Josephine Moeller Aug 08 '15 at 21:44
  • I imagine the most memory efficient approach is to read in chunks and then convert each chunk to sparse. A little bit of a pain but you'd never have the full uncompressed dataset in memory that way. – JohnE Aug 11 '15 at 01:51
  • @JohnE Yeah, I'm afraid that I'm going to have to do it that way. :-/ – Josephine Moeller Aug 11 '15 at 01:52
  • @JohnMoeller In my experience, pandas tends to behave pretty badly with dataframes full of floating point 0s. It eats a lot of memory. You may try `genfromtxt` from numpy to see if it's any better. – baloo Oct 02 '17 at 21:49
  • @baloo That's...not what I'm asking. I'm trying to avoid having the zeros in memory in the first place. `genfromtxt` produces a dense matrix, which defeats the purpose of what I'm trying to do. – Josephine Moeller Oct 03 '17 at 23:24
  • @JohnMoeller Sorry, that was just a suggestion, because I don't know enough about numpy. However, I am sure from first-hand experience (similar to yours) that pandas handles lots of zeros very badly memory-wise, especially with floating-point values. – baloo Oct 04 '17 at 20:00
  • @JohnMoeller this is a very good question - did you solve it yourself in the end? If so, how? – famargar Jan 03 '18 at 12:19
  • @JohnE it would be great if you could code your answer. It sounds like the best way to go, but many users (including me) would not know where to start! – famargar Jan 03 '18 at 12:24
  • `np.genfromtxt` is about 5x-10x slower than `pd.read_csv`, fwiw, on the example code. about 10 sec for pandas and 60 sec for numpy. (I reduced the size of the csv by 100x and the proportional time difference remained the same. Specifying the numpy data types didn't improve the speed.) – Evan Jan 03 '18 at 17:33

2 Answers

26

I would probably address this by using dask to load your data in a streaming fashion. For example, you can create a dask dataframe as follows:

import dask.dataframe as ddf
data = ddf.read_csv('test.csv')

This data object hasn't actually done anything at this point; it just contains a "recipe" of sorts to read the dataframe from disk in manageable chunks. If you want to materialize the data, you can call compute():

df = data.compute().reset_index(drop=True)

At this point, you have a standard pandas dataframe (we call reset_index because by default each partition is independently indexed). The result is equivalent to what you get by calling pd.read_csv directly:

df.equals(pd.read_csv('test.csv'))
# True

The benefit of dask is you can add instructions to this "recipe" for constructing your dataframe; for example, you could make each partition of the data sparse as follows:

data = data.map_partitions(lambda part: part.to_sparse(fill_value=0))

At this point, calling compute() will construct a sparse dataframe:

df = data.compute().reset_index(drop=True)
type(df)
# pandas.core.sparse.frame.SparseDataFrame

Profiling

To check how the dask approach compares to the raw pandas approach, let's do some line profiling. I'll use lprun and mprun, as described in the "Profiling and Timing Code" section of the Python Data Science Handbook (full disclosure: that's a section of my own book).

Assuming you're working in the Jupyter notebook, you can run it this way:

First, create a separate file with the basic tasks we want to do:

%%file dask_load.py

import numpy as np
import pandas as pd
import dask.dataframe as ddf

def compare_loads():
    df = pd.read_csv('test.csv')
    df_sparse = df.to_sparse(fill_value=0)

    df_dask = ddf.read_csv('test.csv', blocksize=10E6)
    df_dask = df_dask.map_partitions(lambda part: part.to_sparse(fill_value=0))
    df_dask = df_dask.compute().reset_index(drop=True)

Next let's do line-by-line profiling for computation time:

%load_ext line_profiler

from dask_load import compare_loads
%lprun -f compare_loads compare_loads()

I get the following result:

Timer unit: 1e-06 s

Total time: 13.9061 s
File: /Users/jakevdp/dask_load.py
Function: compare_loads at line 6

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     6                                           def compare_loads():
     7         1      4746788 4746788.0     34.1      df = pd.read_csv('test.csv')
     8         1       769303 769303.0      5.5      df_sparse = df.to_sparse(fill_value=0)
     9                                           
    10         1        33992  33992.0      0.2      df_dask = ddf.read_csv('test.csv', blocksize=10E6)
    11         1         7848   7848.0      0.1      df_dask = df_dask.map_partitions(lambda part: part.to_sparse(fill_value=0))
    12         1      8348217 8348217.0     60.0      df_dask = df_dask.compute().reset_index(drop=True)

We see that about 60% of the time is spent in the dask call, while about 40% of the time is spent in the pandas call for the example array above. This tells us that dask is about 50% slower than pandas for this task: this is to be expected, because the chunking and recombining of data partitions leads to some extra overhead.

Where dask shines is in memory usage: let's use mprun to do a line-by-line memory profile:

%load_ext memory_profiler
%mprun -f compare_loads compare_loads()

The result on my machine is this:

Filename: /Users/jakevdp/dask_load.py

Line #    Mem usage    Increment   Line Contents
================================================
     6     70.9 MiB     70.9 MiB   def compare_loads():
     7    691.5 MiB    620.6 MiB       df = pd.read_csv('test.csv')
     8    828.8 MiB    137.3 MiB       df_sparse = df.to_sparse(fill_value=0)
     9                             
    10    806.3 MiB    -22.5 MiB       df_dask = ddf.read_csv('test.csv', blocksize=10E6)
    11    806.4 MiB      0.1 MiB       df_dask = df_dask.map_partitions(lambda part: part.to_sparse(fill_value=0))
    12    947.9 MiB    141.5 MiB       df_dask = df_dask.compute().reset_index(drop=True)

We see that the final sparse dataframe is about 140 MB, but pandas uses about 620 MB along the way as it reads the data into a temporary dense object.

On the other hand, dask uses only about 140 MB in total to load the data and construct the final sparse result. If you are reading data whose dense size is comparable to the memory available on your system, dask has a clear advantage, despite the roughly 50% slower computation time.


But for working with large data, you should not stop here. Presumably you're doing some operations on your data, and the dask dataframe abstraction allows you to do those operations (i.e. add them to the "recipe") before ever materializing the data. So if what you're doing with the data involves arithmetic, aggregations, grouping, etc. you don't even need to worry about the sparse storage: just do those operations with the dask object, call compute() at the end, and dask will take care of applying them in a memory efficient way.

So, for example, I could compute the max() of each column using the dask dataframe, without ever having to load the whole thing into memory at once:

>>> data.max().compute()
x      5.38114
y      5.33796
z      5.25661
txt          j
dtype: object
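
For instance, a grouped aggregation can be added to the "recipe" and computed the same way. A sketch, using the txt column from the sample data above (the groupby follows the same pattern as in pandas):

data.groupby('txt')[['x', 'y', 'z']].mean().compute()   # per-category means, computed partition by partition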

Working with dask dataframes directly will allow you to circumvent worries about data representation, because you'll likely never have to load all the data into memory at once.

Best of luck!

jakevdp
  • Thanks much for this answer! I should note that I found it really hard to get consistent results on the memory timings. I went back to the relevant section in your book ("profiling and timing code" in Data Science Handbook) -- free plug ;-) and mostly played around with %memit but again, just couldn't get really consistent results (I'll elaborate more in my answer) – JohnE Jan 09 '18 at 20:10
  • memit in the notebook is a bit finicky – you need to restart the kernel before profiling the same function a second time or else you'll get strange results. – jakevdp Jan 09 '18 at 20:13
  • OK, I was actually doing it in the qt console if that matters. I'll try to play around with this later when I have enough time to reset the kernel and such. – JohnE Jan 09 '18 at 20:23
  • How can I iterate over rows in dask's dataframe? – aviral sanjay Jan 11 '19 at 14:57
  • to_sparse doesn't seem to be a thing anymore – Fr4nc3sc0NL May 24 '21 at 20:59
11

Here's an answer offered mainly as a benchmark. Hopefully there are better ways than this.

import pandas as pd

chunksize = 1000000       # perhaps try some different values here?
chunks = pd.read_csv('test.csv', chunksize=chunksize, dtype={'txt': 'category'})
sdf = pd.concat([chunk.to_sparse(fill_value=0.0) for chunk in chunks])

As @acushner notes, you could instead do this as a generator expression:

sdf = pd.concat(chunk.to_sparse(fill_value=0.0) for chunk in chunks)

There seems to be a consensus that this is better than the list comp approach, although in my testing I didn't see any large difference; perhaps you would with different data.

I was hoping to report some memory profiling of the various methods, but I struggled to get consistent results, I suspect because Python is always cleaning up memory behind the scenes, which adds some random noise to the results. (In a comment on Jake's answer, he suggests restarting the Jupyter kernel before each %memit to get more consistent results, but I have not yet tried that.)

But I did consistently find (using %%memit) that the chunked read above and @jakevdp's dask method both used something very roughly in the neighborhood of half the memory of the naive method in the OP. For more on profiling, check out the "Profiling and Timing Code" section of Jake's book, "Python Data Science Handbook".
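
For reference, the measurement looked roughly like this (a sketch of the cell I ran, not an exact transcript; run %load_ext memory_profiler in a prior cell first):

%%memit
chunks = pd.read_csv('test.csv', chunksize=1000000, dtype={'txt': 'category'})
sdf = pd.concat(chunk.to_sparse(fill_value=0.0) for chunk in chunks)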

JohnE
  • Just as an FYI, you can make an empty `DataFrame` and then append to it, saving you from having to have all of the chunks in memory at the same time. – Josephine Moeller Jan 04 '18 at 14:28
  • @JohnMoeller that's not a great idea. every time you append to a dataframe, it has to reallocate the entire dataset to make it contiguous. on another note, use a gen expr instead of a list comp in the `concat` call. – acushner Jan 04 '18 at 14:45
  • gotcha, i mean, conceptually and pythonically, it makes more sense to use a gen expr here. if you use a list comp, you're creating a list and then immediately throwing it away. plus, a gen expr is cleaner. your answer's good, no reason for me to add one of mine, i would just use the gen expr for the reasons above. – acushner Jan 05 '18 at 20:34
  • @johnmoeller one other piece of info: if a gen expr is the sole argument to a function call, it doesn't need the extra set of parens (e.g. `sum(i for i in range(10))`) – acushner Jan 08 '18 at 15:38