
I have a CSV dataset that stores lists of values as strings in a single field; it looks more or less like this:

Id,sequence
1,'1;0;2;6'
2,'0;1'
3,'1;0;9'

In the real dataset I'm dealing with, the sequence lengths vary greatly, from a single observation up to a few thousand. There are many columns containing such sequences, all stored as strings.

I'm reading those CSVs and parsing the strings into lists nested inside a Pandas DataFrame, roughly as in the sketch below. This takes some time, but I'm OK with it.
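
For reference, a minimal sketch of that parsing step, assuming the toy layout above (single-quoted fields, `;` as the inner separator; the column names come from the example, not the real data):

import pandas as pd

# Read the toy CSV (sequence fields are single-quoted) and split each
# sequence string into a list of ints.
df = pd.read_csv('data.csv', quotechar="'")
df['sequence'] = df['sequence'].str.split(';').map(lambda xs: [int(x) for x in xs])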

However, when I later save the parsed results to a pickle file, reading that pickle back is very slow.

I'm facing the following:

  • Reading a raw ~600 MB CSV file with this structure into Pandas takes ~3 seconds.
  • Reading the same (raw, unprocessed) data from pickle takes ~0.1 seconds.
  • Reading the processed data from pickle takes 8 seconds!

I'm trying to find a way to read processed data from disk in the quickest possible way.

Already tried:

  • Experimenting with different storage formats, but most of them can't store nested structures. The only one that worked was msgpack, and it didn't improve performance much (though see the Parquet sketch after this list).
  • Using structures other than a Pandas DataFrame (like a tuple of tuples); the performance was similar.
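
For what it's worth, Parquet with the pyarrow engine (which the comments below also point at) is one format that can round-trip list-valued columns; a minimal sketch, assuming the lists hold plain ints and with 'parsed.parquet' as a placeholder path:

import pandas as pd

# Round-trip a DataFrame with a list-valued column through Parquet (pyarrow).
df = pd.DataFrame({'Id': [1, 2], 'sequence': [[1, 0, 2, 6], [0, 1]]})
df.to_parquet('parsed.parquet', engine='pyarrow')
back = pd.read_parquet('parsed.parquet', engine='pyarrow')
print(back['sequence'][0])  # the nested values survive the round trip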

I'm not tied to any particular data structure. The point is that I'd like to read the parsed data from disk into Python as quickly as possible.

Mikolaj
  • Have you tried any of the [other formats that Pandas supports](http://pandas.pydata.org/pandas-docs/stable/api.html#serialization-io-conversion)? – Martijn Pieters Jan 07 '19 at 17:03
  • Quick googling around shows others have put in the work to assess different options, see http://matthewrocklin.com/blog/work/2015/03/16/Fast-Serialization, https://tech.blue-yonder.com/efficient-dataframe-storage-with-apache-parquet/ and https://ray-project.github.io/2017/10/15/fast-python-serialization-with-ray-and-arrow.html for examples; the current recommendation appears to be to [use pyarrow](https://arrow.apache.org/docs/python/ipc.html#serializing-pandas-objects). – Martijn Pieters Jan 07 '19 at 17:06
  • "I'm reading those CSV's and parsing strings to become lists nested inside Pandas DataFrame." Doesn't sound like a good use-case for pandas. Note, a lot of the high-performance serialization libraries for pandas won't work with `dtype=object`, at least last time I checked – juanpa.arrivillaga Jan 07 '19 at 17:35
  • could you "unnest" the data structure? i.e. repeat the "Id" value for every "sequence" value. having both columns as plain `int`s would make loading/saving much faster, but should obviously be traded off against other complexity… – Sam Mason Jan 07 '19 at 18:52
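
To make that last suggestion concrete, a minimal sketch of the unnesting (using DataFrame.explode, available since pandas 0.25; the toy frame stands in for the real data):

import pandas as pd

# Toy frame with an already-parsed list column.
df = pd.DataFrame({'Id': [1, 2, 3],
                   'sequence': [[1, 0, 2, 6], [0, 1], [1, 0, 9]]})

# Repeat each Id for every element of its sequence, leaving two plain int columns.
long_df = df.explode('sequence').astype({'sequence': 'int64'})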

1 Answer


This might be a duplicate of this question.

HDF5 is quite a bit quicker at handling nested Pandas DataFrames; I would give that a shot.

An example usage borrowed from here shows how you can write the frame to an HDF5 store and then select it back in chunks:

import numpy as np
import pandas as pd

# Write a sample frame to an HDF5 store in 'table' format so it can be
# selected back by row range.
df = pd.DataFrame(np.random.randn(1000, 2), columns=list('AB'))
df.to_hdf('test.h5', 'df', mode='w', format='table', data_columns=True)

store = pd.HDFStore('test.h5')
nrows = store.get_storer('df').nrows
chunksize = 100
for i in range(nrows // chunksize + 1):
    chunk = store.select('df',
                         start=i * chunksize,
                         stop=(i + 1) * chunksize)
    # ...process each chunk here...
store.close()

When reading it back, you can do it in chunks like this, too:

for df in pd.read_hdf('raw_sample_storage2.h5', 'raw_sample_all',
                      start=0, stop=300000, chunksize=3000):
    df.info()          # info() prints its own summary
    print(df.head(5))
Daniel Scott