
I am working with a system that currently operates on large (>5 GB) .csv files. To increase performance, I am testing (A) different methods to create dataframes from disk (pandas vs. dask), as well as (B) different ways to store results to disk (.csv vs. HDF5 files).

In order to benchmark performance, I did the following:

import gc

import dask.dataframe as dd
import pandas as pd

def dask_read_from_hdf():
    results_dd_hdf = dd.read_hdf('store.h5', key='period1', columns=['Security'])
    analyzed_stocks_dd_hdf = results_dd_hdf.Security.unique()

def pandas_read_from_hdf():
    results_pd_hdf = pd.read_hdf('store.h5', key='period1', columns=['Security'])
    analyzed_stocks_pd_hdf = results_pd_hdf.Security.unique()

def dask_read_from_csv():
    results_dd_csv = dd.read_csv(results_path, sep=",", usecols=[0], header=1, names=["Security"])
    analyzed_stocks_dd_csv = results_dd_csv.Security.unique()

def pandas_read_from_csv():
    results_pd_csv = pd.read_csv(results_path, sep=",", usecols=[0], header=1, names=["Security"])
    analyzed_stocks_pd_csv = results_pd_csv.Security.unique()

print "dask hdf performance"
%timeit dask_read_from_hdf()
gc.collect()
print""
print "pandas hdf performance"
%timeit pandas_read_from_hdf()
gc.collect()
print""
print "dask csv performance"
%timeit dask_read_from_csv()
gc.collect()
print""
print "pandas csv performance"
%timeit pandas_read_from_csv()
gc.collect()

My findings are:

dask hdf performance
10 loops, best of 3: 133 ms per loop

pandas hdf performance
1 loop, best of 3: 1.42 s per loop

dask csv performance
1 loop, best of 3: 7.88 ms per loop

pandas csv performance
1 loop, best of 3: 827 ms per loop

If HDF5 storage can be accessed faster than .csv, and if dask creates dataframes faster than pandas, why is dask-from-HDF5 slower than dask-from-csv? Am I doing something wrong?

When does it make sense, performance-wise, to create dask dataframes from HDF5 stores?


1 Answer

HDF5 is most efficient when working with numerical data. I'm guessing you are reading a single string column here, which is its weak point.

Performance of string data with HDF5 can be dramatically improved by using a Categorical to store your strings, assuming relatively low cardinality (i.e., a high number of repeated values); see the sketch below.
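For illustration, here is a minimal sketch of that idea, reusing results_path, the Security column, and the store.h5/period1 names from the question:

import pandas as pd

df = pd.read_csv(results_path, usecols=[0], header=1, names=["Security"])

# Convert the string column to a Categorical before writing: the HDF5
# file then stores integer codes plus a small table of unique strings,
# instead of one variable-length string per row. Categoricals require
# the 'table' format.
df["Security"] = df["Security"].astype("category")
df.to_hdf("store.h5", key="period1", format="table")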

It's from a little while back, but there is a good blog post here that goes through exactly these considerations: http://matthewrocklin.com/blog/work/2015/03/16/Fast-Serialization

You may also look at using parquet: it is similar to HDF5 in that it is a binary format, but it is column-oriented, so a single-column selection like this one will likely be faster. See the sketch below.
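As a rough sketch of the dask side, assuming the same data has been written out to a hypothetical store.parquet file:

import dask.dataframe as dd

# Column-oriented storage means only the 'Security' column is read
# from disk; the other columns are never touched.
results_dd_pq = dd.read_parquet("store.parquet", columns=["Security"])
analyzed_stocks_dd_pq = results_dd_pq.Security.unique().compute()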

Recently (2016-2017) there has been significant work to implement a fast native parquet -> pandas reader, and the next major release of pandas (0.21) will have to_parquet and pd.read_parquet functions built in.
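With those functions, that would look something like the sketch below (store.parquet is a hypothetical file name; until 0.21 is released you can get the same effect by calling pyarrow or fastparquet directly):

import pandas as pd

df = pd.DataFrame({"Security": ["AAPL", "MSFT", "AAPL"]})

# Write once (requires pyarrow or fastparquet to be installed)...
df.to_parquet("store.parquet")

# ...then read back only the column you need.
securities = pd.read_parquet("store.parquet", columns=["Security"])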

https://arrow.apache.org/docs/python/parquet.html

https://fastparquet.readthedocs.io/en/latest/

https://matthewrocklin.com/blog/work/2017/06/28/use-parquet
