I have a large tick data file (one day is ~60 GB uncompressed) that I want to load into bcolz. My plan is to read the file chunk by chunk and append each chunk to a bcolz ctable; a sketch of what I have in mind is below the sample rows.
As far as I know, bcolz only supports appending columns, not rows. Tick data, however, is more row-wise than column-wise, I would say. For instance:
0 ACTX.IV 0 13.6316 2016-09-26 03:45:00.846 ARCA 66
1 ACWF.IV 0 23.9702 2016-09-26 03:45:00.846 ARCA 66
2 ACWV.IV 0 76.4004 2016-09-26 03:45:00.846 ARCA 66
3 ALTY.IV 0 15.5851 2016-09-26 03:45:00.846 ARCA 66
4 AMLP.IV 0 12.5845 2016-09-26 03:45:00.846 ARCA 66
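Here is a minimal sketch of the chunked load I have in mind. The file name `ticks.csv`, the column names, the separator, and the chunk size are all placeholders, and I've glossed over the fact that string columns may need converting to fixed-width bytes first:

```python
import bcolz
import pandas as pd

# Placeholder column layout, inferred from the sample rows above.
cols = ['symbol', 'field', 'price', 'timestamp', 'exchange', 'size']

ct = None
for chunk in pd.read_csv('ticks.csv', names=cols,
                         parse_dates=['timestamp'], chunksize=1000000):
    if ct is None:
        # First chunk: create the on-disk ctable.
        ct = bcolz.ctable.fromdataframe(
            chunk, rootdir='ticks.bcolz', mode='w',
            cparams=bcolz.cparams(clevel=1))
    else:
        # Is this the right way to append rows, or does bcolz
        # really only support adding whole columns?
        ct.append([chunk[c].values for c in cols])
ct.flush()
```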
- Does anyone have any suggestions on how to do this?
- And is there any guidance on which compression level to choose when using bcolz? I'm more concerned about later query speed than about size. (I'm asking because, as the benchmark linked below shows, a level-1-compressed bcolz ctable actually has better query speed than an uncompressed one, so my guess is that query speed is not a monotonic function of compression level.) Reference: http://nbviewer.jupyter.org/github/Blosc/movielens-bench/blob/master/querying-ep14.ipynb
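For what it's worth, this is roughly how I plan to compare query speed across compression levels myself (synthetic data as a stand-in for the real ticks; the sizes, query, and clevel values are arbitrary):

```python
import time
import bcolz
import numpy as np

# Synthetic stand-in for one day of ticks.
n = 10 ** 7
base = bcolz.ctable([np.random.rand(n), np.random.randint(0, 500, n)],
                    names=['price', 'size'])

for clevel in (0, 1, 3, 5, 9):
    # Re-compress the same data at a different level.
    ct = base.copy(cparams=bcolz.cparams(clevel=clevel))
    t0 = time.time()
    hits = sum(1 for _ in ct.where('price > 0.99'))
    print('clevel=%d  ratio=%.1f  query=%.3fs  hits=%d'
          % (clevel, ct.nbytes / float(ct.cbytes), time.time() - t0, hits))
```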