I'm looking to experiment a bit with bcolz and see if it is compatible with what I need to do. I have a dataset consisting of about 11 million rows and about 120 columns. This data is currently stored in PyTables "table" format in an HDF5 file. The data is divided into several "groups" (separate nodes) in the HDF5 file, each containing different columns.
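For concreteness, this is roughly how I inspect the layout (output abridged, and only two of the groups are shown):

>>> import tables
>>> h5 = tables.open_file('census.h5', mode='r')
>>> [node._v_pathname for node in h5.walk_nodes('/', 'Table')]
['/basic/table', '/political/table', ...]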
What I want to do is convert all of this data into an on-disk bcolz ctable, without reading it all into memory at once. I was able to do that for the first group (basic is the name of one of the groups) like this:
bcolz.ctable.fromhdf5('census.h5', '/basic/table', rootdir='census')
When I did this, memory usage stayed low, which suggests it was not reading the entire table in at once. Great! However, if I try the same thing for the next group, appending to the same ctable, I get an error:
>>> bcolz.ctable.fromhdf5('census.h5', '/political/table', rootdir='census', mode='a')
Traceback (most recent call last):
  File "<pyshell#34>", line 1, in <module>
    bcolz.ctable.fromhdf5('census.h5', '/political/table', rootdir='census', mode='a')
  File "C:\FakeProgs\Python27\lib\site-packages\bcolz\ctable.py", line 714, in fromhdf5
    ct = ctable(cols, names, **kwargs)
  File "C:\FakeProgs\Python27\lib\site-packages\bcolz\ctable.py", line 205, in __init__
    "You cannot pass a `columns` param in 'a'ppend mode.\n"
ValueError: You cannot pass a `columns` param in 'a'ppend mode.
(If you are trying to create a new ctable, perhaps the directory exists already.)
Yes, of course the directory exists already. One of the advertised advantages of bcolz is that it is easy to add new columns. How can I take advantage of that to add new columns from an existing HDF5 file directly to an existing on-disk ctable, without reading all of the new columns into memory first?
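The closest workaround I can think of is to do the chunking myself: read each new column from the HDF5 table one slice at a time, append the slices to a carray, and then attach it with addcol. A rough sketch of what I mean (untested; the chunk size is an arbitrary guess, and I'm assuming bcolz.open is the right way to reopen the existing ctable):

import numpy as np
import tables
import bcolz

ct = bcolz.open('census', mode='a')           # the existing on-disk ctable
h5 = tables.open_file('census.h5', mode='r')
tbl = h5.get_node('/political/table')

chunklen = 2 ** 16                            # rows per slice; arbitrary choice

for colname in tbl.colnames:
    # Empty carray with this column's dtype, filled one slice at a time.
    col = bcolz.zeros(0, dtype=tbl.coldtypes[colname])
    for start in range(0, tbl.nrows, chunklen):
        stop = min(start + chunklen, tbl.nrows)
        # Read only this slice of this one column into memory.
        col.append(tbl.read(start, stop, field=colname))
    ct.addcol(col, name=colname)              # attach as a new ctable column

ct.flush()
h5.close()

Even this feels wrong, though: each intermediate carray sits in memory (compressed) until addcol copies it into the ctable's directory. Is there a more direct way to stream the new columns straight from the HDF5 file into the existing on-disk ctable?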