I'm looking to experiment a bit with bcolz and see if it is compatible with what I need to do. I have a dataset consisting of about 11 million rows and about 120 columns. This data is currently stored in PyTables "table" format in an HDF5 file. The data is divided into several "groups" (separate nodes) in the HDF5 file, each containing different columns.
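For concreteness, this is roughly how I inspect the layout (output abridged, and only two of the groups are shown):

>>> import tables
>>> h5 = tables.open_file('census.h5', mode='r')
>>> [node._v_pathname for node in h5.walk_nodes('/', 'Table')]
['/basic/table', '/political/table', ...]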
What I want to do is convert all of this data into an on-disk bcolz ctable, without reading it all into memory at once. I was able to do that for the first group (basic is the name of one of the groups) like this:
bcolz.ctable.fromhdf5('census.h5', '/basic/table', rootdir='census')
When I did this, memory usage stayed low, which suggests it was not reading the entire table in at once. Great! However, if I try the same thing for the next group, appending to the same ctable, I get an error:
>>> bcolz.ctable.fromhdf5('census.h5', '/political/table', rootdir='census', mode='a')
Traceback (most recent call last):
  File "<pyshell#34>", line 1, in <module>
    bcolz.ctable.fromhdf5('census.h5', '/political/table', rootdir='census', mode='a')
  File "C:\FakeProgs\Python27\lib\site-packages\bcolz\ctable.py", line 714, in fromhdf5
    ct = ctable(cols, names, **kwargs)
  File "C:\FakeProgs\Python27\lib\site-packages\bcolz\ctable.py", line 205, in __init__
    "You cannot pass a `columns` param in 'a'ppend mode.\n"
ValueError: You cannot pass a `columns` param in 'a'ppend mode.
(If you are trying to create a new ctable, perhaps the directory exists already.)
Yes, of course the directory exists already. One of the advertised advantages of bcolz is that it is easy to add new columns. How can I take advantage of that to add new columns from an existing HDF5 file directly to an existing on-disk ctable, without reading all of the new columns into memory first?
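The closest workaround I can think of is to do the chunking myself: read each new column from the HDF5 table one slice at a time, append the slices to a carray, and then attach it with addcol. A rough sketch of what I mean (untested; the chunk size is an arbitrary guess, and I'm assuming bcolz.open is the right way to reopen the existing ctable):

import numpy as np
import tables
import bcolz

ct = bcolz.open('census', mode='a')           # the existing on-disk ctable
h5 = tables.open_file('census.h5', mode='r')
tbl = h5.get_node('/political/table')

chunklen = 2 ** 16                            # rows per slice; arbitrary choice

for colname in tbl.colnames:
    # Empty carray with this column's dtype, filled one slice at a time.
    col = bcolz.zeros(0, dtype=tbl.coldtypes[colname])
    for start in range(0, tbl.nrows, chunklen):
        stop = min(start + chunklen, tbl.nrows)
        # Read only this slice of this one column into memory.
        col.append(tbl.read(start, stop, field=colname))
    ct.addcol(col, name=colname)              # attach as a new ctable column

ct.flush()
h5.close()

Even this feels wrong, though: each intermediate carray sits in memory (compressed) until addcol copies it into the ctable's directory. Is there a more direct way to stream the new columns straight from the HDF5 file into the existing on-disk ctable?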