
This question is somewhat related to "Concatenate a large number of HDF5 files".

I have several huge HDF5 files (~20GB compressed) that do not fit in RAM. Each of them stores several pandas.DataFrames of identical format, with indexes that do not overlap.

I'd like to concatenate them into a single HDF5 file with all the DataFrames properly concatenated. One way to do this is to read each file chunk by chunk and append to a single output file, but that would take quite a lot of time. A rough sketch of that approach is below.
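
For reference, this is roughly the chunked approach I mean (a sketch; the file names, the table_name key, and the chunk size are placeholders, and it assumes the frames were saved in table/appendable format):

import pandas as pd

source_paths = ['store_1.h5', 'store_2.h5']  # placeholder source file names
key = 'table_name'                           # placeholder key used in every store
chunksize = 500000                           # rows per chunk; tune to available RAM

with pd.HDFStore('combined.h5', mode='w', complevel=9, complib='blosc') as target:
    for path in source_paths:
        with pd.HDFStore(path, mode='r') as source:
            # select() with chunksize streams the table instead of loading it whole
            for chunk in source.select(key, chunksize=chunksize):
                target.append(key, chunk)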

Are there any special tools or methods to do this without iterating through the files chunk by chunk myself?

Vladimir

1 Answer


See the docs for the odo project (formerly into). Note that if you use the into library, the argument order is reversed (that was the motivation for changing the name, to avoid confusion!).

You can basically do:

from odo import odo
# copy the source table into the target store, appending if the table already exists
odo('hdfstore://path_store_1::table_name',
    'hdfstore://path_store_new_name::table_name')

Repeating this with additional source stores will append to the right-hand-side (target) store.

This will automatically do the chunked reading and writing for you.
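
So to merge several files, you can simply loop and append into the same target (a sketch; the store paths and the table_name key are placeholders):

from odo import odo

source_stores = ['path_store_1', 'path_store_2']  # placeholder source store paths

for path in source_stores:
    # each call appends the source table onto the table in the target store
    odo('hdfstore://{}::table_name'.format(path),
        'hdfstore://path_store_new_name::table_name')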

Jeff
    Awesome; first real-world mention of Blaze I've seen in the wild. – Veedrac Mar 08 '15 at 03:41
  • So Blaze is awesome. However this works strangely for me. After running odo as above, eventually I get a giant stack of stdout 'closing file', I think all mentioning the target store, not the source. Does that sound like a bug or am I missing some pre/post steps? – KobeJohn Mar 04 '16 at 04:57