
I would like to dump a keyed 500GB-800GB table into HDF5, and then retrieve rows matching specific keys.

As far as I can tell, HDF5 data access uses an integer "row" number, so it seems like I would have to implement a "key to row number" map outside of HDF5.

Would this work? Would I need to load the entire HDF5 file into memory (RAM)?

Can anyone give me some sense of how poorly HDF5 would perform in this situation? With decent indexing, this is just a huge dictionary, right?

Should I be using something else?

  • PyTables supports indexes on columns (see [here](http://www.pytables.org/usersguide/optimization.html#indexed-searches)). Using `numexpr` with the OPSI index should be fast enough and shouldn't require reading everything into memory – Ümit Jul 11 '16 at 08:01
  • @Ümit Thanks for this. This is actually a pretty great idea. Could you provide a quick example below? e.g. make a table, put OPSI indices on a handful of columns with PyTables, use numexpr to show how this is done, etc. It's easier to follow, and useful for StackExchange – ShanZhengYang Jul 11 '16 at 10:59

1 Answer


Assume that you have defined this record type in PyTables:

class Record(tables.IsDescription):
    row = tables.Int32Col()
    col1 = tables.Int32Col()
    col2 = tables.Float64Col()
    col3 = tables.Float64Col()
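
For context, here is a minimal sketch of how such a table could be created and filled with PyTables. The file name and table name are hypothetical placeholders, not from the question:

import tables

# Open a new HDF5 file and create a table from the Record description
h5file = tables.open_file('data.h5', mode='w')        # hypothetical file name
table = h5file.create_table('/', 'records', Record, "Example table")

# Fill the table row by row (for 500GB+ of data you would append in large
# batches and call flush() periodically)
rec = table.row
for i in range(1000):
    rec['row'] = i
    rec['col1'] = i
    rec['col2'] = float(i)
    rec['col3'] = 2.0 * i
    rec.append()
table.flush()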

A regular range query might look like this:

result = [rec for rec in table if (rec['row'] > 100 and rec['row'] < 200)]

This works fine if your table is not too big. However, for large tables it will be relatively slow, because each row has to be brought into Python space to be evaluated against the range condition.

To speed up this query you can rely on so-called in-kernel queries, which evaluate the condition inside the PyTables kernel, written in C, with the help of the numexpr library:

result = [rec for rec in table.where(
            '(row > 100) & (row < 200)')]

You can also mix and match a regular query with an in-kernel query:

result = [rec for rec in table.where(
            '(row > 100) & (row < 200)') if your_function(rec['col2'])]
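
If you only need the matching rows as a NumPy structured array rather than iterating over Row objects, Table.read_where() evaluates the same kind of in-kernel condition and returns the result in one call; a small sketch:

# Returns a NumPy structured array with all rows matching the condition
result = table.read_where('(row > 100) & (row < 200)')

# External variables can be passed into the condition via condvars
low, high = 100, 200
result = table.read_where('(row > low) & (row < high)',
                          condvars={'low': low, 'high': high})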

For large tables that don't fit into memory, the speedup is around 2x. Using compression (e.g. Blosc, LZF) will give you a further slight speed bump, because the CPU overhead of decompression is smaller than the I/O overhead (so use compression for huge tables that don't fit into memory).
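
In PyTables you enable compression by passing a Filters object when the table is created; a minimal sketch, reusing the hypothetical h5file handle from the sketch above and an arbitrary compression level:

# Blosc compression at a moderate compression level
filters = tables.Filters(complevel=5, complib='blosc')

ctable = h5file.create_table('/', 'records_compressed', Record,
                             "Compressed table", filters=filters)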

When you use compression, the dataset is split into chunks and each chunk is compressed separately. That means that if you query a specific range (e.g. rows 100-200), only the corresponding compressed chunks are loaded from disk into memory and then decompressed by the CPU. This speeds things up compared to using no compression or storing the dataset contiguously. Blosc is a meta-compressor, and LZF is the compression filter that ships with h5py. For the differences between Blosc and LZF, see this thread.
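
Both PyTables and h5py also let you choose the chunk size explicitly; a brief sketch, reusing the handles from the sketches above (the chunk sizes, dataset shapes and file names here are arbitrary examples):

# PyTables: explicit chunk shape (rows per chunk) combined with compression
ctable = h5file.create_table('/', 'records_chunked', Record, "Chunked table",
                             filters=filters, chunkshape=(65536,))

import h5py
import numpy as np

# h5py: chunked dataset using the LZF filter that ships with h5py
with h5py.File('data_h5py.h5', 'w') as f:
    f.create_dataset('values', shape=(10000000,), dtype=np.float64,
                     chunks=(65536,), compression='lzf')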

If the in-kernel queries are still not fast enough, you can also create an index on one or more columns. Then the query will use a binary search instead of a sequential scan. To create an index on the row column of an existing table, just run:

indexrows = table.cols.row.create_index()

But beware that the index won't be used for every condition (see the reference below). To check whether your query actually uses the index, you can use the Table.will_query_use_indexing() method.
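
For example, a quick sketch of checking whether the condition from the earlier queries can use the new index:

# frozenset of the columns whose index can be used for this condition
# (empty if the condition cannot use indexing)
used = table.will_query_use_indexing('(row > 100) & (row < 200)')
print(used)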

Source: http://www.pytables.org/usersguide/optimization.html#indexed-searches

  • "Using compression (e.g. Blosc, LZF) will give you a further slight speed bump, because the CPU overhead of decompression is smaller than the I/O overhead (so use compression for huge tables that don't fit into memory)." I'm not sure I follow this. 500 GB won't fit into RAM. Is there a difference in performance between Blosc and LZF? Could you sketch out above how you use data compression? (Sorry, just need a few more details to make the post comprehensive. Thanks!) – ShanZhengYang Jul 11 '16 at 13:24
  • Also, is there a difference in terms of BLOSC vs LZF? Thank you for any help! – ShanZhengYang Jul 11 '16 at 15:45
  • Question: What if your queries are not "sequential" but random access? Does this scheme still work? That is, imagine if your query needs to access several chunks. Is this still feasible? – ShanZhengYang Jul 11 '16 at 18:20
  • Random access will in fact work better with a chunked (compressed) layout than with a contiguous layout – Ümit Jul 11 '16 at 22:00
  • Could you explain this? Even if the query needs to access multiple chunks? Is this because with a contiguous layout a random access query would need to access the entire dataset? – ShanZhengYang Jul 11 '16 at 22:12
  • You can also do random access with contiguous dataset layout, however you can't do compression (compression requires chunking). Also chunking is more important when you have a multi-dimensional dataset (for a vector it might not make a big difference). For chunking refer to these links: https://www.hdfgroup.org/HDF5/doc1.6/Chunking.html https://www.hdfgroup.org/HDF5/doc/Advanced/Chunking/ https://www.hdfgroup.org/training/HDFtraining/UsersGuide/Perform.fm2.html http://www.unidata.ucar.edu/blogs/developer/entry/chunking_data_why_it_matters – Ümit Jul 12 '16 at 07:52
  • I'm wondering if I can tell PyTables to use binary search on a specific column, because I have ensured that the data is sorted on this column (in a time-series database, the time column will typically be sorted from the start)? If not, and I have to build an index every time I append some new samples, is the sorting algorithm smart enough to basically terminate immediately, because no sorting work is left? Thanks in advance! – Zappel Jul 25 '17 at 20:23