
I have an HDF5 file with 100 "events". Each event contains a variable number of groups called "traces" (roughly 180), and each trace contains 6 datasets, which are arrays of 32-bit floats, each ~1000 cells long (this varies slightly from event to event, but remains constant within an event). The file was generated with default h5py settings (so no chunking or compression unless h5py applies it on its own).

The readout is not fast: it is ~6 times slower than reading the same data from CERN ROOT TTrees. I know that HDF5 is far from the fastest format on the market, but I would be grateful if you could tell me where the speed is lost.

To read the arrays in traces I do:

    d0keys = data["Run_0"].keys()   # data is the open h5py.File (or its root group)
    for key_1 in d0keys:
        if "Event_" in key_1:
            d1 = data["Run_0"][key_1]
            d1keys = d1.keys()
            for key_2 in d1keys:
                if "Traces_" in key_2:
                    d2 = d1[key_2]
                    v1, v2, v3, v4, v5, v6 = d2['SimSignal_X'][0], d2['SimSignal_Y'][0], d2['SimSignal_Z'][0], d2['SimEfield_X'][0], d2['SimEfield_Y'][0], d2['SimEfield_Z'][0]

Line profiler shows that ~97% of the time is spent in the last line. Now, there are two issues:

  1. It seems there is no difference between reading cell [0] and all ~1000 cells with [:]. I understand that h5py should be able to read just a chunk of the data from disk. Why is there no difference?
  2. Reading 100 events from an HDD (Linux, ext4) takes ~30 s with h5py and ~5 s with ROOT. The size of 100 events is roughly 430 MB. This gives a readout speed of ~14 MB/s with HDF5, while ROOT reaches ~86 MB/s. Both are slow, but ROOT comes much closer to the raw readout speed I would expect from a ~4-year-old laptop HDD.

So where does h5py lose its speed? I would guess that pure readout should run at HDD speed. Thus, is the bottleneck:

  1. Dereferencing the HDF5 address of the dataset (ROOT does not need to do this)?
  2. Allocating memory in Python?
  3. Something else?

I would be grateful for some clues.

  • I thought of something after I wrote my answer. What are `['SimSignal_X'] ... ['SimEfield_Z']`? I assumed they are datasets (keys) in group `d2`. However, if `d2` is a dataset and they are fields in a compound data type, you can read all of the data at once with `arr=d2` (to get a h5py dataset object), or `arr=d2[:]` (to get a numpy recarray). This reduces the number of reads and time spent reading by 6x. This would be comparable with CERN performance. Then, when you need values from a single field, you access as `arr[field_name][:]`. – kcw78 Apr 25 '21 at 16:58
  • In this file d2 is a group and I think it has to remain as such, because the d2 structure may change in the future, but I will keep a compound type in mind. – Lech Wiktor Piotrowski Apr 25 '21 at 19:07
  • For each event, you read 1 number from each of about 1000 separate datasets. If you can control how the file is written, try putting each event in a single dataset, with shape (180, 6, 1000). Then you can read the same 180 x 6 numbers in a single array as `event[:, :, 0]`, which is likely to be much faster. – Thomas K May 28 '21 at 16:28
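
A minimal sketch of the single-dataset layout suggested in the last comment above; the file name and shapes are illustrative assumptions, not values taken from the actual file:

    import h5py
    import numpy as np

    n_traces, n_fields, n_cells = 180, 6, 1000   # rough per-event sizes from the question

    # Writing: one 3-D dataset per event instead of ~180 groups x 6 small datasets
    with h5py.File("events_repacked.h5", "w") as f:           # hypothetical file
        data = np.zeros((n_traces, n_fields, n_cells), dtype="float32")
        f.create_dataset("Run_0/Event_0", data=data)          # intermediate groups are created automatically

    # Reading: cell 0 of all 6 fields for all traces comes back in a single read
    with h5py.File("events_repacked.h5", "r") as f:
        event = f["Run_0/Event_0"]
        first_cells = event[:, :, 0]                          # shape (180, 6)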

1 Answer


There are a lot of HDF5 I/O issues to consider. I will try to cover each.

From my tests, time spent doing I/O is primarily a function of the number of reads/writes and not how much data (in MB) you read/write. Read this SO post for more details: pytables writes much faster than h5py. Why? Note: it shows I/O performance for a fixed amount of data with different I/O write sizes for both h5py and PyTables. Based on this, it makes sense that most of the time is spent in the last line -- that's where you are reading the data from disk to memory as NumPy arrays (v1, v2, v3, v4, v5, v6).
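
To see this directly, one can count the reads while timing the loop from the question. This is only a rough sketch; the file name is hypothetical and the group names are assumed to follow the question's pattern:

    import time
    import h5py

    with h5py.File("events.h5", "r") as f:
        t0 = time.perf_counter()
        n_reads = 0
        for ev_name, ev in f["Run_0"].items():
            if "Event_" not in ev_name:
                continue
            for tr_name, tr in ev.items():
                if "Traces_" not in tr_name:
                    continue
                for field in ("SimSignal_X", "SimSignal_Y", "SimSignal_Z",
                              "SimEfield_X", "SimEfield_Y", "SimEfield_Z"):
                    _ = tr[field][:]        # one full read of a small dataset
                    n_reads += 1
        # ~100 events x ~180 traces x 6 fields => on the order of 100,000 separate reads
        print(n_reads, "reads in", round(time.perf_counter() - t0, 1), "s")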

Regarding your questions:

  1. There's a reason there is no difference between reading d2['SimSignal_X'][0] and d2['SimSignal_X'][:]. Both read the entire dataset into memory (all ~1000 dataset values). If you only want to read a slice of the data, you need to use slice notation. For example, d2['SimSignal_X'][0:100] only reads the first 100 values (assuming d2['SimSignal_X'] has a single axis -- shape=(1000,)). Note: reading a slice will reduce required memory, but won't improve I/O read time. (In fact, reading slices will probably increase read time.) See the first sketch after this list.
  2. I am not familiar with CERN ROOT, so can't comment about performance of h5py vs ROOT. If you want to use Python, here are several things to consider:
    • You are reading the HDF5 data into memory (as a NumPy array). You don't have to do that. Instead of creating arrays, you can create h5py dataset objects. This reduces the I/O initially. Then you use the objects "as-if" they are np.arrays. The only change is your last line of code -- like this: v1, v2, v3, v4, v5, v6 = d2['SimSignal_X'], d2['SimSignal_Y'], d2['SimSignal_Z'], d2['SimEfield_X'], d2['SimEfield_Y'], d2['SimEfield_Z']. Note how the slice notation is not used ([0] or [:]).
    • I have found PyTables to be faster than h5py. Your code can easily be converted to read with PyTables (see the PyTables sketch after this list).
    • HDF5 can use "chunking" to improve I/O performance. You did not say if you are using this. It might help (hard to say since your datasets aren't very large). You have to define this when you create the HDF5 file, so it might not be useful in your case.
    • Also, you did not say whether a compression filter was used when the data was written. This reduces the on-disk file size, but has the side effect of reducing I/O performance (increasing read times). If you don't know, check whether compression was enabled at file creation (the last sketch after this list shows how).
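
A minimal sketch of the slicing and "dataset object" points above, assuming group and dataset names that follow the question's pattern (the file name is hypothetical):

    import h5py

    with h5py.File("events.h5", "r") as f:
        d2 = f["Run_0/Event_0/Traces_0"]

        sig_x = d2['SimSignal_X']    # h5py Dataset object -- nothing is read from disk yet
        head = sig_x[0:100]          # reads only the first 100 values
        full = sig_x[:]              # reads all ~1000 values into a NumPy array

        # keeping the Dataset objects defers the disk reads until they are indexed
        v1, v2, v3 = d2['SimSignal_X'], d2['SimSignal_Y'], d2['SimSignal_Z']
        v4, v5, v6 = d2['SimEfield_X'], d2['SimEfield_Y'], d2['SimEfield_Z']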
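
A possible PyTables version of the same traversal; this is only a sketch, assuming the file layout described in the question (the file name is hypothetical):

    import tables

    with tables.open_file("events.h5", mode="r") as f:
        for event in f.root.Run_0:                   # child groups of Run_0
            if "Event_" not in event._v_name:
                continue
            for traces in event:
                if "Traces_" not in traces._v_name:
                    continue
                # .read() returns the whole array as a NumPy array
                v1 = traces.SimSignal_X.read()
                v2 = traces.SimSignal_Y.read()
                v3 = traces.SimSignal_Z.read()
                v4 = traces.SimEfield_X.read()
                v5 = traces.SimEfield_Y.read()
                v6 = traces.SimEfield_Z.read()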
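
Checking whether chunking or a compression filter were used only takes a couple of attribute look-ups; any dataset will do (the path below is an assumption following the question's naming pattern):

    import h5py

    with h5py.File("events.h5", "r") as f:
        dset = f["Run_0/Event_0/Traces_0/SimSignal_X"]
        print(dset.chunks)         # None -> contiguous storage (no chunking)
        print(dset.compression)    # None -> no compression filter
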
kcw78
  • Thanks for the answer, I understand more now. No chunking and no compression were used - that's what I meant when writing that the file was generated with default h5py settings. I also compare to uncompressed ROOT. So I list my misconceptions here: 1. I thought that [0] would be equivalent to slicing. Basically [0] is equal to [0:1] in numpy, and I thought it was the same in h5py – Lech Wiktor Piotrowski Apr 25 '21 at 19:15
  • 2. I don't understand why slicing would not improve the readout time. There should be a difference between reading 100 MB and 1 GB, unless we start to be CPU bound. Is that the case? 3. I'll try PyTables. 4. I thought that just d2['something'] is a kind of lazy read, and no real read is performed until an index is accessed. Isn't that the case? – Lech Wiktor Piotrowski Apr 25 '21 at 19:19
  • Time to read a 100MB or 1GB dataset into memory is about the same. (I don't know the details why). So, if you read a 1GB dataset as 10x 100MB, it will take about 10x longer. You are correct about d2['something']. No data is read into memory until an index is used. That's the beauty of HDF5. You can have a 20GB dataset, and only read a slice into memory when you need it. – kcw78 Apr 25 '21 at 19:51
  • Hmm, but there should be a real difference between 100 MB and 1 GB. One can't exceed the I/O capabilities of the HDD... So I am puzzled. Also, removing the arrays/indexing makes the code ~3 times faster, but it still takes ~10 s for 100 events. So my bet is that significant time goes into just finding what d2['SimSignal_X'] is. – Lech Wiktor Piotrowski Apr 25 '21 at 20:11
  • Take a look at the graph in the linked answer. It shows how the time to write a fixed amount of data increases with the number of writes (with smaller amount of data written on each write). I don't know why. I suspect the overhead is fixed and the actual write to disk is not the bottleneck. – kcw78 Apr 25 '21 at 22:32
  • It seems many reads are indeed slow. The question is why - I am betting on the HDF5 search, dereferencing, etc. Still, I/O has an impact: the same file read with h5py on my HDD vs SSD took 45 vs 34 s. Moreover, h5py is indeed not optimal: I coded the same readout in C++ using HighFive and got 14 s and 9 s (HDD & SSD). Well, I've learned things. Still, not sure why HDF is slow reading many things :) – Lech Wiktor Piotrowski Apr 28 '21 at 21:52
  • It's not true in general that `d2['SimSignal_X'][0]` will read the whole dataset into memory - it loads the same data as slicing `[0:1]`. For contiguous datasets, it will read only the part you request. For chunked datasets, it will read every chunk which that slice touches (which *could be* the entire dataset, depending on the chunking). But the difference between reading 1 number and 1000 is probably small enough that it doesn't make much difference. – Thomas K May 28 '21 at 16:35