
I have an HDF5 generator that yields data like this:

for element_i in range(n_elements):
    # f is an open h5py.File; element_axis, label and weights come from elsewhere
    img = f['data'][:].take(indices=element_i, axis=element_axis)
    yield img, label, weights

I do the slicing because h5py doesn't seem to provide a different way of reading (please correct me if I'm wrong), and I do it this way (f['data'][:].take(...)) because I want the slicing axis to be dynamic and don't know how to do "classic" slicing (f['data'][:, :, element_i, :, :]) with a dynamic axis.
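For illustration, I suspect the dynamic-axis version of "classic" slicing could be written by building the index as a tuple of slice objects, but I haven't verified this. A minimal sketch, assuming f and element_axis as above:

# build an index equivalent to f['data'][:, :, element_i, :, :]
# for an arbitrary element_axis
index = tuple(
    element_i if axis == element_axis else slice(None)
    for axis in range(f['data'].ndim)
)
img = f['data'][index]  # indexes the dataset directly, without the [:] copy

If indexing the dataset directly works like that, it should also avoid materialising the whole array the way f['data'][:] does.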

But this is awfully slow! I don't even know exactly what happens, because the read times fluctuate so heavily, but I assume that for every element_i the whole dataset data is read completely, and sometimes by chance it is still cached and sometimes not.
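Part of the fluctuation presumably depends on how the dataset is laid out on disk. The layout can be checked with standard h5py dataset attributes, e.g.:

import h5py

with h5py.File(file, 'r') as f:  # 'file' as passed to the generator below
    dset = f['data']
    print(dset.chunks)       # chunk shape, or None if the dataset is contiguous
    print(dset.compression)  # e.g. 'gzip', or None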

I came up with "cache_full_file" (see full code below) and this kind of solves it:

cache_full_file = False
>> reading 19 elements/rows (poked data with (19, 4, 1024, 1024))
(4, 1024, 1024)  image read - Elapsed time: 6.5959 s            # every single read can take long
(4, 1024, 1024)  image read - Elapsed time: 28.0695 s
(4, 1024, 1024)  image read - Elapsed time: 0.6851 s
(4, 1024, 1024)  image read - Elapsed time: 3.3492 s
(4, 1024, 1024)  image read - Elapsed time: 0.5837 s
(4, 1024, 1024)  image read - Elapsed time: 1.0346 s
(4, 1024, 1024)  image read - Elapsed time: 2.5852 s
(4, 1024, 1024)  image read - Elapsed time: 18.7262 s
(4, 1024, 1024)  image read - Elapsed time: 19.1674 s           # ...


cache_full_file = True
>> reading 19 elements/rows (poked data with (19, 4, 1024, 1024))
(4, 1024, 1024)  image read - Elapsed time: 15.8334 s           # dataset is read and cached once
(4, 1024, 1024)  image read - Elapsed time: 0.0744 s            # following reads are all fast ...      
(4, 1024, 1024)  image read - Elapsed time: 0.0558 s            # ...

But I can't rely on full files/datasets fitting into memory!

Can I do a "lazy" read that takes a slice out of an HDF5 dataset without reading the full dataset?


A simplified version of the class is:

import h5py

class hdf5_generator:
    # cache_full_file, element_axis, label and weights are defined elsewhere
    def __init__(self, file, repeat):
        self.file = file

    def __call__(self):
        with h5py.File(self.file, 'r') as f:
            # poke the first dataset to get the number of expected elements
            n_elements = f['data'].shape[element_axis]

            if cache_full_file:
                img_eles = f['data'][:]  # read and keep the whole dataset in memory
                for element_i in range(n_elements):
                    img = img_eles.take(indices=element_i, axis=element_axis)
                    yield img, label, weights
            else:
                for element_i in range(n_elements):
                    # access a specific element; f['data'][:] copies the full dataset first
                    img = f['data'][:].take(indices=element_i, axis=element_axis)
                    yield img, label, weights
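For completeness, a sketch of how the else branch could look with the dynamic index tuple from above, assuming direct dataset indexing really does read lazily (untested):

# hypothetical non-caching branch: read one hyperslab per element
def __call__(self):
    with h5py.File(self.file, 'r') as f:
        dset = f['data']
        n_elements = dset.shape[element_axis]
        for element_i in range(n_elements):
            index = tuple(
                element_i if axis == element_axis else slice(None)
                for axis in range(dset.ndim)
            )
            yield dset[index], label, weights  # only this element read from disk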
  • It depends on the chunkshape of the dataset. You can also cache chunks that are read and decompressed. Take a look at https://stackoverflow.com/a/48405220/4045774 – max9111 Mar 20 '18 at 10:12
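Following up on that comment: h5py (2.9 and later) exposes the HDF5 chunk-cache parameters as keyword arguments on h5py.File, so decompressed chunks can be kept in memory between reads. A sketch with an illustrative 64 MiB cache:

# enlarge the per-file chunk cache (HDF5 default is 1 MiB) so recently read,
# decompressed chunks are reused across element reads
with h5py.File(file, 'r', rdcc_nbytes=64 * 1024**2) as f:
    img = f['data'][index]  # index built as in the sketch above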
