I have an HDF5 generator that yields data like this:
for element_i in range(n_elements):
    img = f['data'][:].take(indices=element_i, axis=element_axis)
    yield img, label, weights
I use slicing because h5py doesn't seem to provide another way of reading (please correct me if I'm wrong), and I do it via f['data'][:].take(...) because I want the slicing axis to be dynamic and don't know how to do "classic" slicing (f['data'][:, :, element_i, :, :]) with a dynamic axis.
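For illustration, what I imagine is a dynamically built index tuple, something like the sketch below. Whether h5py actually reads this lazily is exactly what I don't know:

# Sketch of the dynamic-axis indexing I'm after (untested idea):
# put slice(None), i.e. ':', on every axis except element_axis.
idx = [slice(None)] * f['data'].ndim
idx[element_axis] = element_i
img = f['data'][tuple(idx)]  # does this avoid reading the full dataset?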
But this is awfully slow! I don't even know exactly what happens, because read times fluctuate heavily, but I assume that for every element_i the whole dataset data is read completely, and sometimes it happens to still be cached while other times it is not.
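(For reference, each per-read time below is measured with roughly this pattern; a simplified sketch, not my exact timing code:)

import time

t0 = time.perf_counter()
img = f['data'][:].take(indices=element_i, axis=element_axis)
print(f"{img.shape} image read - Elapsed time: {time.perf_counter() - t0:.4f} s")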
I came up with "cache_full_file" (see full code below) and this kind of solves it:
cache_full_file = False
>> reading 19 elements/rows (poked data with (19, 4, 1024, 1024))
(4, 1024, 1024) image read - Elapsed time: 6.5959 s # every single read can take long
(4, 1024, 1024) image read - Elapsed time: 28.0695 s
(4, 1024, 1024) image read - Elapsed time: 0.6851 s
(4, 1024, 1024) image read - Elapsed time: 3.3492 s
(4, 1024, 1024) image read - Elapsed time: 0.5837 s
(4, 1024, 1024) image read - Elapsed time: 1.0346 s
(4, 1024, 1024) image read - Elapsed time: 2.5852 s
(4, 1024, 1024) image read - Elapsed time: 18.7262 s
(4, 1024, 1024) image read - Elapsed time: 19.1674 s # ...
cache_full_file = True
>> reading 19 elements/rows (poked data with (19, 4, 1024, 1024))
(4, 1024, 1024) image read - Elapsed time: 15.8334 s # dataset is read and cached once
(4, 1024, 1024) image read - Elapsed time: 0.0744 s # following reads are all fast ...
(4, 1024, 1024) image read - Elapsed time: 0.0558 s # ...
But I can't rely on full files/datasets fitting into memory!
Can I do a "lazy" read that takes a slice out of an HDF5 dataset without reading the full dataset?
A simplified version of the class is:
import h5py

class hdf5_generator:
    def __init__(self, file, repeat):
        self.file = file

    def __call__(self):
        # element_axis, cache_full_file, label and weights are defined
        # elsewhere in the full code; this is a simplified version.
        with h5py.File(self.file, 'r') as f:
            # poke the first dataset to get the number of expected elements
            n_elements = f['data'].shape[element_axis]
            if cache_full_file:
                img_eles = f['data'][:]  # read and keep the whole dataset in memory
                for element_i in range(n_elements):
                    img = img_eles.take(indices=element_i, axis=element_axis)
                    yield img, label, weights
            else:
                for element_i in range(n_elements):
                    # access a specific element/row of the dataset
                    img = f['data'][:].take(indices=element_i, axis=element_axis)
                    yield img, label, weights
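For completeness, the generator is consumed roughly like this (the file path and repeat value are placeholders):

gen = hdf5_generator('/path/to/file.h5', repeat=False)
for img, label, weights in gen():
    ...  # hand the sample to the training loop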