
I have a data generator that works, but it is extremely slow at reading data from a 200k-image dataset.

I use:

    X = f[self.trName][idx * self.batch_size:(idx + 1) * self.batch_size]

after having opened the file with:

    f = h5py.File(fileName, 'r')
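
For context, here is a stripped-down sketch of the read pattern (file and dataset names are placeholders, not my real ones):

    import h5py

    def batch_generator(file_name, ds_name, batch_size):
        # Minimal sketch of the access pattern above: the HDF5 file stays
        # open across batches and each batch is one contiguous slice.
        f = h5py.File(file_name, 'r')
        n_batches = f[ds_name].shape[0] // batch_size
        for idx in range(n_batches):
            # This slice is the read that takes 10-20 seconds per batch.
            yield f[ds_name][idx * batch_size:(idx + 1) * batch_size]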

Reads seem to get slower as idx grows (sequential access?), but in any case it takes at least 10 seconds (sometimes more than 20) to read a single batch, which is far too slow, especially since I'm reading from an SSD!

Any ideas?

The dataset takes 50.4 GB on disk (compressed) and its shape is (210000, 2, 128, 128).

(This is the shape of the training set; the targets have the same shape and are stored as another dataset inside this same .h5 file.)
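
For reference, the chunk layout and compression the file was written with can be inspected like this (file and dataset names are placeholders):

    import h5py

    with h5py.File('train.h5', 'r') as f:    # placeholder file name
        dset = f['train']                    # placeholder dataset name
        print(dset.shape)                    # (210000, 2, 128, 128)
        print(dset.chunks)                   # chunk shape chosen at creation
        print(dset.compression)              # e.g. 'gzip'
        print(dset.compression_opts)         # e.g. 9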

  • It depends on how much data you are reading (and maybe on the size of the file and dataset). How big is the batch size? How big is the dataset, 200k images = xx GB? – kcw78 Apr 10 '20 at 20:40
  • Indeed, please tell us the dataset shape and batch size. Furthermore, please give the dataset chunk sizes, and read about chunking [here](http://docs.h5py.org/en/stable/high/dataset.html#chunked-storage) if you haven't done so yet (a rechunking sketch follows these comments). Important quote: _...keep in mind that when any element in a chunk is accessed, the entire chunk is read from disk._ – titusjan Apr 11 '20 at 07:26
  • Ok, so the dataset takes 50.4 GB (compressed with opt=9, i.e. maximum compression) and its shape is (210000, 2, 128, 128) for the training set (the targets have the same shape). – SheppLogan Apr 11 '20 at 13:59
  • Maybe I should add: this is the shape of the training set; the targets have the same shape and are stored as another dataset inside this same .h5 file. There are also a small test set and a validation set, but their sizes are almost negligible compared to the training set (they are 20k each, of shape (20000, 2, 128, 128)). – SheppLogan Apr 11 '20 at 14:00
  • Check your chunk-cache size. You may be reading the entire file multiple times if you don't set up a proper value for that: https://stackoverflow.com/a/48405220/4045774 (in the meantime this was implemented in h5py; see the cache sketch below). There are also much, much faster compression algorithms available (BLOSC), example: https://stackoverflow.com/a/48997927/4045774 – max9111 Apr 16 '20 at 20:02
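
Following titusjan's comment, here is a minimal rechunking sketch. Everything in it is an assumption for illustration: the file and dataset names, a batch size of 32, and gzip level 4.

    import h5py

    # Copy the data into a new file whose chunks line up with one batch,
    # so a batch read decompresses exactly the chunks it needs.
    # 'train.h5', 'train', and the chunk/batch size of 32 are assumptions.
    with h5py.File('train.h5', 'r') as src, \
         h5py.File('train_rechunked.h5', 'w') as dst:
        old = src['train']
        new = dst.create_dataset(
            'train',
            shape=old.shape,
            dtype=old.dtype,
            chunks=(32, 2, 128, 128),   # one chunk == one batch
            compression='gzip',
            compression_opts=4,         # lighter than level 9, faster to decode
        )
        step = 1024                     # copy in large sequential blocks
        for i in range(0, old.shape[0], step):
            new[i:i + step] = old[i:i + step]

If the chunk shape that was auto-selected for the original file does not line up with batch boundaries, every batch read has to decompress chunks it mostly throws away, which would explain multi-second reads.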
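
And following max9111's comment on the chunk cache: recent h5py versions (2.9+) expose the HDF5 chunk-cache parameters directly as File() keyword arguments. The values below are illustrative only; faster codecs such as Blosc (available via the third-party hdf5plugin package, as in the linked answer) can further cut decompression time.

    import h5py

    # Raise the raw-data chunk cache from its 1 MiB default so whole
    # chunks survive between consecutive batch reads; values illustrative.
    f = h5py.File(
        'train.h5', 'r',              # placeholder file name
        rdcc_nbytes=64 * 1024**2,     # 64 MiB chunk cache
        rdcc_nslots=100003,           # a large prime, per the HDF5 guidance
    )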

0 Answers