
I have a data generator that works, but it is extremely slow at reading data from a 200k-image dataset.

I use:

    X = f[self.trName][idx * self.batch_size:(idx + 1) * self.batch_size]

after having opened the file with:

    f = h5py.File(fileName, 'r')
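
For context, here is a stripped-down sketch of the read pattern (file and dataset names are placeholders, not my real ones):

    import h5py

    def batch_generator(file_name, ds_name, batch_size):
        # Minimal sketch of the access pattern above: the HDF5 file stays
        # open across batches and each batch is one contiguous slice.
        f = h5py.File(file_name, 'r')
        n_batches = f[ds_name].shape[0] // batch_size
        for idx in range(n_batches):
            # This slice is the read that takes 10-20 seconds per batch.
            yield f[ds_name][idx * batch_size:(idx + 1) * batch_size]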

Reads seem to get slower as idx grows (sequential access?), but in any case it takes at least 10 seconds (sometimes more than 20) to read a single batch, which is far too slow, especially since I'm reading from an SSD!

Any ideas?

The dataset takes 50.4 GB on disk (compressed) and its shape is (210000, 2, 128, 128).

(This is the shape of the training set; the targets have the same shape and are stored as another dataset inside this same .h5 file.)
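
For reference, the chunk layout and compression the file was written with can be inspected like this (file and dataset names are placeholders):

    import h5py

    with h5py.File('train.h5', 'r') as f:    # placeholder file name
        dset = f['train']                    # placeholder dataset name
        print(dset.shape)                    # (210000, 2, 128, 128)
        print(dset.chunks)                   # chunk shape chosen at creation
        print(dset.compression)              # e.g. 'gzip'
        print(dset.compression_opts)         # e.g. 9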

  • It depends on how much data you are reading (and maybe on the size of the file and dataset). How big is the batch size? How big is the dataset, 200k images = xx GB? – kcw78 Apr 10 '20 at 20:40
  • Indeed, please tell us the dataset shape and batch size. Furthermore, please give the dataset chunk sizes, and read about chunking [here](http://docs.h5py.org/en/stable/high/dataset.html#chunked-storage) if you haven't done so yet (a rechunking sketch follows these comments). Important quote: _...keep in mind that when any element in a chunk is accessed, the entire chunk is read from disk._ – titusjan Apr 11 '20 at 07:26
  • Ok, so the dataset takes 50.4 GB (compressed with opt=9, i.e. maximum compression) and its shape is (210000, 2, 128, 128) for the training set (the targets have the same shape). – SheppLogan Apr 11 '20 at 13:59
  • Maybe I should add: this is the shape of the training set; the targets have the same shape and are stored as another dataset inside this same .h5 file. There are also a small test set and a validation set, but their sizes are almost negligible compared to the training set (they are 20k each, of shape (20000, 2, 128, 128)). – SheppLogan Apr 11 '20 at 14:00
  • Check your chunk-cache size. You may be reading the entire file multiple times if you don't set up a proper value for that: https://stackoverflow.com/a/48405220/4045774 (in the meantime this was implemented in h5py; see the cache sketch below). There are also much, much faster compression algorithms available (BLOSC), example: https://stackoverflow.com/a/48997927/4045774 – max9111 Apr 16 '20 at 20:02
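
Following titusjan's comment, here is a minimal rechunking sketch. Everything in it is an assumption for illustration: the file and dataset names, a batch size of 32, and gzip level 4.

    import h5py

    # Copy the data into a new file whose chunks line up with one batch,
    # so a batch read decompresses exactly the chunks it needs.
    # 'train.h5', 'train', and the chunk/batch size of 32 are assumptions.
    with h5py.File('train.h5', 'r') as src, \
         h5py.File('train_rechunked.h5', 'w') as dst:
        old = src['train']
        new = dst.create_dataset(
            'train',
            shape=old.shape,
            dtype=old.dtype,
            chunks=(32, 2, 128, 128),   # one chunk == one batch
            compression='gzip',
            compression_opts=4,         # lighter than level 9, faster to decode
        )
        step = 1024                     # copy in large sequential blocks
        for i in range(0, old.shape[0], step):
            new[i:i + step] = old[i:i + step]

If the chunk shape that was auto-selected for the original file does not line up with batch boundaries, every batch read has to decompress chunks it mostly throws away, which would explain multi-second reads.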
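
And following max9111's comment on the chunk cache: recent h5py versions (2.9+) expose the HDF5 chunk-cache parameters directly as File() keyword arguments. The values below are illustrative only; faster codecs such as Blosc (available via the third-party hdf5plugin package, as in the linked answer) can further cut decompression time.

    import h5py

    # Raise the raw-data chunk cache from its 1 MiB default so whole
    # chunks survive between consecutive batch reads; values illustrative.
    f = h5py.File(
        'train.h5', 'r',              # placeholder file name
        rdcc_nbytes=64 * 1024**2,     # 64 MiB chunk cache
        rdcc_nslots=100003,           # a large prime, per the HDF5 guidance
    )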

0 Answers