
I have large image datasets to train CNNs on. Since I cannot load all the images into RAM, I plan to dump them into an HDF5 file (with h5py) and then iterate over the set batch-wise, as suggested in

Most efficient way to use a large data set for PyTorch?

I tried creating a separate dataset for every picture, all located in the same group, which is very fast. But I could not figure out how to iterate over all the datasets in the group, other than accessing each one by its name. As an alternative, I tried putting all the images iteratively into one dataset by extending its shape, following

How to append data to one specific dataset in a hdf5 file with h5py and

incremental writes to hdf5 with h5py

but this is very slow. Is there a faster way to create an HDF5 dataset to iterate over?
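
For reference, the resize-and-append pattern I mean looks roughly like this (the file name, image size, count, and random stand-in data are just placeholders):

    import h5py
    import numpy as np

    image_shape = (256, 256, 3)   # placeholder: images resized to a fixed size
    n_images = 1000               # placeholder count

    with h5py.File("images.h5", "w") as f:
        # resizable dataset, grown by one row per image
        dset = f.create_dataset(
            "images",
            shape=(0,) + image_shape,
            maxshape=(None,) + image_shape,
            dtype="uint8",
        )
        for i in range(n_images):
            # stand-in for loading and preprocessing one image from disk
            img = np.random.randint(0, 256, size=image_shape, dtype=np.uint8)
            dset.resize(dset.shape[0] + 1, axis=0)   # grow the first axis by one
            dset[-1] = img                           # write the new image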

Camill Trüeb
  • You can iterate over all datasets in a group by using group.keys() and checking for instances of h5py.Dataset. See for example: https://stackoverflow.com/questions/34330283/how-to-differentiate-between-hdf5-datasets-and-groups-with-h5py – NoDataDumpNoContribution Mar 07 '19 at 16:41
  • The problem with this is that I would like to access the data batchwise, e.g. 32 images at a time. Re-assembling such a batch from the individual per-image datasets in every epoch is very slow... – Camill Trüeb Mar 08 '19 at 09:31
  • You shouldn't have each image as its own dataset, but rather one large dataset whose first axis represents images. So a stack of 10 256x256 RGB images should be a dataset with shape [10, 256, 256, 3]. – Yngve Moe Mar 09 '19 at 18:52
  • Thank you! I realized the dataset creation can be sped up a lot by not compressing the data and not reshaping the dataset every iteration. – Camill Trüeb Mar 11 '19 at 09:28
  • The most important things are chunk_shape and chunk_cache. The documentation isn't very good on these topics; see e.g. https://stackoverflow.com/a/48405220/4045774. Another common mistake is opening/closing the HDF5 file on every iteration. If you do it the right way, you should easily reach the sequential IO speed of an HDD or SATA SSD. But without a code sample it is hard to say why your implementation is so slow. – max9111 Jun 14 '19 at 08:10

1 Answer


I realize this is an old question, but I found a very helpful resource on this subject that I wanted to share:

https://www.oreilly.com/library/view/python-and-hdf5/9781491944981/ch04.html

Basically, an HDF5 file (with chunking enabled) is like a little filesystem. It stores data in chunks scattered throughout the file, so, like a filesystem, it benefits from locality. If the chunks are the same shape as the array sections you're trying to access, reading/writing will be fast. If the data you're looking for is scattered across multiple chunks, access will be slow.

So in the case of training a neural network on images, you're probably going to have to make the images a standard size. Set chunks=(1,) + image_shape, or even better chunks=(batch_size,) + image_shape, when creating the dataset, and reading/writing will be a lot faster.
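
As a rough sketch of what that looks like with h5py (the file name, image size, image count, batch size, and random data below are illustrative, not taken from the question):

    import h5py
    import numpy as np

    image_shape = (256, 256, 3)   # illustrative fixed image size
    batch_size = 32               # illustrative batch size
    n_images = 1024               # illustrative dataset size

    # Write: one large dataset, chunked along the first (image) axis in batch-sized blocks.
    with h5py.File("train.h5", "w") as f:
        dset = f.create_dataset(
            "images",
            shape=(n_images,) + image_shape,
            dtype="uint8",
            chunks=(batch_size,) + image_shape,
        )
        for start in range(0, n_images, batch_size):
            # stand-in for a preprocessed batch of images
            batch = np.random.randint(0, 256, size=(batch_size,) + image_shape, dtype=np.uint8)
            dset[start:start + batch_size] = batch   # each write fills exactly one chunk

    # Read: batch slices line up with chunk boundaries, so each read touches one chunk.
    with h5py.File("train.h5", "r") as f:
        dset = f["images"]
        for start in range(0, dset.shape[0], batch_size):
            batch = dset[start:start + batch_size]

Because the chunk shape matches the slices read during training, each batch read maps to one contiguous chunk on disk instead of gathering pieces from many chunks.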

Lugh