
I have an original HDF5 file with a dataset of shape (3737, 224, 224, 3) that was not extendable, i.e. no maxshape argument was passed during its creation.

I decided to create a new HDF5 file and create the dataset with maxshape=(None, 224, 224, 3) so that I can resize it later. I then just copied the dataset from the original HDF5 file into this new one and saved it.
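The copy step was essentially the following sketch (the file and dataset names here are placeholders, not my actual ones):

    import h5py

    # Copy the fixed-shape dataset into a new file whose first axis is extendable.
    with h5py.File("original.h5", "r") as src, h5py.File("resizable.h5", "w") as dst:
        data = src["images"][:]                        # full (3737, 224, 224, 3) array
        dst.create_dataset("images",
                           data=data,
                           maxshape=(None, 224, 224, 3))  # resizable along axis 0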

The contents of the two HDF5 files are exactly the same. I then tried to read all the data back and found significant performance degradation with the resizable version.
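The read itself is just a full slice of the dataset (again a sketch, with an assumed dataset name):

    import h5py

    # Read the entire dataset back into memory in one go.
    with h5py.File("resizable.h5", "r") as f:
        images = f["images"][:]   # shape (3737, 224, 224, 3)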

Original: CPU times: user 660 ms, sys: 2.58 s, total: 3.24 s Wall time: 6.08 s

Resizable: CPU times: user 18.6 s, sys: 4.41 s, total: 23 s Wall time: 49.5 s

That's almost 10 times as slow. Is this to be expected? The file size difference is less than 2 MB. Are there optimization tips/tricks I need to be aware of?

kawingkelvin
  • You need to set a proper chunk cache and adjust the chunk size to your reading or writing pattern. This is quite a common problem and should be added to the h5py documentation. Have a look at https://stackoverflow.com/a/48405220/4045774 – max9111 Apr 17 '18 at 12:38

1 Answer


Upon reading the HDF5 docs carefully, it seems that specifying a maxshape when creating a dataset (which enables it to be resized later) also turns chunking on, and this appears to be mandatory. The "default" chunk shape it chose for me was dataset.chunks = (234, 14, 28, 1).
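You can check which chunk shape was picked after the fact (file/dataset names here are placeholders):

    import h5py

    with h5py.File("resizable.h5", "r") as f:
        print(f["images"].chunks)   # e.g. (234, 14, 28, 1) -- the auto-guessed chunk shape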

According to the docs, this means the data is not stored contiguously but "haphazardly" in a B-tree-like structure. This most likely explains the slowness I observed; it is probably doing far more extra I/O than I thought.

I set the chunk size to the entire dataset size by passing chunks=(3737, 224, 224, 3), and this time I got

CPU times: user 809 µs, sys: 837 ms, total: 838 ms Wall time: 914 ms
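Concretely, the creation call was roughly this (file and dataset names are placeholders):

    import h5py

    # One chunk spans the whole dataset, so reading the full array touches a single chunk.
    with h5py.File("original.h5", "r") as src, h5py.File("resizable.h5", "w") as dst:
        dst.create_dataset("images",
                           data=src["images"][:],
                           maxshape=(None, 224, 224, 3),
                           chunks=(3737, 224, 224, 3))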

That's a big speedup in loading my (3737, 224, 224, 3) tensor. I sort of understand why chunking is a scalability solution, but the fact that it magically assigns a chunk size is confusing. My context is mini-batch training for deep learning, so the optimal layout is for each chunk to be a mini-batch (see the sketch below).
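If each read is one mini-batch, a chunk per mini-batch (plus a chunk cache large enough to hold a few chunks, as the comment above suggests) would be the more scalable setup. A sketch, assuming a batch size of 32 and uint8 images; the names and numbers are illustrative, not from my actual code:

    import h5py

    BATCH = 32                              # assumed mini-batch size
    CHUNK_BYTES = BATCH * 224 * 224 * 3     # ~4.8 MB per chunk for uint8 data

    # Chunk along the batch axis so each mini-batch read touches exactly one chunk.
    with h5py.File("train.h5", "w") as f:
        f.create_dataset("images",
                         shape=(3737, 224, 224, 3),
                         maxshape=(None, 224, 224, 3),
                         dtype="uint8",
                         chunks=(BATCH, 224, 224, 3))

    # When reading, make the chunk cache big enough to hold several chunks.
    with h5py.File("train.h5", "r", rdcc_nbytes=8 * CHUNK_BYTES, rdcc_nslots=10007) as f:
        batch = f["images"][0:BATCH]        # one mini-batch, one chunk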

kawingkelvin