
I have a gigantic training data set that doesn't fit in RAM. I'm trying to load a random batch of images into a stack without loading the whole .h5 file. My approach is to create a list of indices and shuffle those instead of shuffling the whole .h5 file. Let's say:

import numpy as np
import h5py

a = np.arange(2000 * 2000 * 2000).reshape(2000, 2000, 2000)
idx = np.random.randint(2000, size=800)  # so I only need to reshuffle this idx at the end of each epoch

# create this huge dataset, 32 GB > my RAM
with h5py.File('./tmp.h5', 'w') as f:
    tmp = f.create_dataset('a', (2000, 2000, 2000))
    tmp[:] = a

# read it
with h5py.File('./tmp.h5', 'r') as f:
    tensor = f['a'][:][idx]  # without [:] I get an error; with it, the whole file is loaded, which I want to avoid

Does somebody have a solution?

Zézouille
  • What's the error? Have you read http://docs.h5py.org/en/stable/high/dataset.html#fancy-indexing? – hpaulj Mar 11 '19 at 21:19
  • Using `[:]` loads the array, allowing you to use `[idx]` on the resulting numpy array. – hpaulj Mar 11 '19 at 21:20
  • Indices have to be unique and strictly increasing. Then you can load a part of the dset using tensor = dset['a'][:, idx, :]. If performance is of any concern you also have to think about chunk shape and chunk cache https://stackoverflow.com/a/48405220/4045774 – max9111 Mar 12 '19 at 08:41
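To make the comments concrete, here is a minimal sketch (assuming the dataset written in the snippet above): deduplicate and sort the indices so that h5py's fancy indexing accepts them, and only the selected rows are read from disk:

import numpy as np
import h5py

idx = np.random.randint(2000, size=800)  # random row indices, may contain repeats
sel = np.unique(idx)                     # h5py needs unique, increasing coordinates

with h5py.File('./tmp.h5', 'r') as f:
    subset = f['a'][sel]                 # reads only the selected rows, not the whole dataset
    print(subset.shape)                  # (len(sel), 2000, 2000)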

1 Answer


Thanks to @max9111, here's how I propose to solve it:

batch_size = 100
idx = np.arange(2000)
# shuffle in place (np.random.shuffle returns None, so don't reassign idx)
np.random.shuffle(idx)

Due to the constraint of h5py:

Selection coordinates must be given in increasing order

One should sort before reading:

def get_batch(path, idx, step, batch_size):
    # sort this batch's shuffled indices so h5py accepts the selection
    batch_idx = np.sort(idx[step * batch_size:(step + 1) * batch_size])
    with h5py.File(path, 'r') as f:
        return f['img'][batch_idx], f['label'][batch_idx]

for step in range(epoch_len // batch_size):  # the remainder of the epoch is dropped
    img_batch, label_batch = get_batch(path, idx, step, batch_size)
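
For reference, here is how these pieces could be combined into a single generator. This is only a sketch: batch_generator and n_samples are illustrative names, and the dataset names 'img' and 'label' are taken from above. It opens the file once per epoch and, as max9111's link suggests, enlarges the chunk cache (the rdcc_nbytes argument, available in h5py 2.9+), which only matters if the datasets are chunked:

import numpy as np
import h5py

def batch_generator(path, batch_size=100, n_samples=2000):
    # yield shuffled (img, label) batches without ever loading the whole file
    idx = np.arange(n_samples)
    np.random.shuffle(idx)
    # one open per epoch; a bigger chunk cache helps repeated partial reads of chunked datasets
    with h5py.File(path, 'r', rdcc_nbytes=512 * 1024 ** 2) as f:
        for step in range(n_samples // batch_size):
            # per-batch indices must be sorted for h5py fancy indexing
            batch_idx = np.sort(idx[step * batch_size:(step + 1) * batch_size])
            yield f['img'][batch_idx], f['label'][batch_idx]

for img_batch, label_batch in batch_generator(path):
    ...  # train on one batch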
Zézouille