I have a gigantic training data set that doesn't fit in RAM. I want to load a random batch of images into a stack without loading the whole .h5 file. My approach is to create a list of indices and shuffle it, instead of shuffling the whole .h5 file. Let's say:
import numpy as np
import h5py

a = np.arange(2000*2000*2000).reshape(2000, 2000, 2000)
idx = np.random.randint(2000, size=800)  # so I only need to reshuffle this idx at the end of each epoch

# create this huge data set, ~32 GB > my RAM
with h5py.File('./tmp.h5', 'w') as f:
    tmp = f.create_dataset('a', (2000, 2000, 2000))
    tmp[:] = a
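(Side note on the toy example: building a in memory already exceeds my RAM budget, so in practice I write the dataset slice by slice. A minimal sketch of what I mean, where the per-slice fill is just a placeholder for the real data source:)

import numpy as np
import h5py

# hypothetical chunk-wise writer: fill one 2000x2000 slice at a time
# so the full ~32 GB array never has to exist in memory at once
with h5py.File('./tmp.h5', 'w') as f:
    dset = f.create_dataset('a', (2000, 2000, 2000), dtype='float32')
    for i in range(2000):
        # placeholder slice; in reality this would come from the actual data source
        dset[i] = np.full((2000, 2000), i, dtype='float32')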
# read it back
with h5py.File('./tmp.h5', 'r') as f:
    tensor = f['a'][:][idx]  # without [:] I get an error; with it the whole file is loaded, which I don't want
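A sketch of what I've been experimenting with, assuming the error comes from h5py's restriction that fancy indexing on a dataset needs a sorted, duplicate-free index list (the sort/choice workaround below is my assumption, not something I've confirmed is the intended way):

import numpy as np
import h5py

batch_size = 800
# draw unique indices so the index list has no duplicates
idx = np.random.choice(2000, size=batch_size, replace=False)
idx.sort()  # h5py wants the indices in increasing order

with h5py.File('./tmp.h5', 'r') as f:
    tensor = f['a'][idx]  # should read only the 800 selected slices, not the whole dataset

# if the batch order matters, reshuffle the loaded slices in memory afterwards
np.random.shuffle(tensor)  # shuffles along the first axis only

Is something like this the right way to do it, or is there a better-suited approach?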
Does somebody have a solution?