
I have a large HDF5 file containing 16000 different 512x512 numpy arrays. Obviously, reading the whole file into RAM will make it crash (the total file size is 40 GB).

I want to load this array into data and then split data into train_x and test_x. The labels are stored locally.

I did this, which only creates a handle to the file without fetching the data:

    h5 = h5py.File('/file.hdf5', 'r')
    data = h5.get('data')

but when I try to split data into train and test:

    x_train = data[0:14000]
    y_train = label[0:16000]
    x_test = data[14000:]
    y_test = label[14000:16000]

I get the error:

    MemoryError: Unable to allocate 13.42 GiB for an array with shape (14000, 256, 256) and data type float32

I want to load them in batches and train a keras model, but the error above doesn't let me get that far:

    model.compile(optimizer=Adam(learning_rate=0.001),
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    history = model.fit(x_train, y_train,
                        validation_data=(x_test, y_test),
                        epochs=32, verbose=1)

How can I get around this issue?


1 Answer


First, let's describe what you are doing.
The statement data = h5.get('data') returns an h5py dataset object for the dataset named 'data'. It does NOT load the entire dataset into memory (which is good). Note: that statement is more typically written as data = h5['data']. I also assume there is a similar call to get an h5py object for the 'label' dataset.
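
As a minimal illustration of that behavior (file and dataset names taken from the question):

    import h5py

    h5 = h5py.File('/file.hdf5', 'r')
    data = h5['data']                # h5py Dataset object; no image data is read yet

    print(data.shape, data.dtype)    # metadata only, e.g. (16000, 256, 256) float32
    one_image = data[0]              # only now is a single image read into RAM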

Each of your next 4 statements will return a NumPy array based on the indices and dataset. NumPy arrays are stored in memory, which is why you get the memory error. When the program executes x_train= data[0:14000], you need 13.42 GiB to load the array in memory. (Note: the error implies the arrays are 256x256, not 512x512.)

If you don't have enough RAM to store the array, you will have to "do something" to reduce the memory footprint. Options to consider:

  1. Resize the images from 256x256 (or 512x512) to something smaller and save to a new h5 file
  2. Modify 'data' to use ints instead of floats and save to a new h5 file
  3. Write the image data to .npy files and load them in batches
  4. Read in fewer images, and train in batches (see the sketch after this list).
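
For options 3 and 4, the usual pattern is a generator (or a tf.keras.utils.Sequence) that reads only one batch from the HDF5 file per training step. Below is a minimal sketch, assuming the file and dataset names from the question and that the labels are already in memory as a NumPy array called label; the class name, batch size, and the 14000/2000 split are only illustrative:

    import math
    import h5py
    import numpy as np
    import tensorflow as tf

    class H5BatchSequence(tf.keras.utils.Sequence):
        """Yields (images, labels) batches, reading one batch from disk at a time."""

        def __init__(self, h5_path, indices, labels, batch_size=32):
            super().__init__()
            self.indices = np.sort(np.asarray(indices))  # h5py fancy indexing needs increasing indices
            self.labels = labels
            self.batch_size = batch_size
            # Keeping the file open is fine for single-process training; with
            # multiprocessing workers, open it inside __getitem__ instead.
            self._h5 = h5py.File(h5_path, 'r')
            self._data = self._h5['data']

        def __len__(self):
            return math.ceil(len(self.indices) / self.batch_size)

        def __getitem__(self, i):
            batch_idx = self.indices[i * self.batch_size:(i + 1) * self.batch_size]
            x = self._data[batch_idx]     # only this slice is read into RAM
            y = self.labels[batch_idx]    # labels indexed by the same global indices
            return x, y

    # Hypothetical usage with the 14000/2000 split from the question:
    # train_seq = H5BatchSequence('/file.hdf5', range(14000), label)
    # test_seq  = H5BatchSequence('/file.hdf5', range(14000, 16000), label)
    # model.fit(train_seq, validation_data=test_seq, epochs=32, verbose=1)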

I wrote an answer to a somewhat related question that describes h5py behavior with training and testing data, and how to randomize input from .npy files. It might be helpful. See this answer: h5py writing: How to efficiently write millions of .npy arrays to a .hdf5 file?

As an aside, you probably want to randomize your selection of testing and training data (and not simply pick the first 14000 images for training and the last 2000 images for testing). Also, check your indices for y_train= label[0:16000]. I think you will get an error with mismatched x_train and y_train sizes.
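
Here is a minimal sketch of such a shuffled split that only touches the labels in memory and keeps the image indices for later batched reads from the h5 file; the seed and variable names are just for illustration:

    import numpy as np

    n_total, n_train = 16000, 14000
    rng = np.random.default_rng(42)       # fixed seed only for reproducibility

    perm = rng.permutation(n_total)       # shuffled indices 0..15999
    train_idx = np.sort(perm[:n_train])   # sorted, because h5py fancy indexing needs increasing indices
    test_idx = np.sort(perm[n_train:])

    # The labels are assumed to already be a NumPy array called `label`.
    y_train = label[train_idx]
    y_test = label[test_idx]
    # Pass train_idx / test_idx to the batch reader instead of slicing `data` directly.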

  • Hi kcw78, thank you for your answer. Your first two suggestions are not doable for me because I need to keep this dataset. I load my labels directly from a local file (they are stored locally). I am interested in trying your last two suggestions, but I don't know how. Can you explain with some dummy code how to train in batches in keras, and how to load in batches? – El3oss May 23 '21 at 13:48
  • I mentioned "train in batches" based on comments I have read about others needing help to read HDF5 in batches. In the past, the `.fit_generator()` function was used with a Python generator to do this. However, TF is in the process of deprecating `.fit_generator()`. If you are using TF 2.2.0 (or higher), you have to use the `.fit()` method. The `.fit()` method can now use generator input and includes data augmentation. You can also use a `tf.data.Dataset()` and loop over slices of the image data. – kcw78 May 23 '21 at 20:10
  • Many thanks for your feedback, do you know some sources where I could find more details on how to implement it, as I am not used to this? – El3oss May 23 '21 at 20:25
  • I thought you might ask, so did a little Googling. :-) SO has some good answers. Start with these: [Keras: load images batch wise for large dataset](https://stackoverflow.com/a/47200743/10462884) and [How to split dataset into K-fold without loading the whole dataset at once?](https://stackoverflow.com/a/67058583/10462884) If that doesn't help, Google "keras fit_generator" for some tutorials. You will need to write a Python generator function to read and load a subset of image arrays from your H5 file. – kcw78 May 23 '21 at 21:56
  • Hi kcw, thanks to your direction I found a quicker way using a python package called h5imagegenerator. Basically: train_generator = HDF5ImageGenerator(src='path/to/train.h5', X_key='images', y_key='labels', scaler=True, labels_encoding='hot', batch_size=32, mode='train'). This will feed the batches of data to the model in real time, and can even do some preprocessing if you are interested. Then you can feed it to your model: model.fit_generator(train_generator, validation_data=test_generator, ...). It's a solution if anyone needs it – El3oss May 24 '21 at 12:48
  • Hi El3oss. Very Nice! The PyPI folks have already written the generator I was describing. Now that you mention **h5imagegenerator**, I vaguely remember reading about it, but haven't tried to use it. Looks like a great solution. I think you need to add your labels to a `labels` dataset in your h5 file. – kcw78 May 24 '21 at 13:04
  • @kcw78, yes of course I already have. The only downside is that the labels have to be integers and it takes care of the one-hot encoding itself, so not great if you had already done that. It is also not the quickest process, but it is better than nothing at all. For people who want the package h5imagegenerator: https://pypi.org/project/h5imagegenerator/#description – El3oss May 24 '21 at 14:10
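
Putting the comments above together, here is a hedged sketch of the h5imagegenerator approach. The constructor parameters follow the call quoted in the comment and the package's PyPI page; the import path and the settings used for the validation generator are assumptions, so check the package documentation before relying on them:

    # Sketch only: parameter names taken from the comment above; import path assumed.
    from h5imagegenerator import HDF5ImageGenerator

    train_generator = HDF5ImageGenerator(
        src='path/to/train.h5',      # HDF5 file holding the image and label datasets
        X_key='images',              # name of the image dataset inside the file
        y_key='labels',              # name of the (integer) label dataset
        scaler=True,                 # rescale pixel values
        labels_encoding='hot',       # one-hot encode the integer labels
        batch_size=32,
        mode='train')

    test_generator = HDF5ImageGenerator(
        src='path/to/test.h5',       # assumed: a separate held-out file with the same keys
        X_key='images',
        y_key='labels',
        scaler=True,
        labels_encoding='hot',
        batch_size=32,
        mode='train')                # check the docs for the appropriate mode for evaluation data

    # With TF 2.2+ the generators can be passed straight to .fit(); older code
    # used .fit_generator(), as in the comment above.
    model.fit(train_generator, validation_data=test_generator, epochs=32)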