
I have a dataset that is too large to load into memory all at once. My plan is to load half of the dataset, train for 2 epochs, delete that data, load the other half, train, and repeat. However, even though I delete the data after every 2 epochs, I still run out of RAM and training crashes.

I have tried training on each half of the dataset individually and it works. But when I create a loop that trains on one half, deletes it, and then trains on the other half, it crashes. Keep in mind it's not GPU memory I'm running out of, but system RAM.

Here is the loop:

import numpy as np
from sklearn.model_selection import train_test_split

# model, MyDataGenerator, batch_size, first_dim and checkpoint_callback are defined earlier.
for i in range(20):
  # Alternate between the two halves of the dataset every 2 epochs.
  if i % 2 == 0:
    X_train = np.load('/content/drive/My Drive/Kaggle ISLR/X_train_batch1.npy')
    y_train = np.load('/content/drive/My Drive/Kaggle ISLR/y_train_batch1.npy')
  else:
    X_train = np.load('/content/drive/My Drive/Kaggle ISLR/X_train_batch2.npy')
    y_train = np.load('/content/drive/My Drive/Kaggle ISLR/y_train_batch2.npy')

  X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
  train_gen = MyDataGenerator(X_train, y_train, batch_size, first_dim)
  val_gen = MyDataGenerator(X_val, y_val, batch_size, first_dim)
  del X_train
  del y_train

  model.fit(
      train_gen,
      validation_data=val_gen,
      epochs=2,
      callbacks=[checkpoint_callback],
  )

  del train_gen
  del val_gen

Is there a better way to do this to prevent running out of RAM?

Conweezy
    `del` doesn't really delete and reclaim the data. It just decrements ref count. So, if something else is still referencing the data, the memory will not be reclaimed. – Super-intelligent Shade Mar 19 '23 at 18:43
  • I would suggest using [input pipelines](https://www.tensorflow.org/guide/data), which is the tensorflow way of doing it. You may have to save the data in a different format though. – Super-intelligent Shade Mar 19 '23 at 18:44
  • @Super-intelligentShade can you write an answer based on that? – MattDMo Mar 19 '23 at 18:50
  • `del` isn't really doing anything useful here. – juanpa.arrivillaga Mar 19 '23 at 18:51
  • @MattDMo I can, but I am too lazy (plus the OP didn't specify what data is stored in the datasets). And, it looks like someone has already figured it out anyway: https://stackoverflow.com/a/50932872/4358570 – Super-intelligent Shade Mar 19 '23 at 18:57
  • @Super-intelligentShade thanks for the dupe target, I've hammered the question. In the future, if you come across situations like this where another question and its answer(s) address the OP's concerns, just flag the Q as a duplicate and let the community do its thing. – MattDMo Mar 19 '23 at 19:05
  • @MattDMo I didn't flag it because the answer uses pretty "brutal" low-level way of reading the file. Not sure everyone would be ready to stomach that. IMHO this question should be left open. – Super-intelligent Shade Mar 19 '23 at 19:08
  • @OP, [np.memmap](https://numpy.org/doc/stable/reference/generated/numpy.memmap.html) may also be of interest to you. To quote: _"Memory-mapped files are used for accessing small segments of large files on disk, without reading the entire file into memory. NumPy’s memmap’s are array-like objects."_ You can (and should) still tie it in with `tf.Dataset` IMHO. – Super-intelligent Shade Mar 19 '23 at 19:09
  • @Super-intelligentShade Duplicates link to questions, not specific answers. In this case (and many others), there are multiple answers to choose from, or blend together. One of the answers linked to yet another Q&A that seems to address this situation as well, so I added it to the list of duplicate questions for this current question. I think the OP has enough avenues of research now that they can decide which method is best for them. As far as the low-level stuff is concerned, sometimes that's what needs to be done to wrangle one data format into another. – MattDMo Mar 19 '23 at 19:18
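
To illustrate the `del` comments above: `train_gen` and `val_gen` (and `X_val`/`y_val`, which are never deleted) still reference the loaded arrays, so `del X_train` only removes one name and frees nothing while the generator is alive. A minimal sketch, assuming `MyDataGenerator` simply stores the arrays it is given (the real class from the question is not shown):

import numpy as np

class MyDataGenerator:
    # Simplified stand-in for the generator in the question: it just keeps
    # references to the arrays it was constructed with.
    def __init__(self, X, y, batch_size, first_dim):
        self.X, self.y = X, y
        self.batch_size = batch_size
        self.first_dim = first_dim

X_train = np.zeros((1000, 100))                  # pretend this is the big array
train_gen = MyDataGenerator(X_train, None, 32, 100)

del X_train                                      # removes the name, not the data
print(train_gen.X.shape)                         # (1000, 100) -- array is still alive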
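
Building on the `tf.data` and `np.memmap` suggestions in the comments, here is a rough sketch of streaming the saved arrays from disk instead of loading each half whole. It is only an illustration under assumptions: `make_dataset` is a hypothetical helper, the paths are the ones from the question, and shapes/dtypes are taken from the saved arrays:

import numpy as np
import tensorflow as tf

def make_dataset(x_path, y_path, batch_size):
    # mmap_mode keeps the .npy files on disk; slices are read lazily.
    X = np.load(x_path, mmap_mode='r')
    y = np.load(y_path, mmap_mode='r')

    def gen():
        for i in range(len(X)):
            yield X[i], y[i]             # only one sample is materialised at a time

    return (tf.data.Dataset.from_generator(
                gen,
                output_signature=(
                    tf.TensorSpec(shape=X.shape[1:], dtype=X.dtype),
                    tf.TensorSpec(shape=y.shape[1:], dtype=y.dtype)))
            .batch(batch_size)
            .prefetch(tf.data.AUTOTUNE))

train_ds = make_dataset('/content/drive/My Drive/Kaggle ISLR/X_train_batch1.npy',
                        '/content/drive/My Drive/Kaggle ISLR/y_train_batch1.npy',
                        batch_size=32)
# model.fit(train_ds, epochs=2)

With this approach the manual load/delete loop is unnecessary, since only one batch needs to fit in RAM at a time; as the comments note, converting the data to a different on-disk format (e.g. via the linked tf.data guide) may make this even simpler.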
