
I want to use NumPy files (.npy) from Google Drive in Google Colab without loading them into RAM.

I am working on image classification and have my image data in four NumPy files in Google Drive. The combined size of the files is greater than 14 GB, whereas Google Colab only offers 12 GB of RAM. Is there a way to use them by loading only a single batch at a time into RAM to train the model, and then removing it from RAM afterwards (similar to flow_from_directory)?

The problem with flow_from_directory is that it is very slow, even for one block of VGG16, and even when the images are in the Colab directory.

I am using the Dogs vs. Cats dataset from Kaggle:

! kaggle competitions download -c 'dogs-vs-cats'

I converted the image data into NumPy arrays and saved them in 4 files:

  • X_train - float32 - 10.62 GB - (18941, 224, 224, 3)
  • X_test - float32 - 3.4 GB - (6059, 224, 224, 3)
  • Y_train - float64 - 148 KB - (18941,)
  • Y_test - float64 - 47 KB - (6059,)

When I run the following code, the session crashes with the error 'Your session crashed after using all available RAM.'

import numpy as np
X_train = np.load('Cat_Dog_Classifier/X_train.npy')
Y_train = np.load('Cat_Dog_Classifier/Y_train.npy')
X_test = np.load('Cat_Dog_Classifier/X_test.npy')
Y_test = np.load('Cat_Dog_Classifier/Y_test.npy')

Is there any way to use these 4 files without loading them into RAM?

Rahul Vishwakarma
  • Do you have an [MCVE](https://stackoverflow.com/help/minimal-reproducible-example)? – Han-Kwang Nienhuys Jun 21 '20 at 13:01
  • MCVE for which part? The problem is the RAM being overloaded. – Rahul Vishwakarma Jun 21 '20 at 13:10
  • Check it out, I have added some. – Rahul Vishwakarma Jun 21 '20 at 13:19
  • And exactly where do you run out of memory? You don't need to provide all the code that generates the data. Just describe (1) how large the files are, (2) what the dtype and array sizes in those files are, (3) the processing code that fails on those files. – Han-Kwang Nienhuys Jun 21 '20 at 13:29
  • Does this answer your question? [Finding shape of saved numpy array (.npy or .npz) without loading into memory](https://stackoverflow.com/questions/35990775/finding-shape-of-saved-numpy-array-npy-or-npz-without-loading-into-memory) ... or possibly [loading arrays saved using numpy.save in append mode](https://stackoverflow.com/questions/35747614/loading-arrays-saved-using-numpy-save-in-append-mode) – wwii Jun 21 '20 at 14:10
  • Check out where it fails, @Han-KwangNienhuys. Sorry for the framing of the question, I am not used to this. – Rahul Vishwakarma Jun 21 '20 at 14:15
  • Also related: [Efficient way to partially read large numpy file?](https://stackoverflow.com/questions/42727412/efficient-way-to-partially-read-large-numpy-file) .. [How to partial load an array saved with numpy save in python](https://stackoverflow.com/questions/34540585/how-to-partial-load-an-array-saved-with-numpy-save-in-python) - there are more; search with variations of `python numpy load part of .npy site:stackoverflow.com` – wwii Jun 21 '20 at 14:16

2 Answers


You can do this by opening your file as a memory-mapped array.

For example:

import sys
import numpy as np

# Create a npy file
x = np.random.rand(1000, 1000)
np.save('mydata.npy', x)

# Load as a normal array
y = np.load('mydata.npy')
sys.getsizeof(y)
# 8000112

# Load as a memory-mapped array
y = np.load('mydata.npy', mmap_mode='r')
sys.getsizeof(y)
# 136

The second array acts like a normal array but is backed by disk rather than RAM. Be aware that operations on it will be much slower than on a normal RAM-backed array; memory-mapping is most often used to conveniently access portions of an array without having to load the whole thing into RAM.
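To tie this back to the question: once the arrays are memory-mapped, you can slice out one batch at a time and only that slice is read into RAM, which gives you flow_from_directory-like behaviour. Here is a rough sketch of the idea; the paths follow your question, the NpyBatchSequence name is just illustrative, and the tiny model is only a placeholder for your actual VGG16-based classifier.

import numpy as np
import tensorflow as tf

# Open the arrays as memory maps: nothing is read into RAM yet
X_train = np.load('Cat_Dog_Classifier/X_train.npy', mmap_mode='r')
Y_train = np.load('Cat_Dog_Classifier/Y_train.npy', mmap_mode='r')

class NpyBatchSequence(tf.keras.utils.Sequence):
    """Copies a single batch from disk into RAM per __getitem__ call."""
    def __init__(self, x, y, batch_size=32):
        self.x, self.y, self.batch_size = x, y, batch_size

    def __len__(self):
        return int(np.ceil(len(self.x) / self.batch_size))

    def __getitem__(self, idx):
        batch = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        # np.array(...) materialises only this slice in RAM
        return np.array(self.x[batch]), np.array(self.y[batch])

# Placeholder model so the sketch is self-contained; swap in your VGG16 setup
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(224, 224, 3)),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

model.fit(NpyBatchSequence(X_train, Y_train, batch_size=32), epochs=5)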

jakevdp

The combined size of the files is 14 GB, which is indeed greater than the 12 GB that you say you have available. However, you created those files from data that was already in memory, as shown in an earlier version of your question, which suggests that there is enough memory to hold all the data:

from numpy import save

save('drive/My Drive/ML/Cats_vs_Dogs_Classifier/X_train.npy', X_train)
save('drive/My Drive/ML/Cats_vs_Dogs_Classifier/Y_train.npy', Y_train)
save('drive/My Drive/ML/Cats_vs_Dogs_Classifier/X_test.npy', X_test)
save('drive/My Drive/ML/Cats_vs_Dogs_Classifier/Y_test.npy', Y_test)

However, if you attempt to load the X_train file again in the same Python session (I assume you're using Jupyter Notebook), you'll temporarily need another 10.6 GB of memory before the 10.6 GB occupied by the previous X_train is released.
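For illustration, here is a small, self-contained reproduction of that double-allocation effect (with a much smaller array than yours):

import numpy as np

# ~60 MB stand-in for the real 10.6 GB X_train
np.save('big.npy', np.zeros((100, 224, 224, 3), dtype='float32'))

x = np.load('big.npy')  # first load: one copy in RAM
x = np.load('big.npy')  # the new array is allocated in full *before* the name
                        # `x` is rebound, so two copies briefly coexist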

You can pick one of the following strategies:

  • Start a new Python process (or kernel) before loading data.
  • Explicitly free the memory before continuing:
    del X_train, Y_train, X_test, Y_test
    
  • Put the code that generates the data inside a function. All local variables created in the function will automatically be deleted when the function returns (see the sketch after this list).
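A minimal sketch of the last two strategies, with random data standing in for the real preprocessed images:

import gc
import numpy as np

def prepare_data(path='X_train.npy'):
    # Stand-in for the real preprocessing step
    x = np.random.rand(100, 224, 224, 3).astype('float32')
    np.save(path, x)
    # `x` is local, so its memory is released when the function returns

prepare_data()

# Alternatively, if the array was created at top level, drop it explicitly:
X_train = np.random.rand(100, 224, 224, 3).astype('float32')
del X_train
gc.collect()  # optional: prompt immediate collection

# Either way, the saved file can later be reopened lazily:
X_train = np.load('X_train.npy', mmap_mode='r')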
Han-Kwang Nienhuys