
I have a couple of huge training files I am planning to train on. The validation data is also fine and I see no problem there, but the SIZE is huge. I am talking about 20 GB+. Loading even one file crashes Python with a MemoryError.

I have tried combining the files into one, but it's too big.

import numpy as np

X = np.load('X150.npy')
Y = np.load('Y150.npy')

Error

~\AppData\Roaming\Python\Python37\site-packages\numpy\lib\format.py in read_array(fp, allow_pickle, pickle_kwargs)
    710         if isfileobj(fp):
    711             # We can use the fast fromfile() function.
--> 712             array = numpy.fromfile(fp, dtype=dtype, count=count)
    713         else:
    714             # This is not a real file. We have to read it the

MemoryError: 

I need a solution so I can train huge datasets.

Nik
Once you need to do any serious computation, train ( .fit() ) ML predictors, and run further analyses on N [GB] files, the best next step is to acquire an appropriate computing platform. Having less RAM than your data-file sizes is possible ( numpy.memmap files are a cheap way to work through a small cached data-window into a disk-only data object ), **but one will pay an immense cost**: **~ 10000-100000x slower random-access times** than having all data kept in an in-RAM layout. I used .memmap()-s for ML for a few years, before RAM large enough became affordable. – user3666197 Jul 21 '19 at 09:12

1 Answer


Important: first make sure that your Python is 64-bit. The methods below only support files up to 2 GB on 32-bit Python versions.
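
A quick way to check is the pointer size Python reports:

import struct

# Prints 64 on a 64-bit build of Python, 32 on a 32-bit build
print(struct.calcsize("P") * 8)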

Typically, one should use np.memmap() to use the array without loading it into RAM. From the numpy docs: "Memory-mapped files are used for accessing small segments of large files on disk, without reading the entire file into memory."

Example usage:

import numpy as np

x_file = "X_150.npy"

# mode='w+' creates (or overwrites) the file; use mode='r' or 'r+' to map an existing one
X = np.memmap(x_file, dtype='int', mode='w+', shape=(300000, 1000))
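
For reading a raw memmap back later, you must supply the same dtype and shape yourself, since a plain memmap file carries no header (a sketch; the .dat filename and the shape are assumptions):

import numpy as np

# Read-only view of an existing raw binary file; nothing is loaded until you slice into it
X = np.memmap("X_150.dat", dtype='int', mode='r', shape=(300000, 1000))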

However, since your files are already stored as .npy files, plain np.memmap() is awkward: it treats the file as raw bytes and knows nothing about the .npy header. Instead, I stumbled upon np.lib.format.open_memmap(), which creates or loads memory-mapped .npy files.

The usage would be as follows, nearly identical to np.memmap() (note that mode='w+' creates a new file; loading an existing one is shown below):

import numpy as np

x_file = "X_150.npy"

X = np.lib.format.open_memmap(x_file, dtype='int', mode='w+', shape=(300000, 1000))
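
To load your existing files, open them read-only: with mode='r', the dtype and shape are read from the .npy header, so you don't pass them at all. Below is a minimal sketch of batch-wise training on top of the memory maps; the batch size and the training call are assumptions, standing in for whatever your framework uses:

import numpy as np

# Read-only memory maps; shape and dtype come from the .npy headers
X = np.lib.format.open_memmap('X150.npy', mode='r')
Y = np.lib.format.open_memmap('Y150.npy', mode='r')

batch_size = 1024  # assumed; tune to your RAM budget
for start in range(0, X.shape[0], batch_size):
    stop = start + batch_size
    # Slicing then copying pulls only this window off disk into RAM
    x_batch = np.asarray(X[start:stop])
    y_batch = np.asarray(Y[start:stop])
    # Feed (x_batch, y_batch) to your trainer here,
    # e.g. a Keras model.train_on_batch(x_batch, y_batch)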

Here is the docstring for the second function (from this answer):

>>> print(numpy.lib.format.open_memmap.__doc__)

"""
Open a .npy file as a memory-mapped array.

This may be used to read an existing file or create a new one.

Parameters
----------
filename : str
    The name of the file on disk. This may not be a filelike object.
mode : str, optional
    The mode to open the file with. In addition to the standard file modes,
    'c' is also accepted to mean "copy on write". See `numpy.memmap` for
    the available mode strings.
dtype : dtype, optional
    The data type of the array if we are creating a new file in "write"
    mode.
shape : tuple of int, optional
    The shape of the array if we are creating a new file in "write"
    mode.
fortran_order : bool, optional
    Whether the array should be Fortran-contiguous (True) or
    C-contiguous (False) if we are creating a new file in "write" mode.
version : tuple of int (major, minor)
    If the mode is a "write" mode, then this is the version of the file
    format used to create the file.

Returns
-------
marray : numpy.memmap
    The memory-mapped array.

Raises
------
ValueError
    If the data or the mode is invalid.
IOError
    If the file is not found or cannot be opened correctly.

See Also
--------
numpy.memmap
"""
Nav