
I'm trying to load some larger .npy files in cupy with memory-mapped mode, but I keep running into OutOfMemoryError.

I thought that since the file is being opened in memory-mapped mode, this operation shouldn't take much memory, because a memory map doesn't actually load the whole array into RAM.

I can load these files with np.load just fine; this only seems to happen with cupy.load. My environment is Google Colab, with a Tesla K80 GPU. It has about 12 GB of CPU RAM, 12 GB of GPU RAM, and 350 GB of disk space.

Here is a minimal example to reproduce the error:

import os
import numpy as np
import cupy

# Create .npy files (~5 GB each) on disk.
for i in range(4):
    numpyMemmap = np.memmap( 'reg.memmap'+str(i), dtype='float32', mode='w+', shape=( 10000000 , 128 ))
    np.save( 'reg.memmap'+str(i) , numpyMemmap )
    del numpyMemmap
    os.remove( 'reg.memmap'+str(i) )

# Check if they load correctly with np.load.
NPYmemmap = []
for i in range(4):
    NPYmemmap.append( np.load( 'reg.memmap'+str(i)+'.npy' , mmap_mode = 'r+' )  )
del NPYmemmap

# Eventually results in memory error. 
CPYmemmap = []
for i in range(4):
    print(i)
    CPYmemmap.append( cupy.load( 'reg.memmap'+str(i)+'.npy' , mmap_mode = 'r+' )  )

Output:

0
1
/usr/local/lib/python3.6/dist-packages/cupy/creation/from_data.py:41: UserWarning: Using synchronous transfer as pinned memory (5120000000 bytes) could not be allocated. This generally occurs because of insufficient host memory. The original error was: cudaErrorMemoryAllocation: out of memory
  return core.array(obj, dtype, copy, order, subok, ndmin)
2
3
---------------------------------------------------------------------------
OutOfMemoryError                          Traceback (most recent call last)
<ipython-input-4-b5c849e2adba> in <module>()
      2 for i in range(4):
      3     print(i)
----> 4     CPYmemmap.append( cupy.load( 'reg.memmap'+str(i)+'.npy' , mmap_mode = 'r+' )  )

1 frames
/usr/local/lib/python3.6/dist-packages/cupy/io/npz.py in load(file, mmap_mode)
     47     obj = numpy.load(file, mmap_mode)
     48     if isinstance(obj, numpy.ndarray):
---> 49         return cupy.array(obj)
     50     elif isinstance(obj, numpy.lib.npyio.NpzFile):
     51         return NpzFile(obj)

/usr/local/lib/python3.6/dist-packages/cupy/creation/from_data.py in array(obj, dtype, copy, order, subok, ndmin)
     39 
     40     """
---> 41     return core.array(obj, dtype, copy, order, subok, ndmin)
     42 
     43 

cupy/core/core.pyx in cupy.core.core.array()

cupy/core/core.pyx in cupy.core.core.array()

cupy/core/core.pyx in cupy.core.core.ndarray.__init__()

cupy/cuda/memory.pyx in cupy.cuda.memory.alloc()

cupy/cuda/memory.pyx in cupy.cuda.memory.MemoryPool.malloc()

cupy/cuda/memory.pyx in cupy.cuda.memory.MemoryPool.malloc()

cupy/cuda/memory.pyx in cupy.cuda.memory.SingleDeviceMemoryPool.malloc()

cupy/cuda/memory.pyx in cupy.cuda.memory.SingleDeviceMemoryPool._malloc()

cupy/cuda/memory.pyx in cupy.cuda.memory._try_malloc()

OutOfMemoryError: out of memory to allocate 5120000000 bytes (total 20480000000 bytes)

I am also wondering if this is perhaps related to Google Colab and its environment/GPU.

For convenience, here is a Google Colab notebook with this minimal example:

https://colab.research.google.com/drive/12uPL-ZnKhGTJifZGVdTN7e8qBRRus4tA

SantoshGupta7
  • So it sounds like pinning any file on disk will take the same amount of space in the GPU RAM as it does on the disk. So, would this mean there is no advantage of having GPU memory-mapped data over loading that data right into the GPU RAM? What would be the fastest method for a GPU to save to/from a disk, if the data were bigger than the GPU RAM? – SantoshGupta7 Sep 02 '19 at 01:04
  • This answers my question exactly; if you submit this as an answer I will accept it. I am trying to study more about the processes you describe. I Googled 'cuda pin memory cpu gpu ram' but none of the results mentioned (as far as I could tell) that pinning requires CPU, not GPU, memory. If there's a source that you specifically recommend for my situation, let me know. – SantoshGupta7 Sep 02 '19 at 01:20

1 Answer


The numpy.load mechanism for a memory-mapped disk file may not require the entire file to be loaded from disk into host memory.

However, it appears that cupy.load requires the entire file to fit first in host memory, then in device memory.

Your particular test case appears to create 4 disk files of roughly 5 GB each. These won't all fit in either host or device memory if you have 12 GB of each. Therefore I would expect things to fail on the third file load, if not earlier.

It may be possible to use your numpy.load mechanism with mapped memory, and then selectively move portions of that data to the GPU with cupy operations. In that case, the data size on the GPU would still be limited to GPU RAM, for the usual things like cupy arrays.
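
For illustration, here is a minimal sketch of that idea (my own example, not taken from the question's code): keep the array memory-mapped on the host with numpy, and copy only one chunk at a time to the GPU with cupy.asarray. The file name and chunk size below are hypothetical choices.

import numpy as np
import cupy

path = 'reg.memmap0.npy'      # one of the files from the question (hypothetical choice)
chunk_rows = 1000000          # hypothetical chunk size; tune it to fit in GPU memory

host_arr = np.load(path, mmap_mode='r')    # stays on disk; pages are read on demand

for start in range(0, host_arr.shape[0], chunk_rows):
    # Only this slice is materialized on the host and copied to the device.
    chunk = cupy.asarray(host_arr[start:start + chunk_rows])
    # ... run cupy operations on `chunk` here ...
    del chunk                              # let the memory pool reuse this device memory

Only one chunk occupies GPU memory at a time; overall speed will be governed by disk reads and host-to-device transfers.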

Even if you could use CUDA pinned "zero-copy" memory, it would still be limited to the host memory size (12 GB, here) or less.
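
As a rough sketch of what staging through pinned host memory might look like (again my own illustration, not part of the original answer; the file name, chunk size, and shapes are hypothetical, taken from the question's arrays), note that the pinned staging buffer itself lives in host RAM, so it is bounded by host memory no matter how large the file on disk is:

import numpy as np
import cupy

rows, cols = 10000000, 128            # matches the arrays created in the question
chunk_rows = 1000000                  # hypothetical staging-buffer size

# Allocate a pinned (page-locked) host buffer and view it as a numpy array.
nbytes = chunk_rows * cols * np.dtype('float32').itemsize
pinned = cupy.cuda.alloc_pinned_memory(nbytes)
staging = np.frombuffer(pinned, dtype='float32', count=chunk_rows * cols).reshape(chunk_rows, cols)

src = np.load('reg.memmap0.npy', mmap_mode='r')
gpu_chunk = cupy.empty((chunk_rows, cols), dtype='float32')

for start in range(0, rows, chunk_rows):
    staging[:] = src[start:start + chunk_rows]   # disk -> pinned host buffer
    gpu_chunk.set(staging)                       # pinned host -> device copy
    # ... process gpu_chunk on the GPU here ...

Whether this is faster than plain cupy.asarray depends on the workload; the point is only that the staging buffer is host memory, which is what bounds the working set.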

Robert Crovella
  • "If your desire is simply to load data from disk, it's not necessary to use this method in cupy or CUDA. " In my case, I'll need to both read and write to disk. Usually the normal methods cover reading. My case will do both a write and read at every training step (of which there could be hundreds of thousands) so the process needs to be very fast. I believe I tested np.memmap <--> pytorch a while back and found that process to be too slow. – SantoshGupta7 Sep 02 '19 at 02:01
  • Well, if you want to use the method you have outlined, and you want to do it on that google colab instance, see if you can work with ~5GB of memory instead of 20GB. You fundamentally don't have enough space to do 20GB. This would be no trouble if you had a Tesla card mounted in a GPU server with say, 128GB of host memory. And if you've read any of the discussion from a google search, you'll find that pinned memory in CUDA is "slow" by comparison to ordinary GPU memory allocations. – Robert Crovella Sep 02 '19 at 02:08
  • My case is that I am training large recommender systems, say 10 million items, each represented by a 128-dimensional embedding. The Colab GPU doesn't seem to be able to handle this many parameters, because I get a "RuntimeError: CUDA error: device-side assert triggered" during training. So while the method I am pursuing is slower, it is decent enough that I can train over my data in reasonable time. Google Colab also now offers 25 GB instances; it looks like you have to crash your system first. Thanks to your insight I know to pursue this option. – SantoshGupta7 Sep 02 '19 at 02:22
  • I got this message from one of the CuPy developers: "[Ryosuke Okuta, chainer] CuPy can't handle mmap memory. So, CuPy uses GPU memory directly in default. https://docs-cupy.chainer.org/en/stable/reference/generated/cupy.cuda.MemoryPool.html#cupy.cuda.MemoryPool.malloc You can change default memory allocator if you want to use Unified Memory." What do you make of this? – SantoshGupta7 Sep 02 '19 at 02:31
  • I just tried running my code with `cupy.cuda.set_allocator(MemoryPool(malloc_managed).malloc)` at the top but didn't see any noticeable difference – SantoshGupta7 Sep 02 '19 at 03:04
  • I think there's a good chance my description is not 100% accurate after reading the doc on `cupy.load` again. I have edited it to remove the questionable statements. In a nutshell, according to my read of `cupy.load`, you will be limited *both* by host and device memory, and it's not clear to me that pinned memory is central to the issue. – Robert Crovella Sep 02 '19 at 03:40
  • I tried with the new 25.51 GB RAM (for the CPU; the GPU is still 12-something GB, I believe) and I got an `OutOfMemoryError: out of memory to allocate 4403458048 bytes (total 13210374144 bytes)` error, when the CPU RAM was only at 16 GB. The GPU memory didn't change at all; it remained 0.32 GB. This was with 4 instances of 2.2 million classes with embedding size 512. I tried both with and without `cupy.cuda.set_allocator(cupy.cuda.MemoryPool(cupy.cuda.memory.malloc_managed).malloc)` to use 'Unified Memory'. Either way, it seems like both the CPU and GPU RAM are very underutilized. – SantoshGupta7 Sep 02 '19 at 06:02
  • If you're interested, I made a follow up here https://stackoverflow.com/questions/57752516/how-to-use-cuda-pinned-zero-copy-memory-for-a-memory-mapped-file – SantoshGupta7 Sep 02 '19 at 06:50