7

Let's say I have a big matrix saved on disk. Storing it all in memory is not really feasible, so I use memmap to access it:

A = np.memmap(filename, dtype='float32', mode='r', shape=(3000000,162))

Now let's say I want to iterate over this matrix (not necessarily in an ordered fashion) such that each row is accessed exactly once.

p = some_permutation_of_0_to_2999999()

I would like to do something like this:

start = 0
end = 3000000
num_rows_to_load_at_once = some_size_that_will_fit_in_memory()
while start < end:
    indices_to_access = p[start:start+num_rows_to_load_at_once]
    do_stuff_with(A[indices_to_access, :])
    start = min(end, start+num_rows_to_load_at_once)

As this process goes on, my computer becomes slower and slower, and my RAM and virtual memory usage explode.

Is there some way to force np.memmap to use up to a certain amount of memory? (I know I won't need more than the amount of rows I'm planning to read at a time and that caching won't really help me since I'm accessing each row exactly once)

Maybe instead there is some other way to iterate (generator-like) over an np array in a custom order? I could write it manually using file.seek, but it happens to be much slower than the np.memmap implementation.
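For instance, a chunked generator along these lines (just a sketch; `iter_rows_permuted` is a made-up name) keeps only one chunk's worth of rows in RAM at a time. Sorting the indices inside each chunk is optional but makes the memmap reads closer to sequential on disk:

```python
import numpy as np

def iter_rows_permuted(A, p, chunk_size):
    """Yield successive blocks of A's rows in the order given by permutation p."""
    for start in range(0, len(p), chunk_size):
        # fancy indexing on a memmap copies just these rows into RAM;
        # sorting them makes the disk reads closer to sequential
        idx = np.sort(p[start:start + chunk_size])
        yield A[idx, :]

# usage, with an ordinary array standing in for the memmap
A = np.arange(20, dtype='float32').reshape(10, 2)
p = np.random.permutation(10)
for block in iter_rows_permuted(A, p, 4):
    pass  # do_stuff_with(block)
```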

do_stuff_with() does not keep any reference to the array it receives, so there are no "memory leaks" in that respect.

thanks

user2717954
  • Try to [flush](https://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.flush.html), maybe that helps. – yar Jul 16 '17 at 20:48
  • 1
    @yar I'll try but it sounds weird, after all my memmap is read only so flush shouldn't really have any effect – user2717954 Jul 17 '17 at 05:21
  • @yar flush did not work. rss memory usage stays the same – user2717954 Jul 17 '17 at 05:42
  • Are you on Windows? I get the same issue on Windows; on Linux everything works as expected (the caching is done by the OS, not by the Python interpreter). Would using HDF5 (h5py) also be a possibility for you? – max9111 Jul 18 '17 at 11:31
  • @max9111 I'm on Linux (Debian GNU/Linux 8 (jessie) 64-bit, to be precise). Not sure what HDF5 is; I'll look into it – user2717954 Jul 18 '17 at 12:05

2 Answers

8

This has been an issue that I've been trying to deal with for a while. I work with large image datasets and numpy.memmap offers a convenient solution for working with these large sets.

However, as you've pointed out, if I need to access each frame (or row in your case) to perform some operation, RAM usage will max out eventually.

Fortunately, I recently found a solution that will allow you to iterate through the entire memmap array while capping the RAM usage.

Solution:

import numpy as np

# create a memmap array
input = np.memmap('input', dtype='uint16', shape=(10000,800,800), mode='w+')

# create a memmap array to store the output
output = np.memmap('output', dtype='uint16', shape=(10000,800,800), mode='w+')

def iterate_efficiently(input, output, chunk_size):
    # create an empty array to hold each chunk
    # the size of this array will determine the amount of RAM usage
    holder = np.zeros([chunk_size,800,800], dtype='uint16')

    # iterate through the input in chunks, operate, and write to output
    # (assumes chunk_size divides input.shape[0] evenly)
    for i in range(0, input.shape[0], chunk_size):
        holder[:] = input[i:i+chunk_size]   # read one chunk from input
        holder += 5                         # perform some operation
        output[i:i+chunk_size] = holder     # write chunk to output

def iterate_inefficiently(input, output):
    output[:] = input[:] + 5

Timing Results:

In [11]: %timeit iterate_efficiently(input,output,1000)
1 loop, best of 3: 1min 48s per loop

In [12]: %timeit iterate_inefficiently(input,output)
1 loop, best of 3: 2min 22s per loop

The size of the array on disk is ~12GB. Using the iterate_efficiently function keeps the memory usage to 1.28GB whereas the iterate_inefficiently function eventually reaches 12GB in RAM.

This was tested on Mac OS.

Jack
  • 1
    You're building an in-memory array the size of `input` with `input[:] + 5`. – user2357112 Aug 23 '17 at 06:37
  • 1
    Also, the questioner claims to already be processing input in chunks. – user2357112 Aug 23 '17 at 06:38
  • @user2357112, he is doing it in his iterate_inefficiently function for comparison. (I haven't read thoroughly the iterate_efficiently yet so can't say much about it) – user2717954 Aug 23 '17 at 07:57
  • 1
    @Jack seems the only difference between your approach and mine is that you allocate the holder once where I do it in each iteration. I tried your approach but the memory usage was basically the same. thanks for the effort though – user2717954 Aug 23 '17 at 10:21
  • @user2717954 How are you measuring the RAM usage? I tested this on Linux (Ubuntu) as well and it seems to be working fine as determined by the resource monitor, (2.2GB/8GB) throughout the entire iteration. My computer remains responsive and does not slow down by any noticeable amount. – Jack Aug 23 '17 at 16:35
  • @Jack I'm examining the /proc/self/status – user2717954 Aug 24 '17 at 05:39
  • I'm trying to use it for a data generator but for some reason it is not working. – Marlon Teixeira Aug 25 '20 at 12:51
  • https://stackoverflow.com/questions/63584973/how-to-use-numpy-memmap-inside-keras-generator-to-not-exceed-ram-memory – Marlon Teixeira Aug 26 '20 at 00:06
8

I've been experimenting with this problem for a couple of days now, and it appears there are two ways to control memory consumption using np.memmap. The first is reliable, while the second would require some testing and will be OS-dependent.

Option 1 - reconstruct the memory map with each read / write:

def MoveMMapNPArray(data, output_filename):
    CHUNK_SIZE = 4096
    for idx in range(0, data.shape[1], CHUNK_SIZE):
        # re-create both maps on every chunk so the old ones can be freed
        x = np.memmap(data.filename, dtype=data.dtype, mode='r', shape=data.shape, order='F')
        y = np.memmap(output_filename, dtype=data.dtype, mode='r+', shape=data.shape, order='F')
        end = min(idx+CHUNK_SIZE, data.shape[1])
        y[:,idx:end] = x[:,idx:end]

Where data is of type np.memmap. Discarding the memmap object with each read keeps the array's pages from accumulating in memory, and will keep memory consumption very low as long as the chunk size is small. It likely introduces some CPU overhead, but this was found to be small on my setup (MacOS).
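The same discard-and-remap trick can be mapped onto the question's read-only, permuted access pattern. A minimal sketch (file name, sizes, and `do_stuff_with` are stand-ins, not the question's real data):

```python
import os
import tempfile

import numpy as np

# Small stand-in for the question's 3000000x162 matrix.
rows, cols, chunk = 1000, 162, 200
path = os.path.join(tempfile.mkdtemp(), 'big.dat')
np.random.rand(rows, cols).astype('float32').tofile(path)

p = np.random.permutation(rows)        # visit each row exactly once

def do_stuff_with(block):
    return block.sum(dtype='float64')  # placeholder for the real per-chunk work

total = 0.0
for start in range(0, rows, chunk):
    # re-create the memmap for every chunk so the previous mapping
    # (and the pages cached behind it) can be released
    A = np.memmap(path, dtype='float32', mode='r', shape=(rows, cols))
    idx = np.sort(p[start:start + chunk])        # sorted reads are closer to sequential
    total += do_stuff_with(np.array(A[idx, :]))  # copy out, then drop the map
    del A
```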

Option 2 - construct the mmap buffer yourself and provide memory advice

If you look at the np.memmap source code here, you can see that it is relatively simple to create your own memmapped numpy array. Specifically, with this snippet:

mm = mmap.mmap(fid.fileno(), bytes, access=acc, offset=start)
mmap_np_array = ndarray.__new__(subtype, shape, dtype=descr, buffer=mm, offset=array_offset, order=order)

Note that this Python mmap instance is stored as the np.memmap's private _mmap attribute.

With access to the Python mmap object, and Python 3.8+, you can use its madvise method, described here.

This allows you to advise the OS to free memory where available. The various madvise constants are described here for linux, with some generic cross platform options specified.

The MADV_DONTDUMP constant looks promising but I haven't tested memory consumption with it like I have for option 1.
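As a sketch of the idea, here is one way to reach the underlying mmap through the private `_mmap` attribute (unsupported API, so it may change between numpy versions). This example uses MADV_DONTNEED rather than MADV_DONTDUMP, since on Linux it asks the kernel to drop the cached pages immediately; the file paths and sizes are made up:

```python
import mmap
import os
import tempfile

import numpy as np

# File-backed stand-in for the big read-only matrix.
path = os.path.join(tempfile.mkdtemp(), 'input.dat')
np.arange(1000 * 162, dtype='float32').tofile(path)
A = np.memmap(path, dtype='float32', mode='r', shape=(1000, 162))

block = np.array(A[:100])   # copy the rows we need into ordinary RAM

# `_mmap` is np.memmap's private handle on the underlying mmap object.
# madvise needs Python 3.8+, and MADV_DONTNEED is platform-specific,
# hence the guards.
if hasattr(A._mmap, 'madvise') and hasattr(mmap, 'MADV_DONTNEED'):
    A._mmap.madvise(mmap.MADV_DONTNEED)

# Dropped pages are transparently re-read from disk on the next access.
print(float(A[0, 0]))   # prints 0.0
```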

Matt
  • https://stackoverflow.com/questions/63584973/how-to-use-numpy-memmap-inside-keras-generator-to-not-exceed-ram-memory – Marlon Teixeira Aug 26 '20 at 00:11
  • Additional Note: If you're testing memory and CPU, I'd encourage using [Memory Profiler](https://pypi.org/project/memory-profiler/) and [Line Profiler](https://pypi.org/project/line-profiler/) respectively. – Matt Oct 02 '20 at 04:25