Following the documentation for shared memory here, I have implemented a minimal example of accessing NumPy arrays backed by shared memory from a function called by worker processes in a pool. My assumption is that this code should incur minimal memory overhead per additional worker (there is some overhead from copying the interpreter state and non-shared variables, but the 16 GB of array data should not be copied):
    import numpy as np
    from multiprocessing import Pool, shared_memory
    from itertools import product
    from tqdm import tqdm

    if __name__ == "__main__":
        # Two 8 GB float32 arrays backed by shared memory (16 GB total).
        a_shared_memory = shared_memory.SharedMemory(create=True, size=8_000_000_000)
        a = np.ndarray((20, 100, 100, 100, 100), np.float32, buffer=a_shared_memory.buf)
        b_shared_memory = shared_memory.SharedMemory(create=True, size=8_000_000_000)
        b = np.ndarray((20, 100, 100, 100, 100), np.float32, buffer=b_shared_memory.buf)

        def test_func(args):
            # Index into both arrays and add; the result is discarded.
            a[args] + b[args[:-1]]

        with tqdm(total=20 * 100 * 100 * 100) as pbar:
            with Pool(16) as pool:
                for _ in pool.imap_unordered(test_func,
                                             product(range(20), range(100), range(100), range(100)),
                                             chunksize=16):
                    pbar.update()
However, in practice, when running this code, memory usage grows in each process over time, both in the RES and the SHR memory metrics as reported by top. (The rate at which memory accumulates can be changed by varying the size of the slices selected inside test_func.)
This behavior is confusing to me: these arrays live in shared memory, so I would assume that taking a view of them shouldn't incur any memory allocation (I am testing on Linux, where copy-on-write means that merely reading should not cause any copying). Further, I don't even store the results of this computation anywhere, so it is unclear why memory is being allocated.
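As a sanity check on that assumption (a separate snippet, not part of the script above; x is just a small stand-in for the 8 GB shared array), basic tuple indexing does return views that alias the original buffer:

    import numpy as np

    x = np.zeros((2, 3, 4, 5, 6), dtype=np.float32)  # small stand-in for the shared array
    v = x[0, 0, 0, 0]                # basic integer indexing -> a view, not a copy
    print(v.base is x)               # True: v is a view onto x
    print(np.shares_memory(v, x))    # True: no new data buffer was allocated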
Two further notes:
According to this answer, even reading / accessing an array from shared memory will force a copy-on-write, since the refcount must be updated. However, that should only affect the memory page holding the object header, which is about 4 kB. Why does memory continue to grow?
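To illustrate why I expect only the header page to be affected (my own check, possibly based on a wrong mental model): the refcount lives in the small ndarray object, while the 8 GB of data sit at a different address inside the SharedMemory mapping.

    # id(a) is the address of the ndarray object itself; the refcount that
    # copy-on-write would touch lives in that object's header, not in the buffer.
    print(hex(id(a)))
    print(hex(a.__array_interface__['data'][0]))  # address of the 8 GB data buffer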
If I simply change the code in the following way:
    def test_func(args):
        a[args], b[args[:-1]]

the issue resolves: there is no memory overhead (i.e. memory stays shared) and no increasing memory allocation over time.
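My current reading of this difference (which may be wrong, and the update below suggests it isn't the whole story) is that indexing alone only constructs views, while + broadcasts the operands and materializes a brand-new result array in the worker's private heap. A quick check outside the pool:

    # a[args] has shape (100,) and b[args[:-1]] has shape (100, 100); adding
    # them broadcasts to (100, 100) and allocates fresh memory for the result.
    res = a[0, 0, 0, 0] + b[0, 0, 0]
    print(res.shape)                 # (100, 100)
    print(res.base is None)          # True: res owns newly allocated memory
    print(np.shares_memory(res, a))  # False: the result does not alias the shared buffer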
I've tried to present the simplest, most intuitive application of the documentation to multiprocessing with shared memory, yet it remains very unclear to me how and why it isn't working as expected. I would like to perform some simple calculations in test_func, including viewing the shared memory, addition, matrix-vector multiplication, etc. Any help in getting a better grasp of how to use shared memory correctly would be much appreciated.
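For context, one workaround I've been sketching (the names and structure here are my own, not from the docs) reuses a preallocated per-worker buffer via out=, so each call at least avoids allocating a fresh temporary:

    _out = None  # per-worker scratch buffer, allocated lazily in each forked worker

    def test_func(args):
        global _out
        if _out is None:
            _out = np.empty((100, 100), dtype=np.float32)  # broadcast result shape
        # Write the broadcast sum into the reused buffer instead of a new temporary.
        np.add(a[args], b[args[:-1]], out=_out)

This doesn't explain the growth, though, so I'd still like to understand what is actually being allocated.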
Update:
When I change the test_func code to a[0, 0, 0, 0] + b[0, 0, 0], the issue disappears. Does this mean that there is some reference counter in the middle of the NumPy arrays, such that when args changes, different parts of the arrays are accessed and memory increases, but when the indexes are always the same, memory doesn't increase?
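To compare the two cases with something more precise than eyeballing top, I've been reading VmRSS from /proc/self/status inside the workers (Linux-only; this helper is my own, not from the shared_memory docs):

    def rss_kb():
        """Return this process's resident set size in kB (Linux only)."""
        with open("/proc/self/status") as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])

    def test_func(args):
        before = rss_kb()
        a[args] + b[args[:-1]]
        return rss_kb() - before  # resident-memory growth caused by this call

Summing the returned deltas over the imap_unordered loop should make it easy to see whether the growth really tracks which parts of the arrays are touched.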