Following the documentation for shared memory here, I have implemented a minimal example of accessing NumPy arrays backed by shared memory from a function called by worker processes in a pool. My assumption is that this code should incur minimal memory overhead per additional worker (there is some overhead from copying the interpreter state and non-shared variables, but the 16 GB of array data should not be copied):
    import numpy as np
    from multiprocessing import Pool, shared_memory
    from itertools import product
    from tqdm import tqdm

    if __name__ == "__main__":
        # Two 8 GB float32 arrays backed by shared memory (16 GB total).
        a_shared_memory = shared_memory.SharedMemory(create=True, size=8_000_000_000)
        a = np.ndarray((20, 100, 100, 100, 100), np.float32, buffer=a_shared_memory.buf)
        b_shared_memory = shared_memory.SharedMemory(create=True, size=8_000_000_000)
        b = np.ndarray((20, 100, 100, 100, 100), np.float32, buffer=b_shared_memory.buf)

        def test_func(args):
            # Index into both arrays and add; the result is discarded.
            a[args] + b[args[:-1]]

        with tqdm(total=20 * 100 * 100 * 100) as pbar:
            with Pool(16) as pool:
                for _ in pool.imap_unordered(test_func,
                                             product(range(20), range(100), range(100), range(100)),
                                             chunksize=16):
                    pbar.update()
However, in practice, when running this code, memory usage grows in each process over time, both in the RES and the SHR memory metrics as reported by top. (The rate at which memory accumulates can be changed by varying the size of the slices selected inside test_func.)
This behavior is confusing to me: these arrays live in shared memory, so I would assume that taking a view of them shouldn't incur any memory allocation (I am testing on Linux, where copy-on-write means that merely reading should not cause any copying). Further, I don't even store the results of this computation anywhere, so it is unclear why memory is being allocated.
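As a sanity check on that assumption (a separate snippet, not part of the script above; x is just a small stand-in for the 8 GB shared array), basic tuple indexing does return views that alias the original buffer:

    import numpy as np

    x = np.zeros((2, 3, 4, 5, 6), dtype=np.float32)  # small stand-in for the shared array
    v = x[0, 0, 0, 0]                # basic integer indexing -> a view, not a copy
    print(v.base is x)               # True: v is a view onto x
    print(np.shares_memory(v, x))    # True: no new data buffer was allocated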
Two further notes:
According to this answer, even reading / accessing an array from shared memory will force a copy-on-write, since the refcount must be updated. However, that should only affect the memory page holding the object header, which is about 4 kB. Why does memory continue to grow?
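To illustrate why I expect only the header page to be affected (my own check, possibly based on a wrong mental model): the refcount lives in the small ndarray object, while the 8 GB of data sit at a different address inside the SharedMemory mapping.

    # id(a) is the address of the ndarray object itself; the refcount that
    # copy-on-write would touch lives in that object's header, not in the buffer.
    print(hex(id(a)))
    print(hex(a.__array_interface__['data'][0]))  # address of the 8 GB data buffer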
If I simply change the code in the following way:
    def test_func(args):
        a[args], b[args[:-1]]

the issue resolves: there is no memory overhead (i.e. memory stays shared) and no increasing memory allocation over time.
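My current reading of this difference (which may be wrong, and the update below suggests it isn't the whole story) is that indexing alone only constructs views, while + broadcasts the operands and materializes a brand-new result array in the worker's private heap. A quick check outside the pool:

    # a[args] has shape (100,) and b[args[:-1]] has shape (100, 100); adding
    # them broadcasts to (100, 100) and allocates fresh memory for the result.
    res = a[0, 0, 0, 0] + b[0, 0, 0]
    print(res.shape)                 # (100, 100)
    print(res.base is None)          # True: res owns newly allocated memory
    print(np.shares_memory(res, a))  # False: the result does not alias the shared buffer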
I've tried to present the simplest, most intuitive application of the documentation to multiprocessing with shared memory, yet it remains very unclear to me how and why it isn't working as expected. I would like to perform some simple calculations in test_func, including viewing the shared memory, addition, matrix-vector multiplication, etc. Any help in getting a better grasp of how to use shared memory correctly would be much appreciated.
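For context, one workaround I've been sketching (the names and structure here are my own, not from the docs) reuses a preallocated per-worker buffer via out=, so each call at least avoids allocating a fresh temporary:

    _out = None  # per-worker scratch buffer, allocated lazily in each forked worker

    def test_func(args):
        global _out
        if _out is None:
            _out = np.empty((100, 100), dtype=np.float32)  # broadcast result shape
        # Write the broadcast sum into the reused buffer instead of a new temporary.
        np.add(a[args], b[args[:-1]], out=_out)

This doesn't explain the growth, though, so I'd still like to understand what is actually being allocated.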
Update:
When I change the test_func code to a[0, 0, 0, 0] + b[0, 0, 0], the issue disappears. Does this mean that there is some reference counter in the middle of the NumPy arrays, such that when args changes, different parts of the arrays are accessed and memory increases, but when the indexes are always the same, memory doesn't increase?
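To compare the two cases with something more precise than eyeballing top, I've been reading VmRSS from /proc/self/status inside the workers (Linux-only; this helper is my own, not from the shared_memory docs):

    def rss_kb():
        """Return this process's resident set size in kB (Linux only)."""
        with open("/proc/self/status") as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])

    def test_func(args):
        before = rss_kb()
        a[args] + b[args[:-1]]
        return rss_kb() - before  # resident-memory growth caused by this call

Summing the returned deltas over the imap_unordered loop should make it easy to see whether the growth really tracks which parts of the arrays are touched.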