
I am dealing with a memory leak in my program. I found it has to do with multiprocessing, so I've come up with the following experiment.

In the experiment, the function f generates a list and a tuple, and I check whether their ids remain unchanged after the function returns.

The most efficient way for a program to return a value is to return a reference, which avoids allocating memory for identical objects. When SYNC = True, the results show that the ids printed inside f are equal to the ids received outside.

However, when SYNC = False and multiprocessing enters the picture, the inner ids no longer equal the outer ids. This suggests the program has created extra copies of the objects.

This essentially causes two problems:
1. Memory and computing power are wasted duplicating the objects.
2. The copies kept in the Pool are never garbage collected (I found this through other experiments).

Can anyone explain how Python handles this, and how I can keep my program from devouring memory after introducing multiprocessing?

from multiprocessing import Pool

SYNC = False

def f(start):
    l = [i for i in range(start, start+100)] # generate a list of start .. start+99
    t = tuple(i for i in range(start, start+100)) # generate a tuple of start .. start+99
    print('inner: {}'.format(id(l)))
    print('inner: {}'.format(id(t)))
    return l, t

def iterate(it):
    for l, t in it:
        print('outer: {}'.format(id(l)))
        print('outer: {}'.format(id(t)))

pool = Pool(4)
inputs = [i for i in range(4)]

gen_sync = (f(start) for start in inputs)
gen_async = pool.imap(f, inputs, chunksize=4)

if SYNC:
    print('start testing sync')
    iterate(gen_sync)
else:
    print('start testing async')
    iterate(gen_async)

SYNC = True

start testing sync
inner: 139905123267144
inner: 23185048
outer: 139905123267144
outer: 23185048
inner: 139905123249544
inner: 23186776
outer: 139905123249544
outer: 23186776
inner: 139905123267144
inner: 23187640
outer: 139905123267144
outer: 23187640
inner: 139905123249544
inner: 23185912
outer: 139905123249544
outer: 23185912
inner: 139905142421000
inner: 23180456
inner: 139905123267144
inner: 23182184
inner: 139905123249544
inner: 23183912
inner: 139905123249800
inner: 23185640

SYNC = False

start testing async
inner: 139699492382216
inner: 38987640
inner: 139699490987656
inner: 38989368
inner: 139699490985992
inner: 38991096
inner: 139699490986120
inner: 38992824
outer: 139699490985992
outer: 139699180021064
outer: 139699490986120
outer: 139699180022888
outer: 139699473207560
outer: 139699180024712
outer: 139699473207880
outer: 139699180026536

PeterHsu
    Well, it's `multiprocessing`... your function is running in multiple Python interpreters. To pass data to the function and return it, the data must be serialized and deserialized, which is a form of copying. It couldn't work any other way. – kindall Apr 17 '19 at 04:26
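To make that serialization point concrete, here is a minimal sketch (not from the thread, standard library only) showing that a pickle round trip, the same kind Pool performs when moving data between processes, always builds a brand-new object:

import pickle

original = list(range(100))
restored = pickle.loads(pickle.dumps(original))  # serialize, then deserialize

print(id(original), id(restored))  # two different ids: a new object was constructed
print(original == restored)        # True: equal contents, but separate memory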

1 Answer


I don't think you've understood how multiprocessing works. multiprocessing spins up new Python processes to run your code. Each process has its own memory space. When you pass inputs to the map, each process gets a copy of the data in its own memory space. See this answer, which talks about it: Python multiprocessing and a shared counter
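You can observe the separate memory spaces directly. Here is a minimal sketch (not part of the original answer; the helper name where is made up) that prints the process id and the object id on both sides of a Pool call:

from multiprocessing import Pool
import os

def where(x):
    # Runs inside a worker process: the argument has already been pickled,
    # sent over a pipe, and unpickled into this process's own memory.
    return os.getpid(), id(x)

if __name__ == '__main__':
    data = list(range(100))
    print('parent:', os.getpid(), id(data))
    with Pool(2) as pool:
        for pid, obj_id in pool.map(where, [data, data]):
            print('worker:', pid, obj_id)  # different pid, different id than the parent

The worker lines show a different pid and a different id from the parent, which is exactly why the inner and outer ids in your experiment stop matching once the Pool is involved.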

If you really want a single copy of the data, you should use shared memory. WARNING: it's quite bothersome to use.

https://docs.python.org/dev/library/multiprocessing.shared_memory.html

Here's an example from the docs:

>>> with SharedMemoryManager() as smm:
...     sl = smm.ShareableList(range(2000))
...     # Divide the work among two processes, storing partial results in sl
...     p1 = Process(target=do_work, args=(sl, 0, 1000))
...     p2 = Process(target=do_work, args=(sl, 1000, 2000))
...     p1.start()
...     p2.start()  # A multiprocessing.Pool might be more efficient
...     p1.join()
...     p2.join()   # Wait for all work to complete in both processes
...     total_result = sum(sl)  # Consolidate the partial results now in sl
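The docs snippet assumes Process, SharedMemoryManager, and a do_work function are already defined. A minimal, hypothetical worker that would fit that example might look like this (the squaring is just a stand-in for real work):

from multiprocessing import Process
from multiprocessing.managers import SharedMemoryManager

def do_work(shared_list, start, end):
    # Hypothetical worker: mutate the ShareableList in place. The list lives
    # in a single shared memory block, so the parent sees the partial results
    # without any data being copied between processes.
    for i in range(start, end):
        shared_list[i] = shared_list[i] ** 2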
rdas
  • Thanks for the information, I'll go through more documentation about multiprocessing. How about the garbage collection part? The Pool in my program retrieves AI training data and will not be terminated while training is still running. As a result, memory usage keeps growing; I suspect this has to do with the garbage collection mechanism in the Pool. – PeterHsu Apr 17 '19 at 04:37