Though the script does use quite a bit of memory even with the "smaller" example values, the answer to

Does Python clone my entire environment each time the parallel processes are run, including the variables not required by f()? How can I prevent this behaviour?

is that it does, in a way, clone the environment by forking a new process, but if copy-on-write (COW) semantics are available, no physical memory actually has to be copied until it is written to. For example, on this system
% uname -a
Linux mypc 4.2.0-27-generic #32-Ubuntu SMP Fri Jan 22 04:49:08 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
COW seems to be available and in use, but this may not be the case on other systems. On Windows the situation is strictly different, since a new Python interpreter is executed from the .exe instead of forking. Since you mention using htop, you are running some flavour of UNIX or a UNIX-like system, and you get COW semantics.
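If you want to confirm which behaviour you are getting, the start method used by multiprocessing can be inspected (and overridden) through its standard API. A minimal sketch:

import multiprocessing as mp

if __name__ == '__main__':
    # 'fork' means children inherit the parent's memory via COW;
    # 'spawn' (the only option on Windows) starts a fresh interpreter
    # and pickles the arguments instead.
    print(mp.get_start_method())

    # Uncommenting this on a UNIX system reproduces the Windows-style
    # behaviour, at the cost of losing COW sharing:
    # mp.set_start_method('spawn')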
For each iteration of the for loop the processes in simulation() each
have a memory usage equal to the total memory used by my code.
The spawned processes will display almost identical values of RSS, but this can be misleading: for the most part they occupy the same physical memory, mapped into multiple processes, as long as no writes occur. With Pool.map the story is a bit more complicated, since it "chops the iterable into a number of chunks which it submits to the process pool as separate tasks". This submitting happens over IPC, and the submitted data will be copied. In your example the IPC and the 2**20 function calls also dominate the CPU usage. Replacing the mapping with a single vectorized multiplication in simulation took the script's runtime from around 150 s to 0.66 s on this machine.
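Roughly, that change has the following shape (a sketch with illustrative names f, scale, data and simulation_*; your f may of course do something more involved):

import numpy as np
from multiprocessing import Pool


def f(x, scale=2.0):
    # Illustrative stand-in for the per-element work done by the mapping.
    return x * scale


def simulation_mapped(data):
    # 2**20 calls to f; every argument and result travels over IPC.
    with Pool() as pool:
        return np.array(pool.map(f, data))


def simulation_vectorized(data, scale=2.0):
    # One NumPy multiplication: no IPC, no per-call overhead.
    return data * scale


if __name__ == '__main__':
    data = np.random.random(2**20)
    assert np.allclose(simulation_mapped(data), simulation_vectorized(data))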
We can observe COW with a (somewhat) simplified example that allocates a large array and passes it to a spawned process for read-only processing:
import numpy as np
from multiprocessing import Process, Condition, Event
from time import sleep
import psutil


def read_arr(arr, done, stop):
    # Read-only access: sum the array, report back, then idle until
    # the parent asks us to stop.
    with done:
        S = np.sum(arr)
        print(S)
        done.notify()
    while not stop.is_set():
        sleep(1)


def main():
    # Create a large array
    print('Available before A (MiB):', psutil.virtual_memory().available / 1024 ** 2)
    input("Press Enter...")
    A = np.random.random(2**28)
    print('Available before Process (MiB):', psutil.virtual_memory().available / 1024 ** 2)
    input("Press Enter...")
    done = Condition()
    stop = Event()
    p = Process(target=read_arr, args=(A, done, stop))
    with done:
        p.start()
        done.wait()
    print('Available with Process (MiB):', psutil.virtual_memory().available / 1024 ** 2)
    input("Press Enter...")
    stop.set()
    p.join()


if __name__ == '__main__':
    main()
Output on this machine:
% python3 test.py
Available before A (MiB): 7779.25
Press Enter...
Available before Process (MiB): 5726.125
Press Enter...
134221579.355
Available with Process (MiB): 5720.79296875
Press Enter...
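The system-wide "available" figure drops by only a few MiB when the child starts, even though tools like htop will report the full array in the child's RSS as well. To see the distinction from the process side, psutil can report USS (pages unique to a process) next to RSS; a small sketch (the report helper is illustrative, not part of the script above):

import psutil


def report(pid, label):
    # memory_full_info() exposes USS (pages unique to this process)
    # alongside RSS (which also counts pages shared with other processes).
    info = psutil.Process(pid).memory_full_info()
    print('{}: RSS {:.1f} MiB, USS {:.1f} MiB'.format(
        label, info.rss / 1024 ** 2, info.uss / 1024 ** 2))


# For example, right after done.wait() in main():
# report(p.pid, 'child process')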
Now, if we replace the function read_arr with a function that modifies the array:
def mutate_arr(arr, done, stop):
    # Writing to the array forces the touched pages to be copied,
    # breaking the copy-on-write sharing.
    with done:
        arr[::4096] = 1
        S = np.sum(arr)
        print(S)
        done.notify()
    while not stop.is_set():
        sleep(1)
the results are quite different:
Available before A (MiB): 7626.12109375
Press Enter...
Available before Process (MiB): 5571.82421875
Press Enter...
134247509.654
Available with Process (MiB): 3518.453125
Press Enter...
The for-loop does indeed require more memory after each iteration, but that is to be expected: it stacks the total_results from the mapping, so it has to allocate space for a new array that holds both the old results and the new ones, and then free the now unused array of old results.
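If that growth becomes a problem, the repeated reallocation can be avoided by preallocating the result array, or by collecting the per-iteration results in a list and stacking them once at the end. A sketch with illustrative names (n_iter, size and compute_chunk are not from your script):

import numpy as np


def compute_chunk(i, size):
    # Illustrative stand-in for one iteration's worth of results.
    return np.full(size, float(i))


n_iter, size = 8, 2**20

# Option 1: preallocate once and fill row by row, no repeated copying.
total_results = np.empty((n_iter, size))
for i in range(n_iter):
    total_results[i] = compute_chunk(i, size)

# Option 2: keep a list of chunks and stack a single time at the end.
total_results = np.vstack([compute_chunk(i, size) for i in range(n_iter)])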