I have a simple piece of multiprocessing code:
from multiprocessing import Pool
import time

def worker(data):
    # Keep the process alive long enough to measure memory usage.
    time.sleep(20)

if __name__ == "__main__":
    numprocs = 10
    pool = Pool(numprocs)
    a = ['a' for i in range(1000000)]   # one large list
    b = [a + [] for i in range(100)]    # 100 shallow copies of a
    # one copy of b per process
    data1 = [b + [] for i in range(numprocs)]
    # everything in the first element, (almost) nothing in the rest
    data2 = [data1 + []] + ['1' for i in range(numprocs - 1)]
    # (almost) empty payload, to measure the overhead of forking
    data3 = [['1'] for i in range(numprocs)]
    #data = data1
    #data = data2
    data = data3
    result = pool.map(worker, data)
b is just a large list. data is a list of length numprocs that is passed to pool.map, so I expect numprocs processes to be forked and each element of data to be passed to one of them.
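As a sanity check on that expectation, a minimal sketch along these lines (tagged_worker is a hypothetical helper, not part of the code above) shows which worker pid handles each element; note that pool.map also accepts an optional chunksize argument and may batch several elements into a single task:

import os
import time
from multiprocessing import Pool

def tagged_worker(x):
    # Sleep briefly so the tasks are likely (though not guaranteed)
    # to spread across all pool processes instead of being grabbed
    # by one fast worker.
    time.sleep(0.5)
    return (x, os.getpid())

if __name__ == "__main__":
    with Pool(10) as pool:
        for item, pid in pool.map(tagged_worker, range(10)):
            print(item, pid)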
I test 3 different data objects: data1 and data2 have practically the same total size, but when using data1 each process gets a copy of the same large object, whereas when using data2 one process gets all of data1 and the others get just a '1' (essentially nothing). data3 is essentially empty and measures the basic overhead of forking the processes.
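To back up the "practically the same size" claim, one can compare the serialized sizes of the two payloads, since the pool pickles the data to send it to the workers. A minimal sketch, with the list sizes scaled down (a hypothetical 10000 instead of 1000000) so it runs quickly:

import pickle

numprocs = 10
a = ['a' for i in range(10000)]
b = [a + [] for i in range(100)]
data1 = [b + [] for i in range(numprocs)]
data2 = [data1 + []] + ['1' for i in range(numprocs - 1)]

# Pickled as a single object, the two payloads come out nearly
# identical in size: data2 only adds numprocs - 1 short strings.
print(len(pickle.dumps(data1)))
print(len(pickle.dumps(data2)))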
Problem:
The overall memory used is vastly different between data1 and data2. I measure the amount of additional memory used by the last line (the pool.map() call) and I get:

data1: ~8 GB
data2: ~0.8 GB
data3: ~0 GB
Shouldn't the data1 and data2 cases be equal, because the total amount of data passed to the children is the same? What is going on?
I measure memory usage from the Active field of /proc/meminfo on a Linux machine (MemTotal - MemFree gives the same answer).
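A minimal sketch of that measurement, assuming the standard /proc/meminfo format (read_meminfo is a hypothetical helper):

def read_meminfo(field):
    # Parse a field such as "Active", "MemTotal" or "MemFree" from
    # /proc/meminfo; the kernel reports the values in kB.
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith(field + ":"):
                return int(line.split()[1])
    raise KeyError(field)

before = read_meminfo("Active")
# ... run pool.map(worker, data) here ...
after = read_meminfo("Active")
print("additional memory: %.2f GB" % ((after - before) / 1024.0 / 1024.0))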