I'm wondering how Python's multiprocessing module starts up new processes. Here's an example:
import multiprocessing as mp

print 'start'
initial_data = initial_calculations(seed)  # takes a few seconds; seed defined elsewhere

def child(data):
    "called by mp.Pool"
    # process data with initial_data into results and return
    # (initial_data does not change)

def setup():
    "called by main process"
    # generate and return data on how to process initial_data

def final(results):
    "called by main process"
    # assemble final results and write to log file

if __name__ == '__main__':
    cpus = mp.cpu_count()
    data = setup()
    p = mp.Pool(cpus)
    try:
        results = p.map(child, data)
        p.close()
        p.join()
    except:
        p.terminate()
        p.join()
        raise
    final(results)
    print 'yay!'
I gather that on Unix the pool workers are forked from the main process, so initial_data is calculated only once and is shared by all the subprocesses the pool starts. On Windows, however, each worker is started as a fresh process, meaning initial_data is recalculated in each one. But how exactly? That is what I'm wondering about. In the example above, initial_data takes only a few seconds to compute, whereas child can take many hours, and data is a list of arguments to work on that is far longer than the number of subprocesses that can be started.
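If I understand the Windows behaviour correctly, a minimal sketch like the one below (a toy work function I made up, not my real code) should print the module-level line once per worker, because every spawned worker re-imports the module:

import multiprocessing as mp
import os

# Module level: on Windows, this runs again in every worker process
# when the worker re-imports the module.
print 'module loaded in pid %d' % os.getpid()

def work(x):
    return x * x

if __name__ == '__main__':
    pool = mp.Pool(2)
    print pool.map(work, range(4))
    pool.close()
    pool.join()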
Say cpus = 8. Does the pool create 8 processes running this code, recalculate initial_data once in each, and then run child repeatedly until data is exhausted? Or does it keep creating new processes running this code (only up to 8 at a time) until data is exhausted? That is, does the pool do something like:
chunk data into equal amounts for each process (subdata)
start each process (printing 'start' and recalculating initial_data)
then in each process:
    for i in subdata:
        child(i)
    return results
or does it do something like:
for i in data:
    start a process (printing 'start' and recalculating initial_data)
    child(i)
    if numprocesses > 8: wait
return results
?
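To check which of the two it is, I suppose I could count how many distinct worker processes ever get created. A rough sketch (the child here is a stand-in that just reports its pid, not my real workload):

import multiprocessing as mp
import os

print 'start (pid %d)' % os.getpid()  # runs once per process that loads the module

def child(i):
    # Report the worker's pid so the parent can count distinct workers.
    return os.getpid()

if __name__ == '__main__':
    p = mp.Pool(8)
    pids = p.map(child, range(100))  # far more tasks than workers
    p.close()
    p.join()
    # 8 distinct pids would mean workers are reused across tasks;
    # many more would mean new processes keep being created.
    print 'distinct workers: %d' % len(set(pids))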
The first way I would be fine with: the initial data is calculated only 8 times, and the time child takes far outweighs that. My code is currently written on the assumption that this is how it works. Making it so initial_data is calculated only once would be even better, though.
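One idea for getting closer to that: the docs say Pool accepts an initializer and initargs, which (if I read them right) run once in each worker as it starts. A sketch with placeholder functions standing in for my real setup:

import multiprocessing as mp

initial_data = None  # filled in once per worker by init()

def initial_calculations(seed):
    # Placeholder for the real few-seconds setup.
    return [seed + i for i in range(10)]

def init(seed):
    # The pool runs this exactly once in each worker it starts.
    global initial_data
    initial_data = initial_calculations(seed)

def child(x):
    # initial_data is already in place; nothing recalculated per task.
    return sum(initial_data) + x

if __name__ == '__main__':
    p = mp.Pool(8, initializer=init, initargs=(42,))
    results = p.map(child, range(100))
    p.close()
    p.join()
    print results[:5]

That wouldn't get it down to once in total, but it would make the once-per-worker behaviour explicit instead of relying on module-level import side effects.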
The second way, however, becomes a problem, as the initial data would add a significant amount of time to the whole operation. I have some ideas on how to handle it if Pool does work that way, but I want to avoid rewriting things if Python doesn't continually start up new processes.
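One thing that makes me lean towards the first behaviour: Pool (in 2.7 and later) takes a maxtasksperchild argument, which would only seem to make sense if workers normally persist across tasks. If I wanted fresh processes I'd apparently have to opt in, something like this sketch (again a pid-reporting stand-in for child):

import multiprocessing as mp
import os

def child(i):
    return os.getpid()

if __name__ == '__main__':
    # With maxtasksperchild=1 (and chunksize=1 so each "task" is one
    # item), every worker exits after a single task and gets replaced,
    # so each pid below should be distinct.
    p = mp.Pool(8, maxtasksperchild=1)
    pids = p.map(child, range(16), chunksize=1)
    p.close()
    p.join()
    print 'distinct workers: %d' % len(set(pids))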