
I'm wondering how Python's multiprocessing module starts up new processes. Here's an example:

import multiprocessing as mp

print 'start'
initial_data = initial_calculations(seed)  # placeholder: a few seconds of setup work


def child(data):
    "called by mp.Pool"
    # process data with initial_data into results and return
    # (initial_data does not change)

def setup():
    "called by main process"
    # generate and return data on how to process initial_data


def final(results):
    "called by main process"
    # assemble final results and write to log file


if __name__ == '__main__':

    cpus = mp.cpu_count()

    data = setup()

    p = mp.Pool(cpus)

    try:
        results = p.map(child, data)
        p.close()
        p.join()

    except:
        p.terminate()
        p.join()
        raise

    final(results)
    print 'yay!'

I gather that Unix forks this code, so initial_data is calculated only once and is shared by all the subprocesses the pool starts. On Windows, however, each subprocess is started fresh, meaning initial_data is recalculated in each one. But how, exactly? That is what I'm wondering about. In the example above, initial_data takes only a few seconds to compute, whereas child can take many hours, and data would be a list of arguments to work on that is far longer than the number of subprocesses that can be started.
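
To spell out the part I believe I understand (my assumption, not something I've verified): under Windows-style spawning, each worker re-imports the module, so top-level code like the initial_data line runs again in every worker, while the guarded block runs only in the parent. A stripped-down sketch:

import multiprocessing as mp

# Top-level code: under spawn (the Windows behaviour) this runs again
# in every worker, because each worker re-imports the module.
initial_data = [1, 2, 3]  # stand-in for initial_calculations(seed)

def child(i):
    return initial_data[i % len(initial_data)]

if __name__ == '__main__':
    # Only the parent runs this block; workers import the module
    # rather than running it as a script, so the guard keeps them
    # from creating pools of their own.
    p = mp.Pool(2)
    print p.map(child, range(6))
    p.close()
    p.join()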

Say that cpus = 8. Does pool create 8 copies of this code, recalculating initial_data in each, and then run child repeatedly until data is exhausted? Or does it keep creating new copies of the code (only up to 8 at a time) until data is exhausted? That is, does pool do something like:

chunk data into equal amounts for each process (subdata)
start each process (printing 'start' and recalculating initial_data)
then in each process:
    for i in subdata:
        child(i)
    return results

or does it do something like:

for i in data:
    start process (printing start and recalculating initial_data)
    child(i)
    if numprocesses > 8: wait
return results

?

The first way I would be fine with: initial_data is calculated only 8 times, and the time child takes far outweighs that. My code is currently written on the assumption that it works this way, though having initial_data calculated only once would be even better.
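
On the "calculated once" point: my understanding is that Pool's initializer/initargs arguments would run the setup exactly once per worker process on either platform. A minimal sketch of what I have in mind, with initial_calculations and the seed value as trivial stand-ins for my real setup:

import multiprocessing as mp

def initial_calculations(seed):
    return seed * 2  # trivial stand-in for the few-seconds setup

initial_data = None  # filled in once per worker by init_worker

def init_worker(seed):
    # Runs exactly once in each worker process when the pool starts
    # it, whether the platform forks or spawns.
    global initial_data
    initial_data = initial_calculations(seed)

def child(arg):
    return arg + initial_data  # uses the per-worker copy

if __name__ == '__main__':
    p = mp.Pool(mp.cpu_count(), initializer=init_worker, initargs=(42,))
    results = p.map(child, range(10))
    p.close()
    p.join()
    print results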

The second way, however, becomes a problem, as the initial setup starts to add a significant amount of time to the whole operation. I have some ideas on how to handle it if Pool does work this way, but I want to avoid rewriting my code if Python doesn't continually start up new processes.

  • Seems like you could answer this yourself with a few `print` statements or calls. – martineau Sep 20 '15 at 16:11
  • @martineau, I can't because subprocesses don't print to stdout. I have no idea where they go. – Status Sep 20 '15 at 16:19
  • To avoid the guesswork, don't use globals in the main process: write an explicit initializer (pass it to the Pool constructor) and inherit the necessary data instead. You could use `mp.set_start_method('spawn')` even on Unix (for testing). – jfs Sep 20 '15 at 16:21

1 Answer


I spent way too much time thinking about the question and about which way I should write my code, and never thought of just writing some test code to see which way is true. martineau's comment made me slap my forehead when I realized I should generate files instead of using print statements (which don't go to stdout for the subprocesses).

The code below does this: it generates a file every time it is started. If map started the script up fresh every time it called the function, it would generate thousands of files. If instead it starts the code once per worker and then loops, calling the function repeatedly, it will only generate a few files.

As it turns out, it only generates a few files, so the code is started up only the fewest times needed. (A note on when that changes follows the code.)

import multiprocessing as mp
import os

def gen_fil():
    # Create a new, uniquely numbered file each time this module is
    # started, so we can count how many times that happens.
    folder = r'C:\mptest'  # raw string so the backslash isn't an escape
    if not os.path.exists(folder):
        os.makedirs(folder)
    name = os.path.join(folder, '000.txt')
    i = 0
    while os.path.exists(name):
        i += 1
        name = os.path.join(folder, '{:03d}.txt'.format(i))
    open(name, 'a').close()  # just touch the file

gen_fil()  # module level: runs again in every worker that re-imports this file


def worker(n):
    return n*n  # trivial stand-in for the real work


if __name__=='__main__':

    cpus = mp.cpu_count()

    p = mp.Pool(cpus)
    data = range(int(5e7))  # far more tasks than workers
    print 'start'
    print 'num cpus', cpus
    try:
        results = p.map(worker, data)
        p.close()
        p.join()
    except:
        p.terminate()
        p.join()
        raise
    print len(results)
    print 'end'
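
One follow-up, based on the documented Pool arguments rather than on the test above: by default the pool reuses its workers for the whole run, but the maxtasksperchild argument (added in Python 2.7) tells it to replace each worker after that many tasks. With it set, the test should generate far more files, roughly one per dispatched chunk of tasks rather than one per worker:

    p = mp.Pool(cpus, maxtasksperchild=1)  # replace each worker after one task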