
This (enormously simplified) example works fine (Python 2.6.6, Debian Squeeze):

from multiprocessing import Pool
import numpy as np

src=None

def process(row):
    return np.sum(src[row])

def main():
    global src
    src=np.ones((100,100))

    pool=Pool(processes=16)
    rows=pool.map(process,range(100))
    print rows

if __name__ == "__main__":
    main()

However, after years of being taught "global state bad!!!", all my instincts are telling me I would really rather be writing something closer to:

from multiprocessing import Pool
import numpy as np

def main():
    src=np.ones((100,100))

    def process(row):
        return np.sum(src[row])

    pool=Pool(processes=16)
    rows=pool.map(process,range(100))
    print rows

if __name__ == "__main__":
    main()

but of course that doesn't work (it hangs, unable to pickle the locally defined process function).

The example here is trivial, but by the time you add multiple "process" functions, each of which depends on multiple additional inputs... well, it all becomes a bit reminiscent of something written in BASIC 30 years ago. Trying to use classes to at least aggregate the state with the appropriate functions seems an obvious solution, but it doesn't seem to be that easy in practice.
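
For instance, the naive class-based attempt (just a sketch; the Processor name is purely illustrative) trips over the very same pickling problem, since Python 2 can't pickle bound methods:

from multiprocessing import Pool
import numpy as np

class Processor(object):
    def __init__(self, src):
        self.src = src
    def process(self, row):
        return np.sum(self.src[row])

def main():
    proc = Processor(np.ones((100,100)))

    pool = Pool(processes=16)
    # Hangs/fails under Python 2: each task has to pickle the bound
    # method proc.process, and instancemethod objects aren't picklable.
    rows = pool.map(proc.process, range(100))
    print rows

if __name__ == "__main__":
    main()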

Is there some recommended pattern or style for using multiprocessing.Pool which avoids the proliferation of global state needed to support each function I want to parallel map over?

How do experienced "multiprocessing pros" deal with this?

Update: Note that I'm actually interested in processing much bigger arrays, so variations on the above which pickle src each call/iteration aren't nearly as good as ones which fork it into the pool's worker processes.
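
For concreteness, the kind of "pickle src each call" variation I mean looks roughly like this sketch (the izip/repeat spelling is my own); it runs, but the array reaches the workers inside the pickled task payload instead of being inherited via fork:

from itertools import izip, repeat
from multiprocessing import Pool
import numpy as np

def process(args):
    src, row = args          # src arrives as part of the task payload
    return np.sum(src[row])

def main():
    src = np.ones((100,100))

    pool = Pool(processes=16)
    # every (src, row) pair is pickled and shipped over a pipe to a worker
    rows = pool.map(process, izip(repeat(src), range(100)))
    print rows

if __name__ == "__main__":
    main()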

timday
  • I'm not an experienced multiprocessing pro or anything, but let me ask you: why can't you simply do pool.map(process,product([src],range(100))) and change the process function to accept both variables as args? Is this highly inefficient too? – luke14free Apr 14 '12 at 09:36
  • @luke14free: Yes, that'd pickle the src array over for every call, and I'm actually interested in much bigger data/arrays than those in the sample code above, so not ideal. With a process pool, whatever state is set up at the point the pool is created is forked into the worker processes and available for them to read "for free". The idea would help avoid putting more minor "control variable" state (e.g. flags) into globals though, thanks. – timday Apr 14 '12 at 09:48

1 Answer


You could always pass a callable object like this; the object can then contain the shared state:

from multiprocessing import Pool
import numpy as np

class RowProcessor(object):
    def __init__(self, src):
        self.__src = src
    def __call__(self, row):
        return np.sum(self.__src[row])

def main():
    src=np.ones((100,100))
    p = RowProcessor(src)

    pool=Pool(processes=16)
    rows = pool.map(p, range(100))
    print rows

if __name__ == "__main__":
    main()
KillianDS
  • Yup works very nicely thanks; bye bye globals. Normally I'd wait longer before accepting a solution to see if anything else turns up but this is perfect. I'd tried classes for this problem before and not had any success; seems that callable makes all the difference. – timday Apr 14 '12 at 10:31
  • Wouldn't it pickle the callable and you'd be back to square one? – abc def foo bar Jul 15 '15 at 23:48
  • @abc: Not if you create the callable before you create the pool. That way the callable just gets forked into the pool's worker processes (which is much cheaper - CPU TLB tricks - and more efficient than pickling it and unpickling it and creating a copy of the object in every process). It's just the callable's function arguments which get pickled. – timday Apr 15 '16 at 20:04