I'm new to parallel programming. My task is to analyze hundreds of data files. Each file is nearly 300MB and can be sliced into numerous slices. My machine is a 4-core PC, and I want to get the result for each file as soon as possible.
The analysis of each data file consists of two steps. First, read the data into memory and slice it into pieces, which is IO-intensive. Then, do lots of computation on the slices of that file, which is CPU-intensive.
So my strategy is to group the files in groups of 4. For each group, first read the data of all 4 files into memory with 4 processes on the 4 cores. The code looks like this:
    from multiprocessing import Pool

    with Pool(processes=4) as pool:
        data_list = pool.map(read_and_slice, files)  # len(files) == 4
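For reference, read_and_slice is roughly like the sketch below; the Data wrapper and the 1 MiB chunk size are just placeholders for my real slicing logic:

    from dataclasses import dataclass

    @dataclass
    class Data:
        slices: list  # list of byte chunks

    def read_and_slice(path):
        # Read one ~300MB file (IO-intensive) and cut it into fixed-size slices.
        chunk = 1024 * 1024  # 1 MiB per slice; placeholder size
        with open(path, 'rb') as f:
            raw = f.read()
        return Data(slices=[raw[i:i + chunk] for i in range(0, len(raw), chunk)])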
Then, for each data in data_list, do the computation work with 4 processes:
    for data in data_list:  # I want the result of each data file asap
        with Pool(processes=4) as pool:
            result_list = pool.map(compute, data.slices)  # analyze each slice of data
        analyze(result_list)  # analyze the per-slice results, e.g. take the average
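Here, compute and analyze are stand-ins for my real functions; their bodies below are made up, but they have this shape:

    def compute(slice_bytes):
        # Stand-in for the CPU-intensive per-slice computation.
        return sum(slice_bytes) / len(slice_bytes)

    def analyze(results):
        # Stand-in for aggregating the per-slice results, e.g. taking the average.
        print(sum(results) / len(results))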
Then I move on to the next group.
The problem is that over the whole run across hundreds of files, the pool is recreated many times. How can I avoid the overhead of recreating pools and processes? Is there any substantial memory overhead in my code? And is there a better way to minimize the total time?
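One idea I considered is hoisting a single pool out of all the loops and reusing it, roughly like this (chunks is a little helper I wrote; I'm not sure this is the right structure):

    from multiprocessing import Pool

    def chunks(seq, n):
        # Yield successive groups of n items from seq.
        for i in range(0, len(seq), n):
            yield seq[i:i + n]

    with Pool(processes=4) as pool:  # one pool for the whole run
        for group in chunks(files, 4):
            data_list = pool.map(read_and_slice, group)       # IO phase for the group
            for data in data_list:
                result_list = pool.map(compute, data.slices)  # CPU phase per file
                analyze(result_list)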
Thanks!