
I have attempted to recreate the essence of my "real-world" problem using the small reproducible example below. This example attempts to leverage functionality I found here. The real-world example takes 16 days using a single core on my laptop, which has 16 cores, so I'm hoping to cut my runtime down to one or two days by using the majority of those cores. First, however, I need to understand what I'm doing wrong with the small example below.

The example starts by setting up a list of tuples called all_combos. The idea is to then pass each tuple within all_combos to the function do_one_run(). My goal is to parallelize do_one_run() using multiprocessing. Unfortunately, the small reproducible example below kicks back error messages that I'm unable to resolve. My suspicion is that I've misunderstood how the Pool works, in particular how each tuple of parameters maps to the arguments of do_one_run(), or perhaps I've misunderstood how to collect the output of do_one_run(), or more likely both?

Any insights very much welcome!

import random
import numpy as np
import multiprocessing as mp

slns = {}

var1 = [5, 6, 7]
var2 = [2, 3, 4]
var3 = [10, 9, 8]

all_combos = []
key = 0
for v1 in var1:
    for v2 in var2:
        for v3 in var3:
            all_combos.append([key, v1, v2, v3])
            key += 1

def example_func(v1_passed, v2_passed, v3_passed):
    tmp = np.random.random((v1_passed, v2_passed, v3_passed))*100
    my_arr = tmp.astype(int)
    piece_arr = my_arr[1,:,1:3]
    return piece_arr


def do_one_run(key, v1_passed, v2_passed, v3_passed):
    results = example_func(v1_passed, v2_passed, v3_passed)
    slns.update({key: [v1_passed, v2_passed, v3_passed, results]})

pool = mp.Pool(4)  # 4 cores devoted to job?
result = pool.starmap(do_one_run, all_combos)
user2256085
  • You can return a key-value *tuple* from `do_one_run()` (`return key, [v1_passed, v2_passed, v3_passed, results]`) and pass the return of `pool.starmap()` into a `dict()` constructor. – Olvin Roght Aug 24 '21 at 17:43
  • Also, you don't need to form `all_combos` and define the proxy function `do_one_run()`; you can use [`itertools.product()`](https://docs.python.org/3/library/itertools.html#itertools.product) in combination with [`enumerate()`](https://docs.python.org/3/library/functions.html#enumerate) and form the dict dynamically: `results = {i: v for i, v in enumerate(pool.starmap(example_func, product(var1, var2, var3)))}` – Olvin Roght Aug 24 '21 at 18:19
  • @OlvinRoght Could I also trouble you for how I might get a status update? I tried implementing this post: `https://stackoverflow.com/questions/34827250/how-to-keep-track-of-status-with-multiprocessing-and-pool-map`, but they are using `apply_async`. I was unsuccessful in my attempt to do something similar with `starmap`. For the real-world problem that this small example is attempting to prototype, a status update would be a real help since it is likely to take a minimum of 24 hrs even with all 16 cores on my machine – user2256085 Aug 24 '21 at 19:15
  • You should take a look on [`Lock`](https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Lock) which could allow you to modify some object and save stats from all processes safely. – Olvin Roght Aug 25 '21 at 09:08

2 Answers


You can't share a variable like slns through multiprocessing: each worker process gets its own copy, so updates made in the workers never reach the parent. You have to collect the return values from the do_one_run function instead:

import random
import numpy as np
import multiprocessing as mp

# slns = {}  <- Remove this line

...

# Return result
def do_one_run(key, v1_passed, v2_passed, v3_passed):
    results = example_func(v1_passed, v2_passed, v3_passed)
    return key, [v1_passed, v2_passed, v3_passed, results]

if __name__ == '__main__':
    with mp.Pool(4) as pool:
        results = pool.starmap(do_one_run, all_combos)  # <- Collect results
    result = dict(results)  # <- Merge the (key, value) pairs into one dict
>>> result

{0: [5,
  2,
  10,
  array([[77, 90],
         [34, 28]])],
 1: [5,
  2,
  9,
  array([[64, 43],
         [45, 53]])],
 2: [5,
  2,
  8,
  array([[ 8, 78],
         [39,  3]])],
 ...
}
Corralien
  • It will produce a list of dictionaries. To get the results as one dictionary with multiple keys, you can return a pair and pass the result to `dict()`. – Olvin Roght Aug 24 '21 at 17:46
  • @OlvinRoght. I fixed it. Can you check, please? ty. – Corralien Aug 24 '21 at 17:56
  • You fixed it, but in an overcomplicated and inefficient way. Just change your return to `return key, [v1_passed, v2_passed, v3_passed, results]` and use `result = dict(pool.starmap(do_one_run, all_combos))` – Olvin Roght Aug 24 '21 at 17:58
  • @Corralien @OlvinRoght thanks for getting me straightened out. If I may ask for one more insight...I can run the script as a whole from a cmd line (on Windows, running python 3.9.1) by issuing the following command: `python stckovflw_example.py`; however, when I instead paste all the lines of python script after activating python (i.e., typing `python` on the windows cmd line), it errors out. Any idea why it runs when running the script as a whole, but not when pasting all the lines of script into a cmd window after activating python? Both use the same version of python. – user2256085 Aug 24 '21 at 18:18
  • @user2256085, I don't think that using multiprocessing from the command line is a good idea. You haven't provided the error message, but multiprocessing re-runs the same script in several processes, and I don't think that will work in the Python console, since the file with the code you typed doesn't exist. – Olvin Roght Aug 24 '21 at 18:23

Change your last two lines to this:

if __name__ == '__main__':
  mp.Pool().starmap(do_one_run, all_combos)
  print('Done') # So you know when it's finished

You may also find this discussion helpful: python multiprocessing on windows, if __name__ == "__main__"

Also note that Pool() is constructed in this example with no arguments. That way, the underlying implementation will take best advantage of the CPU architecture on which it's running.

  • `do_one_run` doesn't return any value so `result` is `None`. – Corralien Aug 24 '21 at 17:49
  • You're right. I got distracted by the post from @Corralien. I will edit it appropriately –  Aug 24 '21 at 17:50
  • Not sure about that @OlvinRoght. I made that change and it ran perfectly on my machine –  Aug 24 '21 at 17:53
  • @OlvinRoght No. Without that check I get the ubiquitous "An attempt has been made to start a new process before the current process has finished its bootstrapping phase". –  Aug 24 '21 at 17:55