
I have a function which accepts two inputs provided by itertools.combinations and outputs a solution. The two inputs should be stored as a tuple forming the key in the dict, while the result is the value.

I can pool this and get all of the results as a list, which I can then insert into a dictionary one by one, but this seems inefficient. Is there a way to get the results as each job finishes and add them directly to the dict?

Essentially, I have the code below:

import itertools

all_solutions = {}
for start, goal in itertools.combinations(graph, 2):
    all_solutions[(start, goal)] = search(graph, start, goal)

I am trying to parallelize it as follows:

import itertools
import multiprocessing

all_solutions = {}
manager = multiprocessing.Manager()
graph_pool = manager.dict(graph)
pool = multiprocessing.Pool()
results = pool.starmap(search, zip(itertools.repeat(graph_pool),
                                   itertools.combinations(graph, 2)))
for i, start_goal in enumerate(itertools.combinations(graph, 2)):
    start, goal = start_goal[0], start_goal[1]
    all_solutions[(start, goal)] = results[i]

This actually works, but it iterates over the combinations twice: once in the pool, and once more to write the results into the dict (not to mention the clunky tuple unpacking).

Teknophilia
  • FYI, no clunky tuple unpacking needed. You just need more parens: `for i, (start, goal) in enumerate(itertools.combinations(graph, 2)):` unpacks directly into `start` and `goal` (the parens are *not* optional in this case, because you need them to make it clear that nested unpacking is going on). – ShadowRanger Feb 14 '18 at 01:36
  • @ShadowRanger, ah that's great to know; thanks! – Teknophilia Feb 14 '18 at 01:39
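For reference, a minimal sketch applying that comment to the loop from the question, using the same graph, results, and all_solutions names:

# enumerate yields (index, (start, goal)); the inner parentheses are required
# so Python knows to unpack the nested 2-tuple.
for i, (start, goal) in enumerate(itertools.combinations(graph, 2)):
    all_solutions[(start, goal)] = results[i]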

1 Answer


This is possible; you just need to switch to a lazy mapping function (not map or starmap, which have to finish computing all the results before you can begin using any of them):

import itertools
import multiprocessing
from functools import partial
from itertools import tee

manager = multiprocessing.Manager()
graph_pool = manager.dict(graph)
pool = multiprocessing.Pool()

# Since you're processing in order and in parallel, tee might help a little
# by only generating the dict keys/search arguments once. That said, 
# combinations of n choose 2 are fairly cheap; the overhead of tee's caching
# might overwhelm the cost of just generating the combinations twice
startgoals1, startgoals2 = tee(itertools.combinations(graph, 2))

# Use partial binding of search with graph_pool to be able to use imap
# without a wrapper function; using imap lets us consume results as they become
# available, so the tee-ed generators don't store too many temporaries
results = pool.imap(partial(search, graph_pool), startgoals2)

# Efficiently create the dict from the start/goal pairs and the results of the search
# This line is eager, so it won't complete until all the results are generated, but
# it will be consuming the results as they become available in parallel with
# calculating the results
all_solutions = dict(zip(startgoals1, results))
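
For completeness, a self-contained sketch of the same pattern with a toy graph and a stand-in search (both hypothetical, just to make the snippet runnable end to end), including the __main__ guard that multiprocessing needs when it spawns worker processes:

import itertools
import multiprocessing
from functools import partial
from itertools import tee

def search(graph, start_goal):
    # Stand-in for the real search; receives the (start, goal) pair as a single
    # tuple, matching how imap feeds each item from combinations().
    start, goal = start_goal
    return goal in graph[start]  # dummy "solution": is goal a direct neighbour?

def main():
    graph = {'a': ['b'], 'b': ['a', 'c'], 'c': ['b']}
    manager = multiprocessing.Manager()
    graph_pool = manager.dict(graph)
    with multiprocessing.Pool() as pool:
        startgoals1, startgoals2 = tee(itertools.combinations(graph, 2))
        results = pool.imap(partial(search, graph_pool), startgoals2)
        all_solutions = dict(zip(startgoals1, results))
    print(all_solutions)

if __name__ == '__main__':  # guard so spawned workers can import this module safely
    main()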
ShadowRanger
  • Wow, that's exactly it. "partial" is exactly what I was looking for, thank you! – Teknophilia Feb 14 '18 at 01:57
  • @Teknophilia: Yar. A warning: `partial` will be slow if the bound arguments are expensive to `pickle`. In this case, it's a managed `dict` intended for `multiprocessing`, so the pickling cost is just the information necessary to reconnect to the manager (fairly cheap), not copying all the data, but if the arguments are big/expensive to pickle, [`partial` is not the way to go](https://stackoverflow.com/q/35062087/364696). [My answer to that question](https://stackoverflow.com/a/35062539/364696) explains why a global wrapper function with default arguments can be a better idea in such a case. – ShadowRanger Feb 14 '18 at 02:02
  • thanks for the heads-up. The managed dict itself is rather large, but if I understand correctly, it's just being reconnected - not duplicated - so it shouldn't be a problem. I'm reading up on this and your answer is a great help in understanding. – Teknophilia Feb 14 '18 at 02:05
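
As a rough sketch of the alternative described in the last comment (a module-level wrapper with the large object bound as a default argument, instead of partial), with hypothetical graph and search stand-ins:

import itertools
import multiprocessing
from itertools import tee

graph = {'a': ['b'], 'b': ['a', 'c'], 'c': ['b']}  # stand-in for the real graph

def search(graph, start, goal):
    # Stand-in for the real search function.
    return goal in graph[start]

def search_pair(start_goal, graph=graph):
    # The big object is bound as a default argument of a top-level function,
    # so each task only pickles start_goal; the function itself is sent to the
    # workers by name rather than by value.
    start, goal = start_goal
    return search(graph, start, goal)

if __name__ == '__main__':
    with multiprocessing.Pool() as pool:
        pairs1, pairs2 = tee(itertools.combinations(graph, 2))
        all_solutions = dict(zip(pairs1, pool.imap(search_pair, pairs2)))
    print(all_solutions)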