I have an embarrassingly parallel for loop in Python (to be repeated n times), with each iteration performing a complex task and returning a mix of numpy arrays and dicts (so not a single number to be filled into an array; think of each return as a complex bundle for now). The repetitions don't need to run in any particular order; I just need to be able to identify each of the n iterations uniquely (e.g. to save the results of each repetition independently). In fact, they don't even need to be identified by an index/counter, just by something unique, since they don't need to be ordered (I can easily fill the results back into a bigger array).
To give a more concrete example, I would like to parallelize the following task:
def do_complex_task(complex_input1, input2, input3, input_n):
    "all important computation done here - independent of i or n"
    inner_result1, inner_result2 = np.zeros(100), np.zeros(100)
    for smaller_input in complex_input1:
        inner_result1 = do_another_complex_task(smaller_input, input2, input3, input_n)
        inner_result2 = do_second_complex_task(smaller_input, input2, input3, input_n)
    # do some more to produce a few more essential results
    dict_result = blah()

    unique_identifier = get_unique_identifier_for_this_thread()  # I don't know how
    # save results for each repetition independently before returning,
    # instead of waiting for the full computation to finish, which can take a while
    out_path = os.path.join(out_dir, 'repetition_{}.pkl'.format(unique_identifier))

    return inner_result1, inner_result2, inner_result_n, dict_result
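For the get_unique_identifier_for_this_thread() placeholder, the simplest thing I can think of is to pass the repetition index into the worker explicitly; failing that, uuid.uuid4() is unique across processes. A rough sketch (save_repetition is a made-up helper name, not something I already have):

import os
import pickle
import uuid

def save_repetition(results, out_dir, rep_id=None):
    """Persist one repetition's results to its own file."""
    if rep_id is None:
        # no index was passed in, so fall back to a random UUID,
        # which is unique across worker processes/threads
        rep_id = uuid.uuid4().hex
    out_path = os.path.join(out_dir, 'repetition_{}.pkl'.format(rep_id))
    with open(out_path, 'wb') as pf:
        pickle.dump(results, pf)
    return out_path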
def main_compute():
    "main method to run the loop"
    n = 256  # ideally any number, but multiples of 4 make for even parallelization
    result1 = np.zeros([n, 100])
    result2 = np.zeros([n, 100])
    result_n = np.zeros([n, 100])
    dict_result = [None] * n  # pre-allocated so index i can be assigned directly

    # this for loop does not need to run in any particular order (range(n) is an illustration),
    # although this order would be ideal, as it makes it easy to populate results into a bigger array
    for i in range(n):
        # this computation has nothing to do with i or n!
        result1[i, :], result2[i, :], result_n[i, :], dict_result[i] = \
            do_complex_task(complex_input1, input2, input3, input_n)

    # I need to parallelize the above loop to speed up stupidly parallel processing.

if __name__ == '__main__':
    pass
I've read reasonably widely, and it is not clear which strategy would be the smartest and easiest, without reliability issues.
Also, complex_input1 can be large, so I'd prefer not to pay a lot of I/O/pickling overhead shipping it to the workers; one way around that is sketched right below.
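One way I understand to avoid re-pickling the big inputs for every task (an assumption on my part; I haven't benchmarked it) is to ship them to each worker exactly once through a Pool initializer, so that each task only receives the small repetition index. _shared, _init_worker, _one_rep and run_parallel are hypothetical names, not part of my current code:

from multiprocessing import Pool

_shared = {}

def _init_worker(complex_input1, input2, input3, input_n):
    # runs once per worker process: the large inputs are pickled once
    # per worker here instead of once per task
    _shared['args'] = (complex_input1, input2, input3, input_n)

def _one_rep(i):
    # i only identifies the repetition; the computation itself ignores it
    return i, do_complex_task(*_shared['args'])

def run_parallel(n, num_procs, complex_input1, input2, input3, input_n):
    with Pool(processes=num_procs, initializer=_init_worker,
              initargs=(complex_input1, input2, input3, input_n)) as pool:
        return pool.map(_one_rep, range(n))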
I can certainly return a single list (with all the complex parts) from each repetition, append it to a master list, and later assemble everything into the format I like (rectangular arrays, etc.). This can be done easily with joblib, for example (rough sketch below). However, I am trying to learn from you all what a good solution looks like.
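The joblib version I have in mind would be roughly the following; one_repetition is just a thin wrapper I'd write so the index travels with the result, and n_jobs=4 is arbitrary:

from joblib import Parallel, delayed
import numpy as np

def one_repetition(i, complex_input1, input2, input3, input_n):
    # i only tags the output; the computation does not depend on it
    r1, r2, rn, d = do_complex_task(complex_input1, input2, input3, input_n)
    return i, r1, r2, rn, d

outputs = Parallel(n_jobs=4)(
    delayed(one_repetition)(i, complex_input1, input2, input3, input_n)
    for i in range(n))

# reassemble into rectangular arrays afterwards, using the returned index
result1, result2, result_n = (np.zeros([n, 100]) for _ in range(3))
dict_results = [None] * n
for i, r1, r2, rn, d in outputs:
    result1[i, :], result2[i, :], result_n[i, :], dict_results[i] = r1, r2, rn, d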
EDIT: I think I am settling on the following solution. Let me know what could go wrong with it, or how I can improve it further in terms of speed, avoiding side effects, etc. After a few unstructured trials on my laptop, it is not clear whether it produces any real speedup.
from functools import partial
from multiprocessing import Pool, Manager

chunk_size = int(np.ceil(num_repetitions / num_procs))

with Manager() as proxy_manager:
    shared_inputs = proxy_manager.list([complex_input1, input2, another, blah])
    partial_func_holdout = partial(key_func_doing_work, *shared_inputs)

    with Pool(processes=num_procs) as pool:
        results = pool.map(partial_func_holdout, range(num_repetitions), chunk_size)
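For clarity, this is the signature I'm assuming for key_func_doing_work (not shown above): since partial() binds the shared inputs first, the value coming from range(num_repetitions) arrives as the last positional argument, and it can double as the unique identifier for saving each repetition:

import os
import pickle

def key_func_doing_work(complex_input1, input2, another, blah, rep_index):
    # rep_index comes from pool.map over range(num_repetitions);
    # everything before it was already bound by partial()
    results = do_complex_task(complex_input1, input2, another, blah)
    out_path = os.path.join(out_dir, 'repetition_{}.pkl'.format(rep_index))
    with open(out_path, 'wb') as pf:
        pickle.dump(results, pf)
    return results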