I have an embarrassingly parallel for loop in Python (to be repeated n times), with each iteration performing a complex task and returning a mix of numpy arrays and dicts (so not a single number to be filled into an array; think of each return as a complex bundle for now). The repetitions don't need to run in any particular order; I just need to be able to identify each of the n iterations uniquely (e.g. to save the results of each repetition independently). In fact, they don't even need to be identified by an index/counter, just by something unique, since they don't need to be ordered (I can easily fill the results back into a bigger array).
To give a more concrete example, I would like to parallelize the following task:
def do_complex_task(complex_input1, input2, input3, input_n):
    "all important computation done here - independent of i or n"
    inner_result1, inner_result2 = np.zeros(100), np.zeros(100)
    for smaller_input in complex_input1:
        inner_result1 = do_another_complex_task(smaller_input, input2, input3, input_n)
        inner_result2 = do_second_complex_task(smaller_input, input2, input3, input_n)
    # do some more to produce a few more essential results
    dict_result = blah()

    unique_identifier = get_unique_identifier_for_this_thread()  # I don't know how
    # save results for each repetition independently before returning,
    # instead of waiting for the full computation to finish, which can take a while
    out_path = os.path.join(out_dir, 'repetition_{}.pkl'.format(unique_identifier))

    return inner_result1, inner_result2, inner_result_n, dict_result
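For the get_unique_identifier_for_this_thread() placeholder, the simplest thing I can think of is to pass the repetition index into the worker explicitly; failing that, uuid.uuid4() is unique across processes. A rough sketch (save_repetition is a made-up helper name, not something I already have):

import os
import pickle
import uuid

def save_repetition(results, out_dir, rep_id=None):
    """Persist one repetition's results to its own file."""
    if rep_id is None:
        # no index was passed in, so fall back to a random UUID,
        # which is unique across worker processes/threads
        rep_id = uuid.uuid4().hex
    out_path = os.path.join(out_dir, 'repetition_{}.pkl'.format(rep_id))
    with open(out_path, 'wb') as pf:
        pickle.dump(results, pf)
    return out_path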
def main_compute():
    "main method to run the loop"
    n = 256  # ideally any number, but multiples of 4 make for even parallelization
    result1 = np.zeros([n, 100])
    result2 = np.zeros([n, 100])
    result_n = np.zeros([n, 100])
    dict_result = [None] * n  # pre-allocated so index i can be assigned directly

    # this for loop does not need to run in any particular order (range(n) is an illustration),
    # although this order would be ideal, as it makes it easy to populate results into a bigger array
    for i in range(n):
        # this computation has nothing to do with i or n!
        result1[i, :], result2[i, :], result_n[i, :], dict_result[i] = \
            do_complex_task(complex_input1, input2, input3, input_n)

    # I need to parallelize the above loop to speed up stupidly parallel processing.

if __name__ == '__main__':
    pass
I've read reasonably widely, and it is not clear which strategy would be the smartest and easiest, without reliability issues.
Also, complex_input1 can be large, so I'd prefer not to pay a lot of I/O/pickling overhead shipping it to the workers; one way around that is sketched right below.
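One way I understand to avoid re-pickling the big inputs for every task (an assumption on my part; I haven't benchmarked it) is to ship them to each worker exactly once through a Pool initializer, so that each task only receives the small repetition index. _shared, _init_worker, _one_rep and run_parallel are hypothetical names, not part of my current code:

from multiprocessing import Pool

_shared = {}

def _init_worker(complex_input1, input2, input3, input_n):
    # runs once per worker process: the large inputs are pickled once
    # per worker here instead of once per task
    _shared['args'] = (complex_input1, input2, input3, input_n)

def _one_rep(i):
    # i only identifies the repetition; the computation itself ignores it
    return i, do_complex_task(*_shared['args'])

def run_parallel(n, num_procs, complex_input1, input2, input3, input_n):
    with Pool(processes=num_procs, initializer=_init_worker,
              initargs=(complex_input1, input2, input3, input_n)) as pool:
        return pool.map(_one_rep, range(n))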
I can certainly return a single list (with all the complex parts) from each repetition, append it to a master list, and later assemble everything into the format I like (rectangular arrays, etc.). This can be done easily with joblib, for example (rough sketch below). However, I am trying to learn from you all what a good solution looks like.
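The joblib version I have in mind would be roughly the following; one_repetition is just a thin wrapper I'd write so the index travels with the result, and n_jobs=4 is arbitrary:

from joblib import Parallel, delayed
import numpy as np

def one_repetition(i, complex_input1, input2, input3, input_n):
    # i only tags the output; the computation does not depend on it
    r1, r2, rn, d = do_complex_task(complex_input1, input2, input3, input_n)
    return i, r1, r2, rn, d

outputs = Parallel(n_jobs=4)(
    delayed(one_repetition)(i, complex_input1, input2, input3, input_n)
    for i in range(n))

# reassemble into rectangular arrays afterwards, using the returned index
result1, result2, result_n = (np.zeros([n, 100]) for _ in range(3))
dict_results = [None] * n
for i, r1, r2, rn, d in outputs:
    result1[i, :], result2[i, :], result_n[i, :], dict_results[i] = r1, r2, rn, d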
EDIT: I think I am settling on the following solution. Let me know what could go wrong with it, or how I can improve it further in terms of speed, avoiding side effects, etc. After a few unstructured trials on my laptop, it is not clear whether it produces any real speedup.
from functools import partial
from multiprocessing import Pool, Manager

chunk_size = int(np.ceil(num_repetitions / num_procs))

with Manager() as proxy_manager:
    shared_inputs = proxy_manager.list([complex_input1, input2, another, blah])
    partial_func_holdout = partial(key_func_doing_work, *shared_inputs)

    with Pool(processes=num_procs) as pool:
        results = pool.map(partial_func_holdout, range(num_repetitions), chunk_size)
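For clarity, this is the signature I'm assuming for key_func_doing_work (not shown above): since partial() binds the shared inputs first, the value coming from range(num_repetitions) arrives as the last positional argument, and it can double as the unique identifier for saving each repetition:

import os
import pickle

def key_func_doing_work(complex_input1, input2, another, blah, rep_index):
    # rep_index comes from pool.map over range(num_repetitions);
    # everything before it was already bound by partial()
    results = do_complex_task(complex_input1, input2, another, blah)
    out_path = os.path.join(out_dir, 'repetition_{}.pkl'.format(rep_index))
    with open(out_path, 'wb') as pf:
        pickle.dump(results, pf)
    return results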