I am working on a problem that allows for some rather unproblematic parallelisation, but I am having difficulty figuring out which parallelising mechanisms in Python are suitable. I am working with Python 3.9 on macOS.
My pipeline is:
- `get_common_input()` acquires some data in a way that is not easily parallelisable. If it matters, its return value `common_input_1` is a list of lists of integers.
- `parallel_computation_1()` gets `common_input_1` and an individual input from a list `individual_inputs`. The common input is only read.
- `common_input_2` is more or less the collected outputs of `parallel_computation_1()`.
- `parallel_computation_2()` then again gets `common_input_2` as read-only input, plus some individual input.
I could do the following:
```python
import multiprocessing

common_input_1 = None
common_input_2 = None

def parallel_computation_1(individual_input):
    return sum(1 for i in common_input_1 if i == individual_input)

def parallel_computation_2(individual_input):
    return individual_input in common_input_2

def main():
    multiprocessing.set_start_method('fork')
    global common_input_1
    global common_input_2
    common_input_1 = [1, 2, 3, 1, 1, 3, 1]
    individual_inputs_1 = [0, 1, 2, 3]
    individual_inputs_2 = [0, 1, 2, 3, 4]
    with multiprocessing.Pool() as pool:
        common_input_2 = pool.map(parallel_computation_1, individual_inputs_1)
    with multiprocessing.Pool() as pool:
        common_output = pool.map(parallel_computation_2, individual_inputs_2)
    print(common_output)

if __name__ == '__main__':
    main()
```
As suggested in this answer, I use global variables to share the data. That works if I use `set_start_method('fork')` (which works for me, but seems to be problematic on macOS).
Note that if I remove the second `with multiprocessing.Pool()` so that a single `Pool` is used for both parallel tasks, things won't work (the worker processes don't see the new value of `common_input_2`).
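For illustration, this is roughly the single-pool variant I mean (a sketch only, reusing the globals and functions from the full example above). With `fork`, the workers are created when the `Pool` is constructed, so the later assignment to `common_input_2` is visible only in the parent, and the workers still see `None`:

```python
# Sketch of the single-pool variant described above (assumes the same
# globals and functions as in the full example).
with multiprocessing.Pool() as pool:
    common_input_2 = pool.map(parallel_computation_1, individual_inputs_1)
    # The workers were forked before this assignment, so inside them
    # common_input_2 is still None and the next map fails.
    common_output = pool.map(parallel_computation_2, individual_inputs_2)
```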
Apart from the fact that using global variables seems like bad coding style to me (is it? That's just my gut feeling), the need to start a new pool doesn't please me, as it introduces some probably unnecessary overhead.
What do you think about these concerns, esp. the second one?
Are there good alternatives? I see that I could use `multiprocessing.Array`, but since my data is a list of lists, I would need to flatten it into a single list and use that in `parallel_computation` in some nontrivial way. If my shared input were even more complex, I would have to put quite some effort into wrapping it into `multiprocessing.Value`s or `multiprocessing.Array`s.
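To make that last point concrete, here is a rough sketch of the kind of flattening I mean for a list of lists (the names `flat_data`, `offsets` and `get_sublist` are made up for illustration):

```python
import itertools
import multiprocessing

# Example list of lists, standing in for common_input_1.
common_input_1 = [[1, 2, 3], [1], [1, 3, 1]]

# Flatten into one shared integer array (read-only, so no lock needed) ...
flat_data = multiprocessing.Array(
    'i', list(itertools.chain.from_iterable(common_input_1)), lock=False)

# ... plus a second array of start offsets so the sublists can be recovered.
lengths = [len(sub) for sub in common_input_1]
offsets = multiprocessing.Array(
    'i', list(itertools.accumulate([0] + lengths)), lock=False)

def get_sublist(i):
    # Reconstruct common_input_1[i] from the flat representation.
    return flat_data[offsets[i]:offsets[i + 1]]
```

Even for this simple structure the bookkeeping is noticeable, which is what I mean by "nontrivial".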