
I am working on a problem that allows for some rather unproblematic parallelisation. I am having difficulties figuring out what suitable parallelising mechanisms are available in Python. I am working with Python 3.9 on macOS.

My pipeline is:

  • get_common_input() acquires some data in a way that is not easily parallelisable. If that matters, its return value common_input_1 is a list of lists of integers.
  • parallel_computation_1() gets the common_input_1 and an individual input from a list individual_inputs. The common input is only read.
  • common_input_2 is more or less the collected output of parallel_computation_1().
  • parallel_computation_2() then again gets common_input_2 as read only input, plus some individual input.

I could do the following:

import multiprocessing
common_input_1 = None
common_input_2 = None

def parallel_computation_1(individual_input):
    return sum(1 for i in common_input_1 if i == individual_input)

def parallel_computation_2(individual_input):
    return individual_input in common_input_2

def main():
    multiprocessing.set_start_method('fork')
    global common_input_1
    global common_input_2
    common_input_1      = [1, 2, 3, 1, 1, 3, 1]
    individual_inputs_1 = [0,1,2,3]
    individual_inputs_2 = [0,1,2,3,4]
    with multiprocessing.Pool() as pool:
        common_input_2 = pool.map(parallel_computation_1, individual_inputs_1)
    with multiprocessing.Pool() as pool:
        common_output = pool.map(parallel_computation_2, individual_inputs_2)
    print(common_output)

if __name__ == '__main__':
    main()

As suggested in this answer, I use global variables to share the data. That works if I use set_start_method('fork') (which works for me, but seems to be problematic on macOS).
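For reference, fork can also be requested just for a particular pool via multiprocessing.get_context, instead of changing the interpreter-wide start method. A minimal, self-contained sketch (not my real pipeline); note that on macOS the default start method has been spawn since Python 3.8, and the documentation warns that fork can be unsafe there:

import multiprocessing

common_input_1 = [1, 2, 3, 1, 1, 3, 1]

def parallel_computation_1(individual_input):
    return sum(1 for i in common_input_1 if i == individual_input)

def main():
    # Request fork only for this pool via a context object, instead of
    # calling multiprocessing.set_start_method('fork') globally.
    ctx = multiprocessing.get_context('fork')
    with ctx.Pool() as pool:
        print(pool.map(parallel_computation_1, [0, 1, 2, 3]))

if __name__ == '__main__':
    main()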

Note that if I remove the second with multiprocessing.Pool() and reuse a single Pool for both parallel tasks, things won't work: the worker processes don't see the new value of common_input_2, because they were forked before it was assigned.
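For completeness, a minimal sketch (not my original code) of the initializer/initargs pattern: the common data is handed to each worker once, when the pool starts, so it also works with the spawn start method. It does not remove the need for one pool per stage, since the data must exist before the pool is created:

import multiprocessing

_common_input_2 = None

def _init_worker(common_input_2):
    # Runs once in every worker at pool start-up and stores the read-only
    # data in a module-level name inside that worker.
    global _common_input_2
    _common_input_2 = common_input_2

def parallel_computation_2(individual_input):
    return individual_input in _common_input_2

def main():
    common_input_2 = [0, 4, 1, 2]
    with multiprocessing.Pool(initializer=_init_worker,
                              initargs=(common_input_2,)) as pool:
        print(pool.map(parallel_computation_2, [0, 1, 2, 3, 4]))

if __name__ == '__main__':
    main()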

Apart from the fact that using global variables seems like bad coding style to me (Is it? That's just my gut feeling), the need to start a new pool doesn't please me, as it introduces some probably unnecessary overhead.

What do you think about these concerns, esp. the second one?

Are there good alternatives? I see that I could use multiprocessing.Array, but since my data is a list of lists, I would need to flatten it into a single array and use that in the parallel_computation functions in some nontrivial way. If my shared input were even more complex, I would have to put quite some effort into wrapping it into multiprocessing.Value or multiprocessing.Array objects.
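To illustrate the kind of flattening I mean, a rough sketch (function names are only illustrative) that packs a list of lists of integers into one flat multiprocessing.Array plus an array of start offsets, from which the sublists can be reconstructed:

import itertools
import multiprocessing

def flatten_to_shared(list_of_lists):
    # One flat shared array of values plus the start offset of each sublist.
    flat = list(itertools.chain.from_iterable(list_of_lists))
    offsets = [0]
    for sub in list_of_lists:
        offsets.append(offsets[-1] + len(sub))
    shared_flat = multiprocessing.Array('i', flat, lock=False)
    shared_offsets = multiprocessing.Array('i', offsets, lock=False)
    return shared_flat, shared_offsets

def get_sublist(shared_flat, shared_offsets, k):
    # Recover the k-th original sublist from the flat representation.
    return list(shared_flat[shared_offsets[k]:shared_offsets[k + 1]])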

Bubaya
  • I assume you are using a multi-process approach because the computation is CPU-intensive, so the time to create a process pool should be negligible compared to that. – Ionut Ticus Nov 13 '21 at 09:53
  • Regarding the global variables: they can make the code hard to follow if you have many functions that modify them (especially in large projects); in your case you're not modifying the state so it shouldn't be an issue. – Ionut Ticus Nov 13 '21 at 10:05
  • @IonutTicus But am I right in suspecting that reading from global variables is rather slow? – Bubaya Nov 16 '21 at 09:14
  • It's true that accessing a global variable is slower than accessing a local variable because of how name lookup works, but it's still negligible even if you access it thousands of times; you can create a local reference (preferably to the portion of the data you're going to use) to alleviate some of the overhead. – Ionut Ticus Nov 16 '21 at 10:06
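For illustration, a minimal sketch of the local-reference idea mentioned in the comment above (using the toy data from the question):

common_input_1 = [1, 2, 3, 1, 1, 3, 1]

def parallel_computation_1(individual_input):
    # Bind the global to a local name once; subsequent accesses inside the
    # generator expression are local-variable lookups.
    data = common_input_1
    return sum(1 for i in data if i == individual_input)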

1 Answer


You can define and compute output_1 as a global variable before creating your process pool; that way each worker process will have access to the data. This won't result in any memory duplication, because you're not changing that data (copy-on-write).

from multiprocessing import Pool

_output_1 = serial_computation()


def parallel_computation(input_2):
    # here you can access _output_1
    # do not modify it, as that would trigger copy-on-write and create a copy in the child process
    ...


def main():
    input_2 = ...
    with Pool() as pool:
        output_2 = pool.map(parallel_computation, input_2)
Ionut Ticus
  • I see, this seems to work (if I use fork as the start method; I should've said that I work on macOS, but so far that works). What do I do if I have parallel computations -> common output -> another parallel computation? I would have to assign the common output to a global variable and start a new pool, wouldn't I? It seems that I cannot reuse the same pool, because if the global variable is set in the master process after the pool is launched, the already-running workers don't get the new value. Any possibility that allows launching a pool only once? – Bubaya Nov 11 '21 at 11:17
  • Thanks for your answer btw. I've reworked my question, incorporating your suggestion. – Bubaya Nov 11 '21 at 11:40
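Regarding the question in the comments about launching a pool only once: a minimal sketch (not from the answer) of passing the common data along with each task via functools.partial. This works with the default spawn start method as well and lets one pool serve both stages, at the cost of pickling the common input for every task:

import functools
import multiprocessing

def parallel_computation_1(common_input_1, individual_input):
    return sum(1 for i in common_input_1 if i == individual_input)

def parallel_computation_2(common_input_2, individual_input):
    return individual_input in common_input_2

def main():
    common_input_1 = [1, 2, 3, 1, 1, 3, 1]
    with multiprocessing.Pool() as pool:
        # The common data travels with each task, so the same pool can be
        # reused for both stages and no fork-time globals are needed.
        common_input_2 = pool.map(
            functools.partial(parallel_computation_1, common_input_1),
            [0, 1, 2, 3])
        common_output = pool.map(
            functools.partial(parallel_computation_2, common_input_2),
            [0, 1, 2, 3, 4])
    print(common_output)

if __name__ == '__main__':
    main()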