I am a former Matlab-only user who is self-learning Python, and I am now at the point of learning Python's parallel processing. Matlab's parallel programming is, for the most part, performed with a single command: parfor. Simple.

There seem to be many more options available in Python than in Matlab, which is overwhelming. For instance, in the multiprocessing package I find the Pool and Process classes, each of which accepts several arguments and options.

Given that I am new to Python and teaching myself, would someone take a few minutes to explain the big-picture difference between Pool and Process? When is it more appropriate to use Pool vs. Process?

For context: my current programming work requires me to parallelize a function that outputs a vector of data. This function accepts several arguments, but its body is essentially a for-loop that I would like to run in parallel.

Many thanks for your help!

matt1011

1 Answer

The Process class of the multiprocessing module is intended to run a Python callable (a function, or a class instance implementing the __call__ method/protocol) in a separate Python process, thus executing the callable in parallel. Example:

import time
import multiprocessing


def stall(secs: int):
    time.sleep(secs)
    print('Slept for', secs, 'seconds')


if __name__ == '__main__':
    # create process which will be executed in parallel
    proc = multiprocessing.Process(target=stall, args=(2,))

    # start parallel process execution
    proc.start()

    # execute code in main process...
    print('I am in main process')

    # wait for parallel executed process to finish
    proc.join()
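
Note that the Process class does not return the target function's result directly. A common pattern for getting data back from the worker is to pass the process a multiprocessing.Queue; here is a minimal sketch of that approach, with a hypothetical compute function (this queue-based technique is one of the options mentioned in the comments below, not part of the original example):

import multiprocessing


def compute(secs: int, queue):
    # put the "result" (here just the argument doubled) on the shared queue
    queue.put(secs * 2)


if __name__ == '__main__':
    queue = multiprocessing.Queue()

    # pass the queue to the worker so it can send its result back
    proc = multiprocessing.Process(target=compute, args=(2, queue))
    proc.start()

    # read the worker's result (blocks until a value is available),
    # then wait for the process to finish
    result = queue.get()
    proc.join()
    print('Worker returned', result)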

A Pool represents a collection of such worker processes. The number of workers in the pool is set by the processes argument of the class constructor (by default it is equal to the value returned by os.cpu_count()). So if you want to execute different functions, or the same function with different arguments, in parallel, you can use a pool of workers instead of creating the processes manually. Example:

import time
import multiprocessing


def stall(secs: int):
    time.sleep(secs)
    print('Slept for', secs, 'seconds')


if __name__ == '__main__':
    # create pool of parallel executed workers
    pool = multiprocessing.Pool()

    # call the 'stall' function in parallel with different arguments
    for secs in (3, 2, 4, 1):
        pool.apply_async(stall, (secs,))

    # or just apply parallel version of the 'map' function
    # pool.map_async(stall, (3, 2, 4, 1))

    # close the pool of workers
    pool.close()

    # execute code in main process...
    print('I am in main process')

    # wait for all processes in the pool to finish
    pool.join()

As stated in the documentation for Pool:

It supports asynchronous results with timeouts and callbacks and has a parallel map implementation.
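
For example, apply_async returns an AsyncResult whose get() method blocks until the value is ready, and the blocking map returns the results as a list in input order. A minimal sketch, using a hypothetical square function (map fits the asker's case of a loop that produces one output per iteration):

import multiprocessing


def square(x: int):
    return x * x


if __name__ == '__main__':
    pool = multiprocessing.Pool()

    # apply_async returns an AsyncResult; get() blocks until the value is ready
    async_result = pool.apply_async(square, (3,))
    print(async_result.get(timeout=10))  # -> 9

    # map blocks and returns results in input order, one output per input
    print(pool.map(square, range(5)))    # -> [0, 1, 4, 9, 16]

    pool.close()
    pool.join()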

Hope this will be helpful.

  • Thanks, Vladimir. I am still not completely understanding: (1) In your Process example, wouldn't the 'stall' function only be executed once, given that it does not contain a for-loop and args consists of only one value? So this is a serial example, right? Were one to add more arguments to args, how would this differ from your Pool example? (2) In your Pool example, what does 'pool.map_async' do? – matt1011 Sep 09 '19 at 14:39
  • The example with the Process class shows how to call a function in parallel, without blocking the main process. At the point of calling `proc.start()`, the Python interpreter "creates" a separate process and runs `stall(2)` in it. `pool.map_async` does the same without the explicit creation of a Process instance. As shown in the example, you can call the same function with different arguments in parallel – Vladimir Poghosyan Sep 09 '19 at 15:26
  • Thanks for that. Another question: in both the Pool and Process examples, can 'proc' and 'pool' return information? I realize that the 'stall' function only prints timing information. However, if the 'stall' function were written to create and output a vector array of numerical timing information, could this array be recovered after 'stall' is run in parallel? That is, is there a way to get this data via proc.*something* or pool.*something*? This is related to how I am writing my code in parallel. – matt1011 Sep 09 '19 at 18:36
  • The `apply_async` and `map_async` methods of the Pool class provide a callback argument: a function that is called when the result becomes ready. This thread may be useful for reading data from a forked process: https://stackoverflow.com/questions/10415028/how-can-i-recover-the-return-value-of-a-function-passed-to-multiprocessing-proce . The thread describes usage of the Manager class, and you can also use shared queues for that: https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Queue – Vladimir Poghosyan Sep 09 '19 at 19:37
  • Thank you for this. Two more follow-up questions: does this mean that the Process class does not allow for callback functions (i.e., only the Pool class does)? Also, what is a "forked process"? – matt1011 Sep 10 '19 at 12:28
  • You are right, the Process class does not support callbacks. You must use Queues or Managers for synchronization (getting data out of the separate process). A "forked process" is a subprocess (spinoff, child process) of the main process. – Vladimir Poghosyan Sep 10 '19 at 12:39
  • Thank you for this, Vladimir. Very helpful and appreciated – matt1011 Sep 11 '19 at 13:46