Here are two approaches you can take. Which one is wise also depends on where the bottleneck is, and that is best measured rather than guessed.
The ideal option would be to leave all low-level optimization to Numpy. Right now you have a mix of native Python code and Numpy code. The latter doesn't play well with explicit loops: they work, of course, but looping in Python forces the operations to run sequentially, one element at a time, in the order you specified. It's better to hand Numpy operations that it can perform on as many elements at once as possible, i.e. whole-array (matrix) operations. That benefits performance not only through automatic (partial) parallelization; even a single thread can get more out of the CPU that way. A highly recommended read to learn more about this is From Python to Numpy.
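For instance, here is a minimal sketch (with made-up array names) of how a nested Python loop that fills an array element by element can be replaced by a single broadcast expression:

import numpy as np

a = np.random.rand(20)
b = np.random.rand(30)

# Loop version: every element is computed one at a time in the interpreter
out_loop = np.zeros((20, 30))
for i in range(20):
    for j in range(30):
        out_loop[i, j] = a[i] * b[j]

# Vectorized version: one broadcasted operation; the looping happens in compiled code
out_vec = a[:, np.newaxis] * b[np.newaxis, :]

assert np.allclose(out_loop, out_vec)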
If you do need to parallelize pure Python code, you have little choice but to use multiple processes. For that, refer to the multiprocessing module. Rearrange the code into three steps:
- Preparing the inputs for every job
- Dividing those jobs between a pool of workers to be run in parallel (fork/map)
- Collecting the results (join/reduce)
You need to strike a balance: the jobs should be numerous enough to make parallelizing worthwhile, but not so small and short-lived that the cost of spinning up processes and communicating with them becomes significant by itself.
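If the individual jobs do turn out to be tiny, the chunksize argument of Pool.map lets you send them to the workers in batches, which reduces that communication overhead. A minimal sketch (the function and numbers are made up for illustration):

from multiprocessing import Pool

def tiny_job(x):
    return x * x

if __name__ == '__main__':
    with Pool() as pool:
        # Each worker receives batches of 100 jobs instead of one job per message
        squares = pool.map(tiny_job, range(100000), chunksize=100)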
A simple solution would be to generate a list of (i,j) pairs, so that there will be nx*ny jobs. Then make a function that takes such a pair as input and returns a list of (i,j,k,p,meanval) tuples. Try to use only the function's inputs and to return a result: keep everything local, no side effects, et cetera. Read-only access to globals such as myList1 is okay, but modifying them requires the special measures described in the multiprocessing documentation. Pass the function and the list of inputs to a worker pool. Once it has finished producing the partial results, combine all of those into your store.
Here's an example:
from multiprocessing import Pool
import numpy as np
# Global variables are OK, as long as their contents are not modified, although
# these might just as well be moved into the worker function or an initializer
nx = 20
ny = 30
myList1 = [0]*100
myList2 = [1]*25
value1 = np.zeros(nx)
value2 = np.zeros(ny)
def calc_meanvals_for(pair):
    """Process a reasonably sized chunk of the problem"""
    i, j = pair
    f = calc(value1[i], value2[j])
    results = []
    for k, data1 in enumerate(myList1):
        for p, data2 in enumerate(myList2):
            meanval = np.sum(f[:]/data1)*data2
            results.append((i,j,k,p,meanval))
    return results
# This module will be imported by every worker - that's how they will be able
# to find the global variables and the calc function - so make sure to check
# if this is the main program, because without that, every worker will start more
# workers, each of which will start even more, and so on, in an endless loop
if __name__ == '__main__':
    # Create a pool of worker processes, each able to use a CPU core
    pool = Pool()
    # Prepare the arguments, one per function invocation (tuples to fake multiple)
    arg_pairs = [(i,j) for i in range(nx) for j in range(ny)]
    # Now comes the parallel step: given a function and a list of arguments,
    # have a worker invoke that function with one argument until all arguments
    # have been used, collecting the return values in a list
    return_values = pool.map(calc_meanvals_for, arg_pairs)
    # Since the function also returns a list, there's now a list of lists - consider
    # itertools.chain.from_iterable to flatten them - to be processed further
    store = np.zeros((nx, ny, len(myList1), len(myList2)))
    for results in return_values:
        for i, j, k, p, meanval in results:
            store[i,j,k,p] = meanval
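As the comment in the code suggests, the final nested loop can equivalently be written by flattening the list of lists with itertools.chain.from_iterable first:

from itertools import chain

# One flat stream of (i,j,k,p,meanval) tuples instead of a list per job
for i, j, k, p, meanval in chain.from_iterable(return_values):
    store[i, j, k, p] = meanval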