Here are two approaches you can take. Which one is wise also depends on where the bottleneck is, and that is best measured rather than guessed.
The ideal option would be to leave all low-level optimization to Numpy. Right now you have a mix of native Python code and Numpy code. The latter doesn't play well with explicit loops: they work, of course, but looping in Python forces the operations to run sequentially, one element at a time, in the order you specified. It's better to hand Numpy operations that it can perform on as many elements at once as possible, i.e. whole-array (matrix) operations. That benefits performance not only through automatic (partial) parallelization; even a single thread can get more out of the CPU that way. A highly recommended read to learn more about this is From Python to Numpy.
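For instance, here is a minimal sketch (with made-up array names) of how a nested Python loop that fills an array element by element can be replaced by a single broadcast expression:

import numpy as np

a = np.random.rand(20)
b = np.random.rand(30)

# Loop version: every element is computed one at a time in the interpreter
out_loop = np.zeros((20, 30))
for i in range(20):
    for j in range(30):
        out_loop[i, j] = a[i] * b[j]

# Vectorized version: one broadcasted operation; the looping happens in compiled code
out_vec = a[:, np.newaxis] * b[np.newaxis, :]

assert np.allclose(out_loop, out_vec)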
If you do need to parallelize pure Python code, you have little choice but to use multiple processes. For that, refer to the multiprocessing module. Rearrange the code into three steps:
- Preparing the inputs for every job
- Dividing those jobs between a pool of workers to be run in parallel (fork/map)
- Collecting the results (join/reduce)
You need to strike a balance: the jobs should be numerous enough to make parallelizing worthwhile, but not so small and short-lived that the cost of spinning up processes and communicating with them becomes significant by itself.
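If the individual jobs do turn out to be tiny, the chunksize argument of Pool.map lets you send them to the workers in batches, which reduces that communication overhead. A minimal sketch (the function and numbers are made up for illustration):

from multiprocessing import Pool

def tiny_job(x):
    return x * x

if __name__ == '__main__':
    with Pool() as pool:
        # Each worker receives batches of 100 jobs instead of one job per message
        squares = pool.map(tiny_job, range(100000), chunksize=100)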
A simple solution would be to generate a list of (i,j) pairs, so that there will be nx*ny jobs. Then make a function that takes such a pair as input and returns a list of (i,j,k,p,meanval) tuples. Try to use only the function's inputs and to return a result: keep everything local, no side effects, et cetera. Read-only access to globals such as myList1 is okay, but modifying them requires the special measures described in the multiprocessing documentation. Pass the function and the list of inputs to a worker pool. Once it has finished producing the partial results, combine all of those into your store.
Here's an example:
from multiprocessing import Pool
import numpy as np
# Global variables are OK, as long as their contents are not modified, although
# these might just as well be moved into the worker function or an initializer
nx = 20
ny = 30
myList1 = [0]*100
myList2 = [1]*25
value1 = np.zeros(nx)
value2 = np.zeros(ny)
def calc_meanvals_for(pair):
    """Process a reasonably sized chunk of the problem"""
    i, j = pair
    f = calc(value1[i], value2[j])
    results = []
    for k, data1 in enumerate(myList1):
        for p, data2 in enumerate(myList2):
            meanval = np.sum(f[:]/data1)*data2
            results.append((i,j,k,p,meanval))
    return results
# This module will be imported by every worker - that's how they will be able
# to find the global variables and the calc function - so make sure to check
# if this is the main program, because without that, every worker will start more
# workers, each of which will start even more, and so on, in an endless loop
if __name__ == '__main__':
    # Create a pool of worker processes, each able to use a CPU core
    pool = Pool()
    # Prepare the arguments, one per function invocation (tuples to fake multiple)
    arg_pairs = [(i,j) for i in range(nx) for j in range(ny)]
    # Now comes the parallel step: given a function and a list of arguments,
    # have a worker invoke that function with one argument until all arguments
    # have been used, collecting the return values in a list
    return_values = pool.map(calc_meanvals_for, arg_pairs)
    # Since the function also returns a list, there's now a list of lists - consider
    # itertools.chain.from_iterable to flatten them - to be processed further
    store = np.zeros((nx, ny, len(myList1), len(myList2)))
    for results in return_values:
        for i, j, k, p, meanval in results:
            store[i,j,k,p] = meanval
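As the comment in the code suggests, the final nested loop can equivalently be written by flattening the list of lists with itertools.chain.from_iterable first:

from itertools import chain

# One flat stream of (i,j,k,p,meanval) tuples instead of a list per job
for i, j, k, p, meanval in chain.from_iterable(return_values):
    store[i, j, k, p] = meanval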