I'm simulating a physical system under different initial conditions in Python. Since these realizations are completely independent of each other, I wanted to use the multiprocessing package.
Now, the result of one realization can easily be a few hundred MB (e.g. 20 frames of 2500 px by 2500 px), so I quickly run out of memory. Luckily, I don't care about the individual results, only about the sum (or the average) of the arrays, which has the same shape of 20 x 2500 x 2500, i.e. nsteps x nres x nres.
However, I'm struggling to figure out how to tell when a given process is done, how to access its data, add it to a results array, and free the memory, all while other realizations are still running. There must certainly be a more elegant solution than repeatedly iterating over all the unfinished processes with a "try", like the polling loop sketched below?
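To make that concrete, the busy-waiting pattern I'd like to avoid looks roughly like this (just a sketch; simulate, iterations, nsteps and nres are the same names as in my code below):

import time
import multiprocessing as mp
import numpy as np

if __name__ == '__main__':
    result = np.zeros([nsteps, nres, nres])
    with mp.Pool(6) as pool:
        pending = [pool.apply_async(simulate) for _ in range(iterations)]
        while pending:
            for p in pending[:]:                   # iterate over a copy so we can remove
                try:
                    pictures = p.get(timeout=0)    # raises TimeoutError if not done yet
                except mp.TimeoutError:
                    continue
                result += pictures                 # accumulate, then drop the reference
                pending.remove(p)
            time.sleep(0.1)                        # avoid spinning at 100% CPU

My current code looks like this: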
import multiprocessing as mp
import numpy as np

def simulate():  # simulates one realization
    ...
    return pictures  # np.array of shape (20, 2500, 2500)

if __name__ == '__main__':
    result = np.zeros([nsteps, nres, nres])
    pool = mp.Pool(6)
    processes = [pool.apply_async(simulate) for i in range(iterations)]
    # this collects every realization in memory at once before summing
    result = np.sum([p.get() for p in processes], axis=0)
I have the feeling that for the last line of my code there already exists a solution that does exactly this, without waiting for all elements of processes to be finished.
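For illustration, the kind of pattern I have in mind might look like this (a sketch using pool.imap_unordered, which I believe yields each result as soon as it completes; the simulate_wrapper helper is hypothetical and only exists because imap_unordered expects a one-argument function):

import multiprocessing as mp
import numpy as np

def simulate_wrapper(_):
    # hypothetical wrapper: imap_unordered passes each item of the
    # iterable to the worker function, but simulate() takes no arguments
    return simulate()

if __name__ == '__main__':
    result = np.zeros([nsteps, nres, nres])
    with mp.Pool(6) as pool:
        # results arrive in completion order, so only one
        # (20 x 2500 x 2500) array is held in the parent at a time
        for pictures in pool.imap_unordered(simulate_wrapper, range(iterations)):
            result += pictures

That way the parent process would never hold more than one realization at a time, instead of collecting all of them before summing.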
Please note that even though my function takes no arguments, its return value is different on every call, because random noise is added to the result inside the function body.