I'm simulating a physical system under different initial conditions in Python. Since these realizations are completely independent of each other, I wanted to use the multiprocessing package.
Now, the result of one realization can easily be a few hundred MB (e.g. 20 frames of 2500 px by 2500 px), so I quickly run out of memory. Luckily, I don't care about the individual results, only about the sum (or the average) of the arrays, which has the same shape of 20 x 2500 x 2500, i.e. nsteps x nres x nres.
However, I'm struggling to figure out how to tell when a given process is done, how to access its data, add it to a results array, and free the memory, all while other realizations are still running. There must certainly be a more elegant solution than repeatedly iterating over all the unfinished processes with a "try", like the polling loop sketched below?
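To make that concrete, the busy-waiting pattern I'd like to avoid looks roughly like this (just a sketch; simulate, iterations, nsteps and nres are the same names as in my code below):

import time
import multiprocessing as mp
import numpy as np

if __name__ == '__main__':
    result = np.zeros([nsteps, nres, nres])
    with mp.Pool(6) as pool:
        pending = [pool.apply_async(simulate) for _ in range(iterations)]
        while pending:
            for p in pending[:]:                   # iterate over a copy so we can remove
                try:
                    pictures = p.get(timeout=0)    # raises TimeoutError if not done yet
                except mp.TimeoutError:
                    continue
                result += pictures                 # accumulate, then drop the reference
                pending.remove(p)
            time.sleep(0.1)                        # avoid spinning at 100% CPU

My current code looks like this: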
import multiprocessing as mp
import numpy as np

def simulate():  # simulates one realization
    ...
    return pictures  # np.array of shape (20, 2500, 2500)

if __name__ == '__main__':
    result = np.zeros([nsteps, nres, nres])
    pool = mp.Pool(6)
    processes = [pool.apply_async(simulate) for i in range(iterations)]
    # this collects every realization in memory at once before summing
    result = np.sum([p.get() for p in processes], axis=0)
I have the feeling that for the last line of my code there already exists a solution that does exactly this, without waiting for all elements of processes to be finished.
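For illustration, the kind of pattern I have in mind might look like this (a sketch using pool.imap_unordered, which I believe yields each result as soon as it completes; the simulate_wrapper helper is hypothetical and only exists because imap_unordered expects a one-argument function):

import multiprocessing as mp
import numpy as np

def simulate_wrapper(_):
    # hypothetical wrapper: imap_unordered passes each item of the
    # iterable to the worker function, but simulate() takes no arguments
    return simulate()

if __name__ == '__main__':
    result = np.zeros([nsteps, nres, nres])
    with mp.Pool(6) as pool:
        # results arrive in completion order, so only one
        # (20 x 2500 x 2500) array is held in the parent at a time
        for pictures in pool.imap_unordered(simulate_wrapper, range(iterations)):
            result += pictures

That way the parent process would never hold more than one realization at a time, instead of collecting all of them before summing.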
Please note that even though my function takes no arguments, its return value is different on every call, because random noise is added to the result inside the function body.