
I have a function that creates a large mask (a boolean array). I want to call this function several times and create a total mask of the same shape that is True at the indices that are True in any of the individual masks.

Since calculating the masks takes a long time, I have parallelized it, but now it consumes a lot of memory because I first create all the individual masks and only then combine them, which means I have to store all ~40,000 individual masks. Is there a way to add each returned individual mask to a total mask before the next mask is calculated, while still using multiprocessing?

This is an example code for the problem:

import numpy as np
from multiprocessing import Pool


def return_something(seed):
    np.random.seed(seed)
    return np.random.choice([True, False], size=shape, p=[0.1, 0.9])


shape = (50, 50)
ncores = 4
seeds = np.random.randint(low=0, high=np.iinfo(np.int32).max, size=10)

# Without parallelisation, very slow:
mask = np.zeros(shape, dtype=bool)
for seed in seeds:
    mask |= return_something(seed)


# With parallelisation, takes too much memory
p = Pool(ncores)
mask_parallel = np.any(list(p.imap(return_something, seeds)), axis=0)

I think I do not understand the (i)map functions well enough. I know that Pool.imap returns an iterator, and that it is possible to show, for example, a progress bar using tqdm with the following code:

list(tqdm.tqdm(p.imap(fct, inputs), total=len(inputs)))

Since the progress bar is updated while the pool is running, it must be possible to access the results during the run and perhaps sum them up as they arrive, but I do not know how.

Thanks for your help!

1 Answer


Iterating through the seeds does not make much sense if you're creating a very large array each time in return_something. Instead, you can slice the array creation into sub-creations and iterate over those. The Pool.map() method returns a list with the result of the executed function for each input. To show you the general idea for your case: I simply parallelize the creation of each row and collect the rows via the map() function.

import numpy as np
import multiprocessing as mp

def return_something(i):
    # create a single row of the mask; `i` only serves to dispatch the work
    # note: the workers are not seeded here; see the comments below about
    # duplicate results when forked workers share the parent's RNG state
    mask = np.random.choice([True, False], size=(shape[1],), p=[0.1, 0.9])
    return mask

shape = (5000, 5000)

if __name__ == "__main__":
    pool = mp.Pool(mp.cpu_count())
    results = pool.map(return_something, range(shape[0]))
    pool.close()
    print(len(results))
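
Since map() returns the rows in input order, the only step missing above is stacking them back into the full mask. A minimal sketch of that assembly step (appended inside the same __main__ block):

    full_mask = np.vstack(results)  # rows come back in input order
    print(full_mask.shape)          # -> (5000, 5000)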

Regarding your comments, here is a way to append the resulting items to a list as soon as they are computed (on the fly):

import numpy as np
from multiprocessing import Pool
import time

def return_something(seed):
    np.random.seed(seed)
    return np.random.choice([True, False], size=shape, p=[0.1, 0.9])


shape = (50, 50)
ncores = 4
seeds = np.random.randint(low=0, high=np.iinfo(np.int32).max, size=100000)

mask = []

if __name__ == "__main__":
    p = Pool(ncores)
    start = time.time()
    for res in p.imap(return_something, seeds, chunksize=1):
        mask.append(res)
        print("{} (Time elapsed: {}s)".format(len(res), time.time() - start))

    p.close()
    print(len(mask))
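
If the individual masks are not needed afterwards, the question's serial |= loop and imap can be combined directly, so only one extra mask is alive at a time instead of the whole list. A minimal sketch (the chunksize of 100 is an arbitrary choice):

import numpy as np
from multiprocessing import Pool

def return_something(seed):
    np.random.seed(seed)
    return np.random.choice([True, False], size=shape, p=[0.1, 0.9])

shape = (50, 50)
ncores = 4
seeds = np.random.randint(low=0, high=np.iinfo(np.int32).max, size=100000)

if __name__ == "__main__":
    total_mask = np.zeros(shape, dtype=bool)
    with Pool(ncores) as p:
        # imap yields results as they finish (in input order), so each
        # mask can be OR-ed into the total and then discarded
        for res in p.imap(return_something, seeds, chunksize=100):
            total_mask |= res
    print(total_mask.sum())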
Alexander Riedel
  • Thanks, but return_something is just defined to create a working example; it has nothing to do with the function that I use in my code (which does not use random numbers). Also, in your example all results of the iterations are stored; I am looking for a way to immediately sum the results up without storing all individual results. – Peter Lustig Nov 19 '20 at 12:14
  • Just to explain why I manually set the seed here: all workers start with the same seed, so if you do not set random seeds, as in your example, you will get each random array ncores times; the result of iteration i will be the same for every worker. – Peter Lustig Nov 19 '20 at 12:25
  • Ok, got you! Maybe check here https://stackoverflow.com/a/26521507/11951277 for how map and imap make a difference. imap indeed lets you access the results of the function on the fly. I also edited my post. – Alexander Riedel Nov 19 '20 at 16:03
  • Great, thanks a lot! If you agree with what I said about the seed, I would still suggest removing the first part of your answer (or mentioning the behavior in it), because if the results rely on random numbers, the parallelisation in your first example makes the statistics worse. Assuming you have Nw workers that each need the same time for an iteration step, and Nit iterations in total, you will only get Nit / Nw different results, each appearing Nw times, since all workers start with the same seed from the parent process. This is dangerous because one would assume Nit independent results. – Peter Lustig Nov 20 '20 at 07:49
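
To illustrate the pitfall from the last comment: with the fork start method (the default on Linux), every worker inherits the parent's NumPy RNG state, so unseeded workers can return duplicate arrays. A small hypothetical check (the exact count is indicative only, since task scheduling across workers is dynamic):

import numpy as np
from multiprocessing import Pool

def unseeded(_):
    # no np.random.seed() here: forked workers start from identical RNG state
    return np.random.choice([True, False], size=(50, 50), p=[0.1, 0.9])

if __name__ == "__main__":
    with Pool(4) as p:
        results = p.map(unseeded, range(8))
    # count distinct arrays; with forked, unseeded workers this is
    # typically fewer than 8
    print(len({r.tobytes() for r in results}))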