I have a function, let's call it randomized(). I need to run it on a list, items, of identical objects.

When I run this with a simple for loop, like this

results = []
for item in items:
    results.append(randomized(item))

then results contains a list of elements that differ from each other, as the function randomized() is non-deterministic.

However, when I want to run this in parallel, like this:

pool = multiprocessing.Pool()
results = pool.map(randomized, items)

then the results list contains identical objects most of the time. Only rarely does it contain a single one that differs from the others.

Why the difference?

astromonkey
  • I added a `numpy.random.seed()` in the `randomized()` function, however it seems that still many of the result items are identical, even though it improved a bit... – astromonkey Jun 30 '21 at 12:34
  • as pointed out here: https://stackoverflow.com/questions/6914240/multiprocessing-pool-seems-to-work-in-windows-but-not-in-ubuntu it seems that "on Unix every worker process inherits the same state of the random number generator from the parent process" – astromonkey Jun 30 '21 at 12:37

1 Answer

The trick is to initialize each process in the pool so that the random number generator is seeded with a unique seed. This is achieved by using the initializer argument of the Pool constructor.

The first demo uses the same seed for each process in the pool and shows that all processes return the same random numbers (this is not what you want, because each process in the pool starts off with a random number generator in the same initial state):

import numpy as np
import multiprocessing
import time

def init_pool():
    np.random.seed(1)

def worker(i):
    # ensure each process in the pool processes one request each:
    time.sleep(1)
    return multiprocessing.current_process().pid, np.random.random()

if __name__ == '__main__':
    pool = multiprocessing.Pool(8, initializer=init_pool)
    results = pool.map(worker, range(8))
    for pid, number in results:
        print(f'pid={pid}, random number={number}')

Prints:

pid=46512, random number=0.417022004702574
pid=3444, random number=0.417022004702574
pid=13716, random number=0.417022004702574
pid=10800, random number=0.417022004702574
pid=47360, random number=0.417022004702574
pid=49932, random number=0.417022004702574
pid=51144, random number=0.417022004702574
pid=27360, random number=0.417022004702574

Note that on Linux/Unix, when processes are created with fork, the state of the random number generator is inherited by all processes in the pool, so they automatically start in the same state even without a pool-initializer function. The initializer in the above code is only needed to reproduce the effect on a platform such as Windows, which uses spawn to create new processes.
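
The inheritance described above can be observed directly. The following is a minimal sketch, assuming a POSIX system where the fork start method is available; draw() and first_draws() are hypothetical helper names, not part of the question's code. Each child process reports the first number it draws from NumPy's global generator, and with fork all children report the same value.

```python
import multiprocessing

import numpy as np


def draw(queue):
    # Hypothetical helper: send back the first number this process draws
    # from NumPy's global (legacy) random number generator.
    queue.put(np.random.random())


def first_draws(n_children=4, start_method='fork'):
    # Assumption: 'fork' is available (Linux/Unix); it is not on Windows.
    ctx = multiprocessing.get_context(start_method)
    queue = ctx.Queue()
    children = [ctx.Process(target=draw, args=(queue,))
                for _ in range(n_children)]
    for child in children:
        child.start()
    # Collect one value per child before joining.
    values = [queue.get() for _ in children]
    for child in children:
        child.join()
    return values


if __name__ == '__main__':
    np.random.seed(1)  # seed the parent; forked children copy this state
    print(first_draws())
```

No pool initializer is involved here: the children simply copy the parent's generator state at fork time.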

The next demo initializes each process's random number generator with the current process's pid (this is what you want, since the processes in the pool run concurrently and so have distinct pids, which means each one starts off with a random number generator in its own state):

import numpy as np
import multiprocessing
import time

def init_pool():
    np.random.seed(multiprocessing.current_process().pid)

def worker(i):
    # ensure each process in the pool processes one request each:
    time.sleep(1)
    return multiprocessing.current_process().pid, np.random.random()

if __name__ == '__main__':
    pool = multiprocessing.Pool(8, initializer=init_pool)
    results = pool.map(worker, range(8))
    for pid, number in results:
        print(f'pid={pid}, random number={number}')

Prints:

pid=19460, random number=0.5645643493822622
pid=23612, random number=0.5480593060571878
pid=28288, random number=0.2637370242174355
pid=6440, random number=0.6107958535345932
pid=24452, random number=0.6173634654672119
pid=1716, random number=0.2570898341750626
pid=14912, random number=0.11239641110464715
pid=49184, random number=0.34255660011034006
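
As an alternative sketch (a different technique from the pid-based seeding above), NumPy's SeedSequence can derive one independent seed per item, so that each call gets its own generator regardless of which pool process runs it. The body of randomized() below is a hypothetical stand-in for the question's function, and run_parallel() and root_seed are names introduced only for illustration.

```python
import multiprocessing

import numpy as np


def randomized(args):
    # Hypothetical stand-in for the question's function: it receives the
    # item together with its own seed and builds a private generator.
    item, seed_seq = args
    rng = np.random.default_rng(seed_seq)
    return item + rng.random()


def run_parallel(items, root_seed=None, processes=4):
    # Derive one statistically independent child seed per item.
    child_seeds = np.random.SeedSequence(root_seed).spawn(len(items))
    with multiprocessing.Pool(processes) as pool:
        return pool.map(randomized, list(zip(items, child_seeds)))


if __name__ == '__main__':
    for value in run_parallel([0.0] * 8, root_seed=12345):
        print(value)
```

Because the seed travels with the item rather than with the process, passing the same root_seed should reproduce the same results on every run, while root_seed=None draws fresh entropy each time.
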
Booboo