
In the example code below, I was trying to adapt the accepted answer in this thread. The goal is to use multiprocessing to generate independent random normal numbers (in the example below I just want 3 of them). This is a baby version of more complicated code where a random number generator is used in defining the trial function.

Example Code

import multiprocessing
import numpy as np

def trial(procnum, return_dict):
    p = np.random.randn(1)
    num = procnum
    return_dict[procnum] = p, num

if __name__ == '__main__':
    manager = multiprocessing.Manager()
    return_dict = manager.dict()
    jobs = []
    for i in range(3):
        p = multiprocessing.Process(target=trial, args=(i,return_dict))
        jobs.append(p)
        p.start()

    for proc in jobs:
        proc.join()
    print(return_dict.values())

However, the output gives me the same random number every time, rather than an independent random number for each entry in return_dict.

Output

[(array([-1.08817286]), 0), (array([-1.08817286]), 1), (array([-1.08817286]), 2)]

I feel like this is a really silly mistake. Can someone expose my silliness please :)

Zhengyan Shi

2 Answers


It's not a silly mistake: it has to do with the way numpy's random state is carried over into the forked worker processes, so each child starts from an identical copy of the parent's generator state. Read more here: https://discuss.pytorch.org/t/why-does-numpy-random-rand-produce-the-same-values-in-different-cores/12005
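
To see the inheritance directly, here is a minimal sketch of mine (an addition, assuming a POSIX system where the "fork" start method is available): every forked child reports the same opening words of numpy's Mersenne Twister state vector as the parent, which is why their first draws coincide.

import multiprocessing

import numpy as np

def show_state(tag):
    # get_state() returns ('MT19937', key_vector, pos, ...); the first few
    # words of the key vector are enough to see the state is identical
    # in every process.
    print(tag, np.random.get_state()[1][:3])

if __name__ == '__main__':
    ctx = multiprocessing.get_context('fork')  # explicit; POSIX-only
    show_state('parent')
    jobs = [ctx.Process(target=show_state, args=(f'child {i}',)) for i in range(3)]
    for p in jobs:
        p.start()
    for p in jobs:
        p.join()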

The fix is to re-seed numpy inside each worker, drawing the seed from a large range:

import multiprocessing
import numpy as np
import random

def trial(procnum, return_dict):
    # Re-seed numpy's generator in each worker; otherwise forked children
    # inherit the parent's RNG state and all draw the same numbers.
    np.random.seed(random.randint(0, 100000))
    p = np.random.randn()
    return_dict[procnum] = p

if __name__ == '__main__':
    manager = multiprocessing.Manager()
    return_dict = manager.dict()
    jobs = []
    for i in range(3):
        p = multiprocessing.Process(target=trial, args=(i, return_dict))
        jobs.append(p)
        p.start()

    for proc in jobs:
        proc.join()
    print(return_dict.values())
Aziz Sonawalla
  • It's a good answer, but I recommend seeding from a _much_ larger range; for example, `random.randrange(1 << 1000)`. Current versions of Python and numpy exploit integer seeds of any size, and unintended collisions are too likely with a small range ("birthday paradox") – Tim Peters Aug 31 '20 at 18:12
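
To make that comment concrete, here is a small sketch of mine (assuming numpy >= 1.17, which added np.random.default_rng): the newer Generator API accepts integer seeds of any size, whereas the legacy np.random.seed() only takes values that fit in 32 bits.

import random

import numpy as np

seed = random.randrange(1 << 1000)   # ~1000-bit seed; collisions negligible
rng = np.random.default_rng(seed)    # default_rng/SeedSequence accept huge ints
print(rng.standard_normal())         # per-process generator, no global state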

Just adding a gloss to @Aziz Sonawalla's answer: why does this work?

Because Python's own random module works differently. On Windows, multiprocessing spawns new processes, and each one is a freshly created interpreter that does its own from-scratch seeding from OS sources of entropy.

On Linux, by default multiprocessing uses fork() to create new processes, and those inherit the entire state of the main process, in copy-on-write mode. That includes the state of the random number generator. So you would get the same random numbers across worker processes from Python too, except that, at least since Python 3.7, Python explicitly (but under the covers - invisibly) re-seeds its random number generator after fork().
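
As a side note of my own (not part of either answer), you can see this distinction by forcing the "spawn" start method on Linux: each child is then a fresh interpreter that seeds numpy from scratch too, so the original code prints distinct numbers even without manual re-seeding.

import multiprocessing

import numpy as np

def trial(procnum, return_dict):
    # No explicit seeding: under "spawn", numpy is imported anew in each
    # child and seeds its global generator from OS entropy.
    return_dict[procnum] = np.random.randn()

if __name__ == '__main__':
    ctx = multiprocessing.get_context('spawn')  # default on Windows; opt-in on Linux
    manager = ctx.Manager()
    return_dict = manager.dict()
    jobs = [ctx.Process(target=trial, args=(i, return_dict)) for i in range(3)]
    for p in jobs:
        p.start()
    for p in jobs:
        p.join()
    print(return_dict.values())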

I'm not sure when, but for some time before 3.7 the multiprocessing Process implementation also re-seeded Python's generator in child processes it created via fork() (but Python itself did not if you called fork() yourself).

All of which is just to explain why calling Python's random.randrange() returns different results in different worker processes. That's why it's an effective way to generate differing seeds for numpy to use in this context.
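
Putting the pieces together, here is a sketch of mine that combines the accepted approach with the comment's advice: each worker draws a very large seed from Python's (per-process) random module and feeds it to numpy's newer Generator API.

import multiprocessing
import random

import numpy as np

def trial(procnum, return_dict):
    # random.randrange differs per worker for the reasons above; the huge
    # range makes accidental seed collisions vanishingly unlikely.
    rng = np.random.default_rng(random.randrange(1 << 1000))
    return_dict[procnum] = rng.standard_normal()

if __name__ == '__main__':
    manager = multiprocessing.Manager()
    return_dict = manager.dict()
    jobs = []
    for i in range(3):
        p = multiprocessing.Process(target=trial, args=(i, return_dict))
        jobs.append(p)
        p.start()
    for proc in jobs:
        proc.join()
    print(return_dict.values())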

Tim Peters