27

I am studying the multiprocessing module of Python. I have two cases:

Ex. 1

import random

def Foo(nbr_iter):
    for step in xrange(int(nbr_iter)):
        print random.uniform(0, 1)
...

from multiprocessing import Pool

if __name__ == "__main__":
    ...
    pool = Pool(processes=nmr_parallel_block)
    pool.map(Foo, nbr_trial_per_process)

Ex 2. (using numpy)

import numpy as np

def Foo_np(nbr_iter):
    np.random.seed()
    print np.random.uniform(0, 1, nbr_iter)

In both cases the random number generators are seeded in their forked processes.

Why do I have to do the seeding explicitly in the numpy example, but not in the Python example?

ali_m
  • 71,714
  • 23
  • 223
  • 298
overcomer
  • 2,244
  • 3
  • 26
  • 39
  • Please explain what makes you think you *have to* – shx2 Apr 24 '15 at 18:01
  • 1
    Because if I don't, then each of the forked processes will generate an identical sequence of random numbers (only in Ex. 2) – overcomer Apr 24 '15 at 18:04
  • Whatever the reason for the different behaviour is - it isn't trivial from a quick look at the source code - numpy's behaviour is not unexpected. Reproducibility is an important feature of PRNGs, and since the PRNG was already seeded when numpy was imported, the fork()s by multiprocessing shouldn't seed it again. – Phillip Apr 24 '15 at 19:08
  • I read: "For the Python random the seeding is handled internally by multiprocessing—if during a fork it sees that random is in the namespace, then it'll force a call to seed the generators in each of the new processes. In numpy, we have to do this explicitly." But why? – overcomer Apr 24 '15 at 19:28
  • @shx2 the OS is Mac OS – overcomer Apr 24 '15 at 19:29
  • 1
    See this excellent answer to a similar, but not duplicate question: http://stackoverflow.com/a/5837352/2379433 – Mike McKerns Apr 24 '15 at 20:21
  • 1
    @overcomer - **numpy 1.17** just [introduced](https://docs.scipy.org/doc/numpy/reference/random/parallel.html?highlight=random) new options (I added an answer below) for "strategies implemented that can be used to produce repeatable pseudo-random numbers across multiple processes" – mork Jul 28 '19 at 07:41

3 Answers

31

If no seed is provided explicitly, numpy.random will seed itself using an OS-dependent source of randomness. Usually it will use /dev/urandom on Unix-based systems (or some Windows equivalent), but if this is not available for some reason then it will seed itself from the wall clock. Since self-seeding occurs at the time when a new subprocess forks, it is possible for multiple subprocesses to inherit the same seed if they forked at the same time, leading to identical random variates being produced by different subprocesses.

Often this correlates with the number of worker processes you are running. For example:

import numpy as np
import random
from multiprocessing import Pool

def Foo_np(seed=None):
    # np.random.seed(seed)
    return np.random.uniform(0, 1, 5)

pool = Pool(processes=8)
print np.array(pool.map(Foo_np, xrange(20)))

# [[ 0.14463001  0.80273208  0.5559258   0.55629762  0.78814652] <-
#  [ 0.14463001  0.80273208  0.5559258   0.55629762  0.78814652] <-
#  [ 0.14463001  0.80273208  0.5559258   0.55629762  0.78814652] <-
#  [ 0.14463001  0.80273208  0.5559258   0.55629762  0.78814652] <-
#  [ 0.14463001  0.80273208  0.5559258   0.55629762  0.78814652] <-
#  [ 0.14463001  0.80273208  0.5559258   0.55629762  0.78814652] <-
#  [ 0.14463001  0.80273208  0.5559258   0.55629762  0.78814652] <-
#  [ 0.64672339  0.99851749  0.8873984   0.42734339  0.67158796]
#  [ 0.64672339  0.99851749  0.8873984   0.42734339  0.67158796]
#  [ 0.64672339  0.99851749  0.8873984   0.42734339  0.67158796]
#  [ 0.64672339  0.99851749  0.8873984   0.42734339  0.67158796]
#  [ 0.64672339  0.99851749  0.8873984   0.42734339  0.67158796]
#  [ 0.11283279  0.28180632  0.28365286  0.51190168  0.62864241]
#  [ 0.11283279  0.28180632  0.28365286  0.51190168  0.62864241]
#  [ 0.28917586  0.40997875  0.06308188  0.71512199  0.47386047]
#  [ 0.11283279  0.28180632  0.28365286  0.51190168  0.62864241]
#  [ 0.64672339  0.99851749  0.8873984   0.42734339  0.67158796]
#  [ 0.11283279  0.28180632  0.28365286  0.51190168  0.62864241]
#  [ 0.14463001  0.80273208  0.5559258   0.55629762  0.78814652] <-
#  [ 0.11283279  0.28180632  0.28365286  0.51190168  0.62864241]]

You can see that groups of up to 8 processes forked at the same time and inherited the same seed, giving me identical random sequences (I've marked the first group with arrows).

Calling np.random.seed() within a subprocess forces the global RNG instance to seed itself again from /dev/urandom or the wall clock, which will (probably) prevent you from seeing identical output from multiple subprocesses. Best practice is to explicitly pass a different seed (or numpy.random.RandomState instance) to each subprocess, e.g.:

def Foo_np(seed=None):
    local_state = np.random.RandomState(seed)
    print local_state.uniform(0, 1, 5)

pool.map(Foo_np, range(20))

I'm not entirely sure what underlies the difference between random and numpy.random in this respect (perhaps random has slightly different rules for selecting a source of randomness to self-seed with compared to numpy.random?). I would still recommend explicitly passing a seed or a random.Random instance to each subprocess, to be on the safe side. You could also use the .jumpahead() method of random.Random (Python 2 only; it was removed in Python 3), which is designed for shuffling the states of Random instances in multithreaded programs.

ali_m
  • 71,714
  • 23
  • 223
  • 298
  • I want to share numpy random state of a parent process with a child process. I've tried using Manager but still no luck. Could you please take a look at my question [here](https://stackoverflow.com/questions/49372619/how-to-share-numpy-random-state-of-a-parent-process-with-child-processes) and see if you can offer a solution? I can still get different random numbers if I do np.random.seed(None) every time that I generate a random number, but this does not allow me to use the random state of the parent process, which is not what I want. Any help is greatly appreciated. – Amir Mar 20 '18 at 14:14
  • 1
    Yes, this is an excellent explanation and helped me very much. Thanks @overcomer, for asking the question. – max29 Dec 17 '18 at 23:18
4

numpy 1.17 just introduced [quoting] "..three strategies implemented that can be used to produce repeatable pseudo-random numbers across multiple processes (local or distributed).."

The first strategy uses a SeedSequence object. There are many parent/child options there, but for our case, if you want every process to generate the same random numbers, different at each run:

(Python 3, printing 3 random numbers from each of 4 processes)

from numpy.random import SeedSequence, default_rng
from multiprocessing import Pool

def rng_mp(rng):
    return [ rng.random() for i in range(3) ]

seed_sequence = SeedSequence()
n_proc = 4
pool = Pool(processes=n_proc)
pool.map(rng_mp, [ default_rng(seed_sequence) for i in range(n_proc) ])

# 2 different runs
[[0.2825724770857644, 0.6465318335272593, 0.4620869345284885],
 [0.2825724770857644, 0.6465318335272593, 0.4620869345284885],
 [0.2825724770857644, 0.6465318335272593, 0.4620869345284885],
 [0.2825724770857644, 0.6465318335272593, 0.4620869345284885]]

[[0.04503760429109904, 0.2137916986051025, 0.8947678672387492],
 [0.04503760429109904, 0.2137916986051025, 0.8947678672387492],
 [0.04503760429109904, 0.2137916986051025, 0.8947678672387492],
 [0.04503760429109904, 0.2137916986051025, 0.8947678672387492]]

If you want the same results across runs for reproducibility, you can simply reseed numpy with the same seed (here 17) in each process:

import numpy as np
from multiprocessing import Pool

def rng_mp(seed):
    np.random.seed(seed)
    return [ np.random.rand() for i in range(3) ]

n_proc = 4
pool = Pool(processes=n_proc)
pool.map(rng_mp, [17] * n_proc)

# same results each run:
[[0.2946650026871097, 0.5305867556052941, 0.19152078694749486],
 [0.2946650026871097, 0.5305867556052941, 0.19152078694749486],
 [0.2946650026871097, 0.5305867556052941, 0.19152078694749486],
 [0.2946650026871097, 0.5305867556052941, 0.19152078694749486]]
mork
  • 1,747
  • 21
  • 23
2

Here is a nice blog post that explains the way numpy.random works.

If you use np.random.rand(), it takes the seed created when the np.random module was imported. So you need to create a new seed in each thread manually (cf. the examples in the blog post).

The Python random module does not have this issue and automatically generates a different seed for each thread.

Maelig
  • 2,046
  • 4
  • 24
  • 49
t_sic
  • 79
  • 1