
I am running a Monte Carlo simulation in parallel using joblib. I noticed that although my seeds were fixed, my results kept changing. However, when I ran the process in series, they stayed constant, as I expected.

Below I implement a small example that estimates the mean of a normal distribution with mean 0 and standard deviation 2.

Load libraries and define the function:

import numpy as np
from joblib import Parallel, delayed

def _estimate_mean():
    np.random.seed(0)  # seed NumPy's global PRNG
    x = np.random.normal(0, 2, size=100)
    return np.mean(x)

The first example I run in series; the results are all the same, as expected.

tst = [_estimate_mean() for i in range(8)]
In [28]: tst
Out[28]:
[0.11961603106897,
 0.11961603106897,
 0.11961603106897,
 0.11961603106897,
 0.11961603106897,
 0.11961603106897,
 0.11961603106897,
 0.11961603106897]

The second example I run in parallel. (Note: sometimes the means are all the same; other times they are not.)

tst = Parallel(n_jobs=-1, backend="threading")(delayed(_estimate_mean)() for i in range(8))

In [26]: tst
Out[26]:
[0.11961603106897,
 0.11961603106897,
 0.11961603106897,
 0.11961603106897,
 0.11961603106897,
 0.1640259414956747,
 -0.11846452111932627,
 -0.3935934130918206]

I expected the parallel results to be the same, since the seed is fixed. I found that if I use RandomState to fix the seed, the problem seems to be resolved:

def _estimate_mean():
    local_state = np.random.RandomState(0)  # a PRNG instance local to this call
    x = local_state.normal(0, 2, size=100)
    return np.mean(x)

tst = Parallel(n_jobs=-1, backend="threading")(delayed(_estimate_mean)() for i in range(8))

In [28]: tst
Out[28]:
[0.11961603106897,
 0.11961603106897,
 0.11961603106897,
 0.11961603106897,
 0.11961603106897,
 0.11961603106897,
 0.11961603106897,
 0.11961603106897]

What is the difference between using RandomState and just seed when fixing seeds with numpy.random, and why does the latter not work reliably when running in parallel?

System Information

OS: Windows 10

Python: 3.7.3 (default, Apr 24 2019, 15:29:51) [MSC v.1915 64 bit (AMD64)]

Numpy: 1.17.2

    There's a good discussion: https://stackoverflow.com/questions/5836335/consistently-create-same-random-numpy-array/5837352#5837352 , do check the comments of the answer, really helpful. And another answer: https://stackoverflow.com/questions/37224116/difference-between-randomstate-and-seed-in-numpy – Hongpei Nov 29 '19 at 13:49
  • It could be that the seed isn't being reset for each thread – PyRsquared Nov 29 '19 at 14:08

1 Answer


The result you're getting with numpy.random.* is caused by a race condition: numpy.random.* uses a single global PRNG that is shared across all threads without synchronization. Because the threads run concurrently, they race to read and update that shared state, so one thread's seed can be overwritten by another before it has drawn its samples. Giving each thread its own PRNG (a RandomState instance) solves the problem because no state is shared between threads anymore.
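
To make this concrete, here is a minimal sketch (my variation on the fix in the question, with an added rng parameter that is not in the original code) that hands each task its own RandomState instead of creating one inside the function, so there is no PRNG state for the threads to fight over:

import numpy as np
from joblib import Parallel, delayed

# Each task receives a private PRNG instance, so there is nothing to race on.
def _estimate_mean(rng):
    x = rng.normal(0, 2, size=100)
    return np.mean(x)

# All eight PRNGs are seeded with 0, so every result is identical and
# reproducible regardless of how the threads are scheduled.
rngs = [np.random.RandomState(0) for _ in range(8)]
tst = Parallel(n_jobs=-1, backend="threading")(
    delayed(_estimate_mean)(rng) for rng in rngs
)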


Since you're using NumPy 1.17, you should know that there is a better alternative: NumPy 1.17 introduces a new random number generation system, built from bit generators, such as PCG64, and random generators, such as the new numpy.random.Generator.

It is the result of a proposal to change the RNG policy, which recommends that new code avoid the numpy.random.* functions, not least because they operate on hidden global state.
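
Here is a sketch of the question's example rewritten for the new system (the seeding scheme is my choice, not something the question prescribes; default_rng, Generator, and SeedSequence are all part of the NumPy 1.17 API):

import numpy as np
from joblib import Parallel, delayed

def _estimate_mean(rng):
    # rng is a numpy.random.Generator, the new-style PRNG.
    x = rng.normal(0, 2, size=100)
    return np.mean(x)

# To reproduce the question's behavior (every task identical), give each
# task a Generator seeded with the same value.
rngs = [np.random.default_rng(0) for _ in range(8)]

# For a real Monte Carlo run you usually want independent streams instead:
# SeedSequence.spawn derives non-overlapping child seeds from one parent.
# children = np.random.SeedSequence(0).spawn(8)
# rngs = [np.random.default_rng(child) for child in children]

tst = Parallel(n_jobs=-1, backend="threading")(
    delayed(_estimate_mean)(rng) for rng in rngs
)

Either way, each thread owns its Generator outright, so there is no shared global state left to race on.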

The NumPy documentation now has detailed information on parallel random number generation in the new RNG system. See also "Seed Generation for Noncryptographic PRNGs", an article of mine with general advice on RNG selection.

Peter O.