
I am implementing a Monte Carlo simulation, where I need to run multiple realisations of some dynamics and then average over the end states of all the simulations. Since the number of realisations is large, I run them in parallel using OpenMP. Every realisation starts from the same initial conditions, and at each time step a process happens with a given probability; to determine which process occurs, I draw a random number from a uniform distribution.

I want to make sure that all simulations are statistically independent and that there is no overlap in the random numbers that are being drawn.

I use OpenMP to parallelise the for loops, so the skeleton code looks like this:

vector<int> data(number_of_sims);
double t;
double r;
#pragma omp parallel for
for(int i = 0; i < number_of_sims; i++){

    // run sim
    t = 0;
    while (t < T) {
        r = draw_random_uniform();
        if (r < p) do_something();
        else do_something_else();
        t += 1.0;  // increment time
    }

    // some calculation
    data[i] = calculate();
}

So every time I want a random number, I call a function which uses a Mersenne Twister seeded with std::random_device.

double draw_random_uniform(){
   static thread_local auto seed = std::random_device{}();
   static thread_local std::mt19937 mt(seed);
   std::uniform_real_distribution<double> distribution(0.0, 1.0);
   return distribution(mt);
}

However, since I ultimately want to run this code on a high-performance computing cluster, I want to avoid using std::random_device, as it can be risky on systems with little entropy.

So instead I want to create an initial random number generator and then jump it forward a large amount for each of the threads. I have been attempting to do this with the Xoshiro256+ PRNG (I found a good implementation here: https://github.com/Reputeless/Xoshiro-cpp). Something like this, for example:

XoshiroCpp::Xoshiro256Plus prng(42);  // properly seeded prng
#pragma omp parallel
{
    static thread_local XoshiroCpp::Xoshiro256Plus lprng(prng);  // thread local copy
    lprng.longJump();  // jump ahead

    // code as before, except use lprng to generate random numbers
    #pragma omp for
    ....
}

However, I cannot get such an implementation to work. I suspect it is because of the combination of the two OpenMP constructs. I have had the thought of pre-generating all of the PRNGs and storing them in a container, then accessing the relevant one by using omp_get_thread_num() inside the parallelised for loop.
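Roughly, that idea would look like the sketch below (untested; I am assuming the thread count is available via omp_get_max_threads() before the parallel region starts):

#include <omp.h>
#include <vector>
#include "XoshiroCpp.hpp"

int n_threads = omp_get_max_threads();
std::vector<XoshiroCpp::Xoshiro256Plus> prngs;
prngs.reserve(n_threads);

XoshiroCpp::Xoshiro256Plus prng(42);  // master generator
for (int n = 0; n < n_threads; n++) {
    prngs.push_back(prng);  // copy the current state for thread n
    prng.longJump();        // advance the master so the next copy is far away in the sequence
}

#pragma omp parallel for
for (int i = 0; i < number_of_sims; i++) {
    auto& lprng = prngs[omp_get_thread_num()];
    // run sim as before, drawing random numbers from lprng
}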

I am unsure if this is the best way to go about doing all this. Any advice is appreciated.

CKitching
  • You have a data race in your code: `r` and `t` should be private. Always declare your variables in the minimum scope required. `def random():` is not C++. In your last section of code, `# pragma omp parallel for` should be `# pragma omp for`. – Laci Jan 30 '23 at 17:24
  • This seems pretty simple. You have a centralized source of randomness seeded by the random device. When you start one of these processes, it gets its own RNG that is seeded by the central source. There's no need for any of them to "jump ahead", since you already seeded the central source. If the central source is a good one, then the seed drawn from it will also be good. And yes, the central source will need to be properly mutex-guarded. – Nicol Bolas Jan 30 '23 at 18:46
  • In your last code block the first generator should read `prng`, not `lprng` for the code to make sense. – paleonix Jan 30 '23 at 21:18
  • I would imagine that you would want the result to be deterministic for a given global seed. That would make debugging much easier. – paleonix Jan 30 '23 at 21:22
  • Thanks for these comments, I think I have corrected any basic issues in the code. So will seeding a central source with random device and then seeding thread local copies with the source essentially guarantee I won't be reusing sequences of random numbers? Also what is meant by 'mutex-guard', is this just something that prevents the central PRNG being accessed by multiple threads simultaneously? @NicolBolas – CKitching Jan 31 '23 at 11:47
  • There is still the nested `parallel` in the last code sample that @Laci mentioned. Yes, the mutex could be implemented using an OpenMP (named) `critical` section around the parallel accesses to the global `prng`. – paleonix Jan 31 '23 at 12:30
  • I'm not quite sure about the safety of seeding the locals with sequential iterations of the global PRNG. You certainly do not want them to be "copies" of the global PRNG at "close" iterations. If the PRNG supports **cheap** jump-ahead I don't see a reason not to use that to guarantee that they are far apart in iteration space. I think some modern PRNGs (e.g. PCG) also support multiple "streams" of output. Using jump-ahead or streams should also make it easier to get the same deterministic result independent of the number of threads. – paleonix Jan 31 '23 at 12:54
  • *I think I have corrected any basic issues in the code.* Well, unfortunately, none of the problems I mentioned have been solved. `r` and `t` are still shared and there are race conditions. `int random():` is still too pythonic. `#pragma omp parallel` has not changed. – Laci Jan 31 '23 at 12:56
  • I think you vastly overestimate the risk of the random device running out of entropy (these issues theoretically exist, but they are relevant for cryptography, not for MC simulations). But independently of this, I think using the random device to seed parallel RNGs for MC has issues. If nothing else, it makes it categorically impossible to reproduce the results. It's safer to derive a thread-local seed from one global meta-seed. – Konrad Rudolph Jan 31 '23 at 16:17

2 Answers


Coordinating random number generators with long jumps can be tricky. Alternatively, there is a much simpler method.

Here is a quote from the author's website:

It is however important that the period is long enough. Moreover, if you run n independent computations starting at random seeds, the sequences used by each computation should not overlap.

Now, given a generator with period P, the probability that subsequences of length L starting at random points in the state space overlap is bounded by n²L/P. If your generator has period 2^256 and you run on 2^64 cores (you will never have them) a computation using 2^64 pseudorandom numbers (you will never have the time) the probability of overlap would be less than 2^-64.

So instead of trying to coordinate the generators, you could simply seed a new generator from std::random_device in each thread. The period is so large that the sequences will not collide.

While this may sound like a very ad-hoc approach, random seeding is actually a widely used and classic method.

You just need to make sure the seeds are different. Depending on the platform, different ways of obtaining distinct seeds are commonly proposed:

  • Using a truly random source
  • Having an atomic int that is incremented, combined with some hashing (see the sketch after this list)
  • Using another pseudo random number generator to generate a seed sequence
  • Using a combination of thread id and time to create a seed
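For example, the atomic-counter idea from the list above could look like the sketch below. The hash here is splitmix64, a mixer commonly used to turn sequential integers into well-spread 64-bit seeds; the exact choice of hash is an assumption on my part, any good integer mixer works:

#include <atomic>
#include <cstdint>

// splitmix64 finalizer: turns consecutive counter values into
// well-distributed 64-bit seeds
std::uint64_t splitmix64(std::uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

// every call returns a fresh, distinct seed; safe to call from any thread
std::uint64_t next_seed() {
    static std::atomic<std::uint64_t> counter{0};
    return splitmix64(counter.fetch_add(1));
}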

If repeatability is not needed, seeding from a truly random source is the easiest and safest solution.

The 2017 paper by L'Ecuyer et al. gives a good overview of methods for generating parallel streams. It calls this approach "RNG with a 'random' seed for each stream" in chapter 4.

std::vector<int> data(number_of_sims);
#pragma omp parallel for
for(int i = 0; i < number_of_sims; i++){
    // random 128 bit seed, drawn fresh for every simulation
    auto rd = std::random_device{};
    auto seed = std::seed_seq{rd(), rd(), rd(), rd()};
    auto mt = std::mt19937{seed};

    // run sim; t and r are declared inside the loop so each thread has its own copy
    double t = 0;
    while (t < T) {
        double r = draw_random_uniform(mt);
        if (r < p) do_something();
        else do_something_else();
        t += 1.0;  // increment time
    }

    // some calculation
    data[i] = calculate();
}

and

double draw_random_uniform(std::mt19937 &mt){
   std::uniform_real_distribution<double> distribution(0.0, 1.0);
   return distribution(mt);
}

If number_of_sims is not extremely large, there is no need for static or thread_local initialization.

Chris
  • Sadly `std::random_device` is somewhat problematic as it doesn't guarantee to not just use a PRNG with fixed seed. See e.g. [Why do I get the same sequence for every run with std::random_device with mingw gcc4.8.1?](https://stackoverflow.com/q/18880654/10107454). – paleonix Jan 31 '23 at 15:17
  • And the xoshiro site linked by you also says: "All generators, being based on linear recurrences, provide jump functions that make it possible to simulate any number of calls to the next-state function in constant time, once a suitable jump polynomial has been computed. We provide ready-made jump functions for a number of calls equal to the square root of the period, to make it easy generating non-overlapping sequences for parallel computations, and equal to the cube of the fourth root of the period, to make it possible to generate independent sequences on different parallel processors." – paleonix Jan 31 '23 at 15:22
  • To avoid race conditions, I suggest using the `private(t,r)` clause or declaring them in the parallel region. – Laci Jan 31 '23 at 17:54

You should read "Parallel Random Numbers: As Easy as 1, 2, 3" (http://www.thesalmons.org/john/random123/papers/random123sc11.pdf). This paper explicitly addresses your forward-stepping issue: its counter-based generators compute the n-th random number directly from a key and a counter, so there is no state to jump. You can now find implementations of these generators in maths libraries (such as Intel's MKL, which uses the specialized encryption instructions, so it will be hard to beat by hand!).
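The core idea is that the n-th random number of stream k is a pure function of (k, n), so "jumping ahead" is just choosing a different counter value, and no generator state has to be coordinated between threads. A toy illustration of that structure (this is not the real Philox round function from the paper; a splitmix64-style mixer stands in for it):

#include <cstdint>

// Toy counter-based generator: output = mix(key, counter).
// The real generators from the paper (Philox, Threefry, ARS) use much
// stronger mixing, but the calling structure is the same.
std::uint64_t cbrng(std::uint64_t key, std::uint64_t counter) {
    std::uint64_t x = key ^ (counter + 0x9e3779b97f4a7c15ULL);
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

// Simulation i can use key = i and draw its j-th number as cbrng(i, j):
// the streams never overlap and there is no state to store or jump.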

Jim Cownie