
We can fill a range [first, last) using

std::mt19937 g;
std::uniform_real_distribution<> u;
std::generate(first, last, [&]() { return u(g); });

Theoretically, it would be more performant to execute std::generate with the execution policy std::execution::par_unseq. So we could write:

std::generate(std::execution::par_unseq, first, last, [&]() { return u(g); });

But is this really safe? I think parallel access to g might be problematic. If that's true, how can we fix it? I've seen some strange-looking code like this:

std::generate(std::execution::par_unseq, first, last, []()
{
    thread_local std::mt19937 g;
    thread_local std::uniform_real_distribution<> u;
    return u(g);
});

But is thread_local really sensible here? And how should g be seeded here if the generated samples are supposed to be independent?

0xbadf00d
  • Might be relevant: [How do I generate thread-safe uniform random numbers?](https://stackoverflow.com/questions/21237905/how-do-i-generate-thread-safe-uniform-random-numbers). – wohlstad Aug 23 '23 at 13:03
  • This should be enough: `thread_local std::mt19937 g{std::random_device{}()};` – Marek R Aug 23 '23 at 13:21

2 Answers

It's gotta be thread_local, or else use a mutex!

But is thread_local really sensible here?

Unless you are willing to guard your RNG with a mutex, you must have a separate RNG for each thread. So, yes, thread_local is sensible.
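For reference, the mutex alternative might look like the sketch below (locked_sample is a hypothetical helper name, not a standard facility). One caveat worth knowing: blocking synchronization such as a mutex is only permitted under std::execution::par; the par_unseq policy forbids vectorization-unsafe operations like locking, which is one more reason thread_local wins in the question's scenario.

```cpp
#include <mutex>
#include <random>

// Sketch of the mutex alternative: one shared engine and distribution,
// with every draw serialized through a lock.
double locked_sample(std::mt19937& g,
                     std::uniform_real_distribution<>& u,
                     std::mutex& m) {
    std::lock_guard<std::mutex> lock(m);  // serialize access to g and u
    return u(g);  // uniform_real_distribution<> defaults to [0, 1)
}
```

It is easy to see this is correct, but the lock serializes every draw, so with many threads it tends to defeat the purpose of parallelizing in the first place.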

When you have a large number of threads, however, this can be a problem, because std::mt19937 has a relatively large footprint. In that case, I suggest switching to something like PCG by Melissa O'Neill.

If your threads are not long-lived, std::mt19937 may present a performance problem due to the long time required to seed its large internal state. Once again, an easy solution would be to switch to something like pcg64, by Melissa O'Neill. It can be seeded quickly, with just four calls to std::random_device.

seed_randomly, described below, can do the work for you.

Manage a pool of random engines and distributions

I say above that there are only two choices: use thread_local, or else use a mutex. There is, however, a third alternative: the one presented here in the answer by @Marco Bonelli. That is to create (and manage) a pool of random number engines from which an RNG can be issued to a thread when needed.

If you are using one of the random number distributions from the Standard Library, such as uniform_int_distribution, you also need to make sure you are not sharing an instantiation of it across threads. Many (almost all?) of the distributions have internal state, so you need to protect yourself when several different threads are contending for access.

Seeding

And how should g be seeded here if the generated samples are supposed to be independent?

std::mt19937 and std::mt19937_64 both have 19,968 bits of state, so trying to seed them with a single seed of only 32 or 64 bits is problematic.

So long as std::random_device is a good source of entropy on your system, a good way to thoroughly seed std::mt19937 is to use std::random_device to fill all 624 state variables of std::mt19937.
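A minimal sketch of that idea, using the portable std::seed_seq route (make_fully_seeded_engine is a hypothetical helper name; note that seed_seq scrambles its input rather than copying it verbatim into the state, but it still delivers the full 624 words of entropy to the engine's seeding procedure):

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <functional>
#include <random>

// Draw 624 words from std::random_device and feed them through
// std::seed_seq, so the engine's entire state is seeded from entropy.
std::mt19937 make_fully_seeded_engine() {
    std::array<std::uint32_t, std::mt19937::state_size> entropy{};  // 624 words
    std::random_device rd;
    std::generate(entropy.begin(), entropy.end(), std::ref(rd));
    std::seed_seq seq(entropy.begin(), entropy.end());
    return std::mt19937{seq};
}
```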

I posted a small, single-file header in a repository named seed_randomly on GitHub that uses std::random_device to seed all 19,968 bits of a mersenne twister engine. It was created with the help of some of the smart guys here on StackOverflow. See this StackOverflow answer.

Before using this header, you should satisfy yourself that std::random_device is a good source of entropy on your system. Sometimes, it is not.

Microsoft Visual C++, for instance, generates "non-deterministic and cryptographically secure" values, and never blocks, which is excellent. Prior to version 9.2, however, MinGW distributions of GCC used std::mt19937 with a fixed seed! Those systems generated the same sequence every time. (Newer versions purport to have fixed the problem, but I have not checked.) Unix-like systems often use /dev/random (which can block) or /dev/urandom. Both have their advantages.

Overlapping sequences can produce correlated results

The random seeding described above hits all 624 state variables of std::mt19937, so the odds that two such generators produce overlapping sequences are infinitesimal. They are not zero, however! Given the large state space of std::mt19937, it probably is not a real concern.

For rigorous scientific and academic work, that may not be good enough. In those arenas, you may be required to prove that the RNG sequences do not overlap. In that case, you should investigate the practicality of calling member function discard to jump ahead by a different multiple of some predetermined large amount on each thread.

With PCG, that jump is fast: logarithmic in the jump distance, so effectively constant time. The same is possible for a linear congruential generator, but there is no guarantee that a specific Standard Library implementation has done so. I do not know the status of std::mt19937. My recollection is just vague enough that I cannot state it here.
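The discard-based partitioning could be sketched like this (make_partitioned_engines is a hypothetical helper name). Every engine starts from the same seed, and engine i jumps ahead by i * block_size draws; as long as each worker consumes fewer than block_size values, the per-thread subsequences cannot overlap. Be aware that std::mt19937's discard may be linear in the jump distance, so very large strides can be expensive.

```cpp
#include <cstdint>
#include <random>
#include <vector>

// One engine per thread, all from the same seed, each jumped ahead
// by a different multiple of block_size to get disjoint subsequences.
std::vector<std::mt19937> make_partitioned_engines(
        std::uint32_t seed, int n_threads, unsigned long long block_size) {
    std::vector<std::mt19937> engines;
    for (int i = 0; i < n_threads; ++i) {
        std::mt19937 g{seed};
        g.discard(i * block_size);  // engine i skips the first i blocks
        engines.push_back(g);
    }
    return engines;
}
```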

By the way, std::mt19937 has mathematically proven statistical properties that make it suitable for certain rigorous applications where PCG cannot be used. That is because the randomness of PCG has not been proven a priori, using mathematics. Instead, long sequences produced by PCG are tested empirically, using programs such as TestU01. The situation is a little bit ironic, because pcg64 outperforms std::mt19937 in some of the empirical tests.

tbxfreeware

Disclaimer

Note that documentation for std::execution::par_unseq states that "execution may be parallelized". So par_unseq does not necessarily parallelize anything and acts according to the underlying implementation (i.e., its behavior will be implementation specific). In fact, the GCC on my system (10.2.1) does not seem to have a parallelized implementation of par_unseq and only uses one thread.

Given the above, and given that you say you want to perform the sampling multiple times, it's really just better/simpler to implement the parallel sampling yourself with OpenMP or similar alternatives.

Manual implementation with OpenMP

Note: compile with -fopenmp.

#include <random>
#include <vector>
#include <omp.h>

static void sample(std::vector<double> &out) {
    static std::vector<std::mt19937> generators;

    #pragma omp parallel
    {
        // Initialize generators *only once* and *using a single thread*
        #pragma omp single
        {
            if (generators.empty()) {
                for (int i = 0; i < omp_get_num_threads(); i++) {
                    // Alternatively use another seed here as you wish
                    generators.push_back(std::mt19937{std::random_device{}()});
                }
            }
        }

        // Get the cached generator for this thread
        std::mt19937 &g = generators[omp_get_thread_num()];
        std::uniform_real_distribution<> u;

        // Do the parallel sampling
        #pragma omp for
        for (std::size_t i = 0; i < out.size(); i++)
            out[i] = u(g);
    }
}

int main(void) {
    std::vector<double> vec(1000);

    for (unsigned i = 0; i < 10; i++)
        sample(vec);
}

Original answer(s)

I think, parallel access to g might be problematic

Yep, exactly, which is what the thread_local declarations in your second code snippet are trying to avoid.

is thread_local really sensible here?

Yes, thread_local storage duration is critical in this case, from both a performance and a thread-safety point of view. You could theoretically omit thread_local and create local variables on each invocation, but as you may know, generator initialization is a heavy operation that really only needs to be done once. A thread_local definition is a nice and simple way to address both thread safety and one-time initialization at once, as each thread-local object will only be instantiated once per thread and will live as long as the thread.

how should g be seeded here if the generated samples are supposed to be independent?

std::mt19937 should already solve this problem by itself provided that you seed all the generators differently.

The only problem I can see in the second code snippet is that g is always seeded (by the default constructor) with the same default seed, so you will end up with a bunch of copies of the same values (one copy per thread). You should use different seeds, and a good way of doing that could be using a random seed for each thread, as Marek R suggests in the comments above:

thread_local std::mt19937 g{std::random_device{}()};

Addressing further concerns:

I'm losing the opportunity to seed the generator with a "good chosen" seed from a bootstrapping step

This shouldn't be a problem, if you are able to write a function that returns the correct seed on each invocation, you can use it as follows:

#include <atomic>

// Just as an example; atomic, so concurrent threads never get the same seed
unsigned get_seed(void) {
    static std::atomic<unsigned> cur{1234};
    return cur++;
}

std::generate(std::execution::par_unseq, first, last, []()
{
    thread_local std::mt19937 g{get_seed()};
    // ...
});

I continuously need to sample such a random vector for a monte carlo integration

Ok, this is a different story then, because in this case the above code would re-instantiate g for each thread that is spawned according to par_unseq every time you re-sample the vector. std::generate with a std::execution policy is not really a good shortcut in your case. If you really want to stick to it, you will have to drop the thread_local and create a global static storage (such as a map) to cache generators between different sampling runs and retrieve them from within the lambda passed to std::generate().
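A minimal sketch of that map-based cache, assuming one engine per thread id (engine_for_this_thread is a hypothetical helper name): the mutex only guards the map itself; each thread then draws from its own engine without contention, and references into a std::map remain valid when other entries are inserted later.

```cpp
#include <map>
#include <mutex>
#include <random>
#include <thread>

// Global static storage: one cached engine per thread id, reused
// across repeated sampling runs.
std::mt19937& engine_for_this_thread() {
    static std::map<std::thread::id, std::mt19937> cache;
    static std::mutex m;
    std::lock_guard<std::mutex> lock(m);
    // try_emplace seeds a fresh engine only on this thread's first call
    auto [it, inserted] = cache.try_emplace(std::this_thread::get_id(),
                                            std::random_device{}());
    return it->second;
}
```

Keep in mind that taking a lock inside the lambda is only permitted under std::execution::par; the par_unseq policy forbids blocking synchronization, so this approach would require switching policies.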

Writing a simple manual implementation of the parallel sampling using OpenMP would be much easier to manage in your case. See above for the code.

Marco Bonelli
  • Thank you for your answer. I was wondering whether initializing the generator with a new seed in each thread is really a good idea. First of all, I'm losing the opportunity to seed the generator with a "good chosen" seed from a bootstrapping step. Besides, please imagine that I'm executing the code snippet to generate a random vector and I continuously need to sample such a random vector for a monte carlo integration. Is it in any sense problematic to use `thread_local std::mt19937 g{std::random_device{}()};` in that scenario? – 0xbadf00d Aug 23 '23 at 14:17
  • I also wonder if it might even be slower, since multiple generators need to be initialized. – 0xbadf00d Aug 23 '23 at 14:17
  • `std::mt19937 should already solve this problem by itself provided that you seed all the generators differently.` This is wrong [doc says](https://en.cppreference.com/w/cpp/numeric/random/mersenne_twister_engine/mersenne_twister_engine) `Default constructor. Seeds the engine with default_seed.`. – Marek R Aug 23 '23 at 15:01
  • @MarekR yeah I am talking about the independence of the values produced by a single instance of `mt19937` there. Maybe I worded that in an unclear way. I already stated that the default seed should be changed. – Marco Bonelli Aug 23 '23 at 15:09
  • @0xbadf00d see my updated answer. I believe in your case you are better off avoiding `std::generate` altogether. See the disclaimer I added at the top and the code example. – Marco Bonelli Aug 23 '23 at 15:10