Mersenne twister warm up vs. reproducibility

Question

In my current C++11 project I need to perform M simulations. For each simulation m = 1, ..., M, I randomly generate a data set by using a std::mt19937 object, constructed as follows:

std::mt19937 generator(m);
DatasetFactory dsf(generator);

According to https://stackoverflow.com/a/15509942/1849221 and https://stackoverflow.com/a/14924350/1849221, the Mersenne Twister PRNG benefits from a warm up phase, which is currently absent in my code. I report for convenience the proposed snippet of code:

#include <random>

std::mt19937 get_prng() {
    std::uint_least32_t seed_data[std::mt19937::state_size];
    std::random_device r;
    std::generate_n(seed_data, std::mt19937::state_size, std::ref(r));
    std::seed_seq q(std::begin(seed_data), std::end(seed_data));
    return std::mt19937{q};
}

The problem in my case is that I need reproducibility of results, i.e., among different executions, for each simulation, the data set has to be the same. That's the reason why in my current solution I use the current simulation to seed the Mersenne Twister PRNG. It seems to me that the usage of std::random_device prevents data from being the same (AFAIK, this is the exact purpose of std::random_device).

EDIT: by different executions I mean re-launching the executable.

How can I introduce the afore-mentioned warm up phase in my code without affecting reproducibility? Thanks.

Possible solution #1

Here's a tentative implementation based on the second proposal by @SteveJessop

#include <random>

std::mt19937 get_generator(unsigned int seed) {
        std::minstd_rand0 lc_generator(seed);
        std::uint_least32_t seed_data[std::mt19937::state_size];

        std::generate_n(seed_data, std::mt19937::state_size, std::ref(lc_generator));
        std::seed_seq q(std::begin(seed_data), std::end(seed_data));
        return std::mt19937{q};
    }

Possible solution #2

Here's a tentative implementation based on the joint contribution by @SteveJassop and @AndréNeve. The sha256 function is adapted from https://stackoverflow.com/a/10632725/1849221

#include <openssl/sha.h>
#include <sstream>
#include <iomanip>
#include <random>

 std::string sha256(const std::string str) {
    unsigned char hash[SHA256_DIGEST_LENGTH];
    SHA256_CTX sha256;
    SHA256_Init(&sha256);
    SHA256_Update(&sha256, str.c_str(), str.size());
    SHA256_Final(hash, &sha256);

    std::stringstream ss;
    for(int i = 0; i < SHA256_DIGEST_LENGTH; i++) 
        ss << std::hex << std::setw(2) << std::setfill('0') << (int)hash[i];

    return ss.str();
}

std::mt19937 get_generator(unsigned int seed) {
    std::string seed_str = sha256(std::to_string(seed));
    std::seed_seq q(seed_str.begin(), seed_str.end());
    return std::mt19937{q};
}

Compile with: -I/opt/ssl/include/ -L/opt/ssl/lib/ -lcrypto

Do you mean that the quality of the pseudo-random sequence will improve as you ask for new data? My objective is to explicitly take into account `std::mt19937::state_size` in the initialization phase, while retaining reproducibility. — Ilio Catallo, Apr 18 '13 at 09:16
All random number generators have a member function `discard(n)` to advance the internal state *as-if* calling `operator()` `n`-times. — Xeo, Apr 18 '13 at 09:23
Does the `discard(n)` operation achieve the same result of using a `std::seed_seq` as large as the `std::mt19937::state_size` to seed the PRNG? What is an appropriate `n` to be used? — Ilio Catallo, Apr 18 '13 at 09:28
In "possible 2", `std::hash` isn't good enough. The problem with MT that you're trying to overcome is that it needs a lot of non-zero bits of seed data, otherwise its internal state is mostly 0 and it outputs bad data. `std::hash` isn't the right kind of hash to solve that. At best it still only provides 64 bits of seed data, and it's worse than that since it's quite likely the identity operation. If you used for example the SHA256 hash of `m` then you might be in business. — Steve Jessop, Apr 19 '13 at 12:31
You are right, In case of integer values `std::hash` is probably implemented as the identity function. I will try to revise solution #2 so as to use `openssl/sha.h` — Ilio Catallo, Apr 19 '13 at 12:58
I know I'm late, but I finally had the chance to correct the 2nd solution — Ilio Catallo, May 08 '13 at 19:38

Steve Jessop · Accepted Answer · 2013-04-18T11:14:58.510

5

Two options:

Follow the proposal you have, but instead of using std::random_device r; to generate your seed sequence for MT, use a different PRNG seeded with m. Choose one that doesn't suffer like MT does from needing a warmup when used with small seed data: I suspect an LCG will probably do. For massive overkill, you could even use a PRNG based on a secure hash. This is a lot like "key stretching" in cryptography, if you've heard of that. You could in fact use a standard key stretching algorithm, but you're using it to generate a long seed sequence rather than large key material.
Continue using m to seed your MT, but discard a large constant amount of data before starting the simulation. That is to say, ignore the advice to use a strong seed and instead run the MT long enough for it to reach a decent internal state. I don't know off-hand how much data you need to discard, but I expect the internet does.

edited Apr 18 '13 at 11:14

answered Apr 18 '13 at 11:02

Steve Jessop

273,490
39
460
699

1

The problem is that he is using `std::random_device` to seed the Mersenne Twister on _every_ run, which will (essentially) never have reproducible effects. He can indeed use another PNRG that does not need a warmup phase but he still cannot generate a new seed on every run. He must generate the seed once and store it somewhere between runs to achieve reproducibility. It does not need to be a strong seed, obviously. – André Neves Apr 18 '13 at 11:08
@AndréNeves: I don't think the questioner wants a new seed on every run. He wants `M` distinct seeds that (a) are the same on every run; but (b) unlike the values `1 ... M` that he's currently using, are not weak seeds for MT. The question states that he is *not* currently using the `random_device` code. He is currently seeding a MT with a small integer, and the `random_device` code is a proposal that he found on a different SO question, that isn't suitable for him. – Steve Jessop Apr 18 '13 at 11:10
1

Ok, I thought he was considering using the above snippet. Anyways, I do understand he wants to have `M` sets of seeds and if he wants to use decent seeds he can use the `random_device` without any problem, he simply has to store it for each dataset. One "cheap" way I remember for not using a strong seed and not havng to store it but still be relatively random is to use a hash of the dataset (or some part) as a seed for the MT (or other PNRG). This way he wouldn't need to store it as it would be deterministic for each dataset. – André Neves Apr 18 '13 at 11:16
1

@AndréNeves: Agreed, he could store true random data. For that matter he wouldn't need a file/db/whatever, he could generate them once and compile them in (resource file, convert to hex and paste into the source, whatever). Taking a single hash of `m` is a degenerate case of key stretching -- not good enough for cryptography but good enough here provided that the size of the hash isn't too small compared with `std::mt19937::state_size`. If it is too small, then another improvised key stretching algorithm would be to concatenate the hashes of `m`, `m+M`, `m+2M`, ... until you have enough data. – Steve Jessop Apr 18 '13 at 11:21
Yes, he could store the seeds as a resource file or directly in the code, but it would be parhaps too static and even hardcoded. Anyways, I guess the hash as a seed is a good solution, and the only possible problem is the length of it, as you mention, but which is easily circumvented. – André Neves Apr 18 '13 at 11:30
Firstly, thank you for discussion so much about my problem. Regarding your suggestion, the data set will be generated starting from the PRNG. If I understand correctly your idea, you are proposing to seed the PRNG by computing some kind of hash on the data set. The point is that I can't generate the data set until I obtain a PRNG. – Ilio Catallo Apr 19 '13 at 07:39
@burton0: in your case, the "data set" André mentions that the seed is computed from is the number `m`. Then of course you use that seed to generate the thing that the question calls the "data set", that you're going to run your simulation on. – Steve Jessop Apr 19 '13 at 08:32
I edited the original question so as to accomodate your proposal. There's still a warning about the implicit casting from `size_t` to `unsigned int` – Ilio Catallo Apr 19 '13 at 12:14

André Neves · Answer 2 · 2013-04-18T11:37:39.883

I think that you only need to store the initial seed (in your case the std::uint_least32_t seed_data[std::mt19937::state_size] array) and the number n of warmup steps you made (eg. using discard(n) as mentioned) for each run/simulation you wish to reproduce.

With this information, you can always create a new MT instance, seed it with the previous seed_data and run it for the same n warmup steps. This will generate the same sequence of values onwards since the MT instance will have the same inner state when the warmup ends.

When you mention the std::random_device affecting reproducibility, I believe that in your code it is simply being used to generate the seed data. If you were using it as the source of random numbers itself, then you would not be able to have reproducible results. Since you are using it only to generate the seed there shouldn't be any problem. You just can't generate a new seed every time if you want to reproduce values!

From the definition of std::random_device:

"std::random_device is a uniformly-distributed integer random number generator that produces non-deterministic random numbers."

So if it's not deterministic you cannot reproduce the sequence of values produced by it. That being said, use it simply to generate good random seeds only to store them afterwards for the re-runs.

Hope this helps

EDIT :

After discussing with @SteveJessop, we arrived at the conclusion that a simple hash of the dataset (or part of it) would be sufficient to be used as a decent seed for the purpose you need. This allows for a deterministic way of generating the same seeds every time you run your simulations. As mentioned by @Steve, you will have to guarantee that the size of the hash isn't too small compared with std::mt19937::state_size. If it is too small, then you can concatenate the hashes of m, m+M, m+2M, ... until you have enough data, as he suggested.

I am posting the updated answer here as the idea of using a hash was mine, but I will upvote @SteveJessop's answer because he contributed to it.

I need to create `M` different data sets, which will be used in `M` different simulations. I need the data sets to be independent, i.e., each data set is generated using a PRNG whose seed is unique for the specific simulation. My current solution achieves uniqueness by using a sequence of seeds from `1` to `M`. It seems that you are suggesting to create at the beggining of the code a _master_ seed sequence `q` so that the resulting `std::mt19937` will be modified in each simulation by using `discard(n)`, where `n` is different for each simulation. Right? — Ilio Catallo, Apr 18 '13 at 10:47
You need to create a _master_ seed sequence (`seed_data` array) and a _master_ `n` for **each** simulation you want to reproduce. So, if I understood correctly you will have `M` different _master_ `seed_data`and `M` `n`'s, one for each simulation. For each simulation, you will then call `seed(seed_data)` and then `discard(n)`. You could create arrays of size `M` for your `M` sets of parameters and then update your `std::mt19937` instance using `seed(seed_data_array[i])` and then `discard(n_array[i])`, where i is the dataset identifier (0..M-1). — André Neves, Apr 18 '13 at 10:56
So the problem is where to take the `M` different master seed sequences and `n`s. I edited the original question to stress that by _different executions_ I mean re-launching the executable. — Ilio Catallo, Apr 18 '13 at 11:00
You will have to store the `M` master seed sequences and `n`s in a persistent way (e.g. file, database) so that you can load them later when re-launching the executable. So you could have a simple argument that when true runs the simulation and _generates_ the master seed data, and when false simply _reads_ the master seed data and effectively reproduces the simulations. — André Neves, Apr 18 '13 at 11:03

score 1 · Answer 3 · answered Apr 18 '13 at 19:25

A comment on one of the answers you link to indicates:

Coincidentally, the default C++11 seed_seq is the Mersenne Twister warmup sequence (although the existing implementations, libc++'s mt19937 for example, use a simpler warmup when a single-value seed is provided)

So you may be able to use your current fixed seeds with std::seed_seq to do the warm-up for you.

std::mt19937 get_prng(int seed) {
    std::seed_seq q{seed, maybe, some, extra, fixed, values};
    return std::mt19937{q};
}

Mersenne twister warm up vs. reproducibility

Possible solution #1

Possible solution #2

3 Answers3