0

I am trying to generate uniformly distributed random numbers with the mt19937 engine and std::random_device as the seed source. If I get lucky I get a couple hundred thousands unique number out of the 4 billion possible values. I was wondering if it could get any better than this.

I attempted to increase the entropy using high resolution timer and the random device using seed_seq (https://stackoverflow.com/a/34493057/5852409) also tried initializing all the 624 states of the mt19937 (https://codereview.stackexchange.com/a/109266). However, did not see any improvement.

#include <random>
#include <iostream>
#include <set>

void main()
{
    std::random_device rd;
    std::mt19937 engn(rd());
    std::uniform_int_distribution<unsigned int> unidist(0, 0xFFFFFFFF - 1);

    std::set<unsigned int> s;
    auto itr = s.insert(unidist(engn));
    int k = 0;
    while (itr.second)
    {
        itr = s.insert(unidist(engn));
        k++;
    }
    std::cout << k << '\n';
}
devbin
  • 61
  • 7
  • After you generated a sample from your distribution, the probability for any one number in the range is ``1/range``. Including the number you were just drawing. As such, the code you give, as I understand it does not say much about the quality of the random number generator. To see if the distribution is truly uniform and does not have clusters and gaps, you need to run the generator for a while, then perform a statistical test to see the probability that this is truly a sample set from a uniform distribution. – BitTickler Jan 16 '18 at 23:35
  • See for example here: https://math.stackexchange.com/questions/2435/is-there-a-simple-test-for-uniform-distributions – BitTickler Jan 16 '18 at 23:38
  • 1
    You cannot simultaneously have unique numbers and random numbers. The more unique numbers you generate, the weaker the randomness of them will be. – Nicol Bolas Jan 17 '18 at 04:55

1 Answers1

2

First and foremost, make sure you understand the birthday paradox. I.e. the fact that you get a duplicate after some ten or hundred thousand numbers does not indicate a statistical deficiency in the mt19937.

As a rough estimate due to this paradox, duplicates become likely after about the square root of possible values even for a perfect random generator, in this case after about 2^16 = 65536 values.

Second, note that entropy does not mean uniqueness of outputs. Imagine throwing a perfectly fair 100-sided die 100 times. The likelihood that at least one value appears twice is actually much greater than the likelihood that each value is seen exactly once. Entropy is a measure for the number of states in a system. Entropy in your case relates to the quality of your seed (covering many different initial states of the PRNG), not the uniqueness of outputs.

Third, if you have a use case where you must ensure uniqueness (of IDs or handles, for example), but you need poor predictability aka randomness, you have basically two options:

  • Store "taken" values and "re-roll" as long as necessary. There are also probabilistic algorithms for this that can detect duplicates with much less RAM, at the cost of a small probability of false positives.
  • Use a much larger – more than twice as many bits – handle space and hope that no collision will occur. This is appropriate when occasional collisions are unwanted but do limited harm, such as leading to an expensive theoretically unnecessary recalculation.
Arne Vogel
  • 6,346
  • 2
  • 18
  • 31