0

I've often heard that you should never mod the result of your random number generator if you want a uniform distribution. However, I've seen that using a std::uniform_int_distribution makes no difference for significantly small ranges.

Below is an example using both mod and uniform_int_distribution for values 0 - 15:

std::mt19937 gen;
gen.seed(0);

int ROWS = 6;
int COLS = 10;

std::cout << "mod: \n";
for (size_t i = 0; i < ROWS; ++i){
    for (size_t j = 0; j < COLS; ++j){
        std::cout << std::setw(2) << gen() % 16 << " ";
    }
    std::cout << "\n";
}
std::cout << "\n";

gen.seed(0);
std::uniform_int_distribution<> distrib(0, 15);

std::cout << "dist: \n";
for (size_t i = 0; i < ROWS; ++i){
    for (size_t j = 0; j < COLS; ++j){
        std::cout << std::setw(2) << distrib(gen) << " ";
    }
    std::cout << "\n";
}

results:

mod: 
12 15  5  0  3 11  3  7  9  3 
 5  2  4  7  6  8  8 12 10  1 
 6  7  7 14  8  1  5  9 13  8 
 9  4  3  0  3  5 14 15 15  0 
 2  3  8  1  3 13  3  3 14  7 
 0  1  9  9 15  0 15 10  4  7

dist: 
12 15  5  0  3 11  3  7  9  3 
 5  2  4  7  6  8  8 12 10  1 
 6  7  7 14  8  1  5  9 13  8 
 9  4  3  0  3  5 14 15 15  0 
 2  3  8  1  3 13  3  3 14  7 
 0  1  9  9 15  0 15 10  4  7

I guess it has something to do with 2 bytes? I'm just wondering how this is valid mathematically since its stepping through the random number generator and modding results. Does this mean mod creates a uniform distribution if the range is small enough? And why a 2 byte range and not more?

Trevor Hickey
  • 36,288
  • 32
  • 162
  • 271
  • 3
    Typically, at least I haven't seen an implementation where it is not, the domain of `[0, RAND_MAX]` is a multiple of `2`. `16` is also a multiple of `2`, so moding the result of `rand` with it should maintain the distribution. Try using a non multiple of 2 and see if you get different results. – NathanOliver Oct 15 '21 at 18:02
  • Your test is insuffcient to rate a _uniform distribution_. Try a couple billion samples. – chux - Reinstate Monica Oct 15 '21 at 18:03
  • 1
    The first problem here is "I've often heard". From whom? Where? Citation needed! Also, what are you using said numbers for? If you need them for cryptography, then OK, other assumptions apply. – Kevin Anderson Oct 15 '21 at 18:04
  • 1
    @NathanOliver More like " Try using a non multiple power of 2" – chux - Reinstate Monica Oct 15 '21 at 18:07
  • `RAND_MAX` may be a low as 32767, max signed 16 bit integer, making it completely unsuitable for ranges larger than that. Just how suitable it is for smaller ranges depends on the implementation of `rand`, but `rand` is very loosely specified. [This XKCD joke](https://xkcd.com/221/) can be frighteningly accurate. – user4581301 Oct 15 '21 at 18:44
  • 1
    Does this answer your question? [Why do people say there is modulo bias when using a random number generator?](https://stackoverflow.com/questions/10984974/why-do-people-say-there-is-modulo-bias-when-using-a-random-number-generator) – Peter O. Oct 15 '21 at 18:46
  • See also: https://stackoverflow.com/questions/69276475/are-lsb-bits-less-random-how-does-it-work-and-what-is-the-best-lcg-implementat/69276573#69276573 – Peter O. Oct 15 '21 at 18:47

1 Answers1

2

Using the modulo operator will frequently introduce a bias into the returned results when the number of unique values returned by your source of random bits is not a multiple of the divisor.

As a simple example, if your random source returns 4 bits (0-15) and you want values in the range 0-2, using gen() % N you'll get 6 0s, 5 1s, and 5 2s. This biases your results to the low side.

Using multiply-then-divide (gen() * N / RANGE) can still leave an imbalance in the specific number of each result returned, but the imbalance will be spread out evenly among the results which reduces or eliminates the low bias. It also has to contend with overflow in the multiplication. With the previous example, you'll get 5 0s, 6 1s, and 5 0s.

A third alternative would be to check the returned bits to see if the value is among the highest result (that would result in the bias) and regenerate the random bits if this is the case. This introduces a conditional in the code and the time to generate a random number is open ended (rather than fixed).

1201ProgramAlarm
  • 32,384
  • 7
  • 42
  • 56