I have seen this question asked a lot but never seen a true concrete answer to it. So I am going to post one here which will hopefully help people understand why exactly there is "modulo bias" when using a random number generator, like `rand()` in C++.

11 Answers
So `rand()` is a pseudo-random number generator which chooses a natural number between 0 and `RAND_MAX`, which is a constant defined in `cstdlib` (see this article for a general overview on `rand()`).
Now what happens if you want to generate a random number between say 0 and 2? For the sake of explanation, let's say `RAND_MAX` is 10 and I decide to generate a random number between 0 and 2 by calling `rand() % 3`. However, `rand() % 3` does not produce the numbers between 0 and 2 with equal probability!
- When `rand()` returns 0, 3, 6, or 9, `rand() % 3 == 0`. Therefore, P(0) = 4/11
- When `rand()` returns 1, 4, 7, or 10, `rand() % 3 == 1`. Therefore, P(1) = 4/11
- When `rand()` returns 2, 5, or 8, `rand() % 3 == 2`. Therefore, P(2) = 3/11
This does not generate the numbers between 0 and 2 with equal probability. Of course for small ranges this might not be the biggest issue but for a larger range this could skew the distribution, biasing the smaller numbers.
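To see those three probabilities fall out concretely, you can simply tally the modulo classes over the generator's whole output range. A throwaway sketch, using the hypothetical RAND_MAX of 10 from the example (not the real constant):

#include <cstdio>

int main()
{
    const int FAKE_RAND_MAX = 10;   // the example's pretend RAND_MAX, not the real one
    const int n = 3;
    int counts[3] = {0, 0, 0};
    for (int i = 0; i <= FAKE_RAND_MAX; ++i)   // each of the 11 outputs is equally likely
        ++counts[i % n];
    printf("%d %d %d\n", counts[0], counts[1], counts[2]);   // prints 4 4 3
}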
So when does `rand() % n` return a range of numbers from 0 to n-1 with equal probability? When `RAND_MAX % n == n - 1`. In this case, along with our earlier assumption that `rand()` does return a number between 0 and `RAND_MAX` with equal probability, the modulo classes of n would also be equally distributed.
So how do we solve this problem? A crude way is to keep generating random numbers until you get a number in your desired range:
int x;
do {
    x = rand();
} while (x >= n);
but that's inefficient for low values of `n`, since you only have an `n/RAND_MAX` chance of getting a value in your range, and so you'll need to perform `RAND_MAX/n` calls to `rand()` on average.
A more efficient approach would be to take some large range with a length divisible by `n`, like `RAND_MAX - RAND_MAX % n`, keep generating random numbers until you get one that lies in the range, and then take the modulus:
int x;
do {
    x = rand();
} while (x >= (RAND_MAX - RAND_MAX % n));
x %= n;
For small values of `n`, this will rarely require more than one call to `rand()`.
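Wrapped up as a reusable helper, the same idea might look like the sketch below (the name `rand_range` is made up here, not a standard function; it assumes `0 < n <= RAND_MAX`):

#include <cstdlib>

// Sketch: uniform value in [0, n), rejecting the biased tail of rand()'s range.
int rand_range(int n)
{
    const int limit = RAND_MAX - RAND_MAX % n;   // cutoff: a multiple of n
    int x;
    do {
        x = rand();
    } while (x >= limit);
    return x % n;
}

As the comments below point out, this still wastes one block of n perfectly usable values when `RAND_MAX + 1` is already an exact multiple of `n`, and it loops forever if `n > RAND_MAX`, so treat it as a sketch rather than a drop-in.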
- Another way of thinking about `RAND_MAX%n == n - 1` is `(RAND_MAX + 1) % n == 0`. When reading code, I tend to understand `% something == 0` as "evenly divisible" more readily than other ways of calculating it. Of course, if your C++ stdlib has `RAND_MAX` as the same value as `INT_MAX`, `(RAND_MAX + 1)` surely wouldn't work; so Mark's calculation remains the safest implementation. – Slipp D. Thompson Jul 19 '16 at 08:04
- I may be nitpicking, but if the goal is to reduce wasted bits we could improve this slightly for the edge condition where RAND_MAX (RM) is only 1 less than being equally divisible by N. In this scenario, no bits need to be wasted by doing X >= (RM - RM % N), which is of little value for small values of N, but becomes of larger value for large values of N. As mentioned by Slipp D. Thompson, there is a solution which will work only when INT_MAX (IM) > RAND_MAX but breaks when they are equal. However, there is a simple solution for this: we can amend the calculation X >= (RM - RM % N) as follows: – Ben Personick Oct 28 '17 at 14:56
- X >= RM - ( ( ( RM % N ) + 1 ) % N ) – Ben Personick Oct 28 '17 at 14:58
- I posted an additional answer explaining the problem in detail and giving the example code solution. – Ben Personick Oct 31 '17 at 12:08
- Is the use of a loop introducing room for a side-channel attack in this case? – Rodolfo Carvalho Nov 26 '20 at 10:54
- Critical weakness: `do { ... } while (x >= (RAND_MAX - RAND_MAX % n));` is an infinite loop when `n > RAND_MAX`. – chux - Reinstate Monica Nov 03 '21 at 03:05
Repeatedly selecting a random number until it falls in range is a good way to remove the bias.
Update
We could make the code fast if we search for an x in a range divisible by `n`.
// Assumptions
// rand() in [0, RAND_MAX]
// n in (0, RAND_MAX]
int x;

// Keep searching for an x in a range divisible by n
do {
    x = rand();
} while (x >= RAND_MAX - (RAND_MAX % n));

x %= n;
The above loop should be very fast, say 1 iteration on average.
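If you want to check that claim for your own `RAND_MAX` and `n`: each draw is accepted with probability `(RAND_MAX - RAND_MAX % n) / (RAND_MAX + 1.0)`, which is always above 1/2, so the expected number of calls (its reciprocal) is always below 2. A throwaway sketch:

#include <cstdio>
#include <cstdlib>

int main()
{
    const int n = 10;   // assumption: pick any n in (0, RAND_MAX]
    // Kept values are [0, RAND_MAX - RAND_MAX % n), out of RAND_MAX + 1 possible outputs.
    double accept = (RAND_MAX - RAND_MAX % n) / (RAND_MAX + 1.0);
    printf("P(accept) = %f, expected calls to rand() = %f\n", accept, 1.0 / accept);
}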

- Yuck :-P converting to a double, then multiplying by MAX_UPPER_LIMIT/RAND_MAX is much cleaner and performs better. – boycy Jun 13 '12 at 07:59
- @boycy: you've missed the point. If the number of values that `rand()` can return is not a multiple of `n`, then whatever you do, you will inevitably get 'modulo bias', unless you discard some of those values. user1413793 explains that nicely (although the solution proposed in that answer is truly yucky). – TonyK Jun 17 '12 at 11:31
- @TonyK my apologies, I did miss the point. Didn't think hard enough, and thought the bias would only apply with methods using an explicit modulus operation. Thanks for fixing me :-) – boycy Jun 18 '12 at 12:26
- Operator precedence makes `RAND_MAX+1 - (RAND_MAX+1) % n` work correctly, but I still think it should be written as `RAND_MAX+1 - ((RAND_MAX+1) % n)` for clarity. – Linus Arver Oct 13 '12 at 05:07
- This won't work if `RAND_MAX == INT_MAX` *(as it does on most systems)*. See my second comment to @user1413793 above. – BlueRaja - Danny Pflughoeft Nov 06 '12 at 22:04
- This function's average case will **always** be fewer than two iterations, no matter what `RAND_MAX` and `n` are chosen to be. – Jared Nielsen Jul 03 '13 at 16:46
- You can fix BlueRaja's complaint by using `UPPER_LIMIT = RAND_MAX - (RAND_MAX % n)`. It will reject an extra `n` numbers unnecessarily in some cases, but it avoids overflow. – Ben Voigt Nov 18 '13 at 15:49
- @BenVoigt, but then when n==RAND_MAX we'll never get `n` as a random value. See my update. – Nick Dandoulakis Nov 18 '13 at 21:14
- Using `x > RAND_MAX + (-RAND_MAX-1)%n` as boundary will remove the slight inefficiency, and handles the case that RAND_MAX is equal to INT_MAX correctly. But I do agree it does not look very intuitive. – fishinear Mar 21 '16 at 16:51
- @BlueRaja-DannyPflughoeft On most systems? I've never seen a libc implementation where `RAND_MAX` isn't `32767` -- Microsoft's Visual libc, GLibC, BSD libc, even across architectures – cat Jun 26 '17 at 03:21
@user1413793 is correct about the problem. I'm not going to discuss that further, except to make one point: yes, for small values of `n` and large values of `RAND_MAX`, the modulo bias can be very small. But using a bias-inducing pattern means that you must consider the bias every time you calculate a random number and choose different patterns for different cases. And if you make the wrong choice, the bugs it introduces are subtle and almost impossible to unit test. Compared to just using the proper tool (such as `arc4random_uniform`), that's extra work, not less work. Doing more work and getting a worse solution is terrible engineering, especially when doing it right every time is easy on most platforms.
Unfortunately, the implementations of the solution are all incorrect or less efficient than they should be. (Each solution has various comments explaining the problems, but none of the solutions have been fixed to address them.) This is likely to confuse the casual answer-seeker, so I'm providing a known-good implementation here.
Again, the best solution is just to use `arc4random_uniform` on platforms that provide it, or a similar ranged solution for your platform (such as `Random.nextInt` on Java). It will do the right thing at no code cost to you. This is almost always the correct call to make.
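For example, on platforms that ship it (macOS and the BSDs; Linux can get it through libbsd), getting an unbiased value in a range is a one-liner. A minimal sketch, assuming the platform provides the function:

#include <cstdint>
#include <cstdlib>   // declares arc4random_uniform() on macOS/BSD; on Linux try <bsd/stdlib.h> with -lbsd

int main()
{
    uint32_t roll = arc4random_uniform(6);   // unbiased value in [0, 6), e.g. a die roll
    return (int)roll;
}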
If you don't have `arc4random_uniform`, then you can use the power of open source to see exactly how it is implemented on top of a wider-range RNG (`arc4random` in this case, but a similar approach could also work on top of other RNGs).
Here is the OpenBSD implementation:
/*
 * Calculate a uniformly distributed random number less than upper_bound
 * avoiding "modulo bias".
 *
 * Uniformity is achieved by generating new random numbers until the one
 * returned is outside the range [0, 2**32 % upper_bound). This
 * guarantees the selected random number will be inside
 * [2**32 % upper_bound, 2**32) which maps back to [0, upper_bound)
 * after reduction modulo upper_bound.
 */
u_int32_t
arc4random_uniform(u_int32_t upper_bound)
{
    u_int32_t r, min;

    if (upper_bound < 2)
        return 0;

    /* 2**32 % x == (2**32 - x) % x */
    min = -upper_bound % upper_bound;

    /*
     * This could theoretically loop forever but each retry has
     * p > 0.5 (worst case, usually far better) of selecting a
     * number inside the range we need, so it should rarely need
     * to re-roll.
     */
    for (;;) {
        r = arc4random();
        if (r >= min)
            break;
    }

    return r % upper_bound;
}
It is worth noting the latest commit comment on this code for those who need to implement similar things:

Change arc4random_uniform() to calculate `2**32 % upper_bound` as `-upper_bound % upper_bound`. Simplifies the code and makes it the same on both ILP32 and LP64 architectures, and also slightly faster on LP64 architectures by using a 32-bit remainder instead of a 64-bit remainder.

Pointed out by Jorden Verwer on tech@. ok deraadt; no objections from djm or otto
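The identity that commit relies on, `2**32 % x == (2**32 - x) % x`, works because unsigned 32-bit arithmetic wraps around: negating `upper_bound` yields exactly `2**32 - upper_bound`. A throwaway sketch to convince yourself (assuming a typical platform where `int` is 32 bits, per Ian Abbott's caveat in the comments below):

#include <cstdint>
#include <cstdio>

int main()
{
    uint32_t upper_bound = 6;                             // any non-zero bound will do
    uint32_t min  = -upper_bound % upper_bound;           // unsigned wrap-around: (2^32 - 6) % 6
    uint64_t wide = ((uint64_t)1 << 32) % upper_bound;    // the same value computed in 64 bits
    printf("%u %u\n", (unsigned)min, (unsigned)wide);     // both print 4
}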
The Java implementation is also easily findable (see previous link):
public int nextInt(int n) {
    if (n <= 0)
        throw new IllegalArgumentException("n must be positive");

    if ((n & -n) == n)  // i.e., n is a power of 2
        return (int)((n * (long)next(31)) >> 31);

    int bits, val;
    do {
        bits = next(31);
        val = bits % n;
    } while (bits - val + (n-1) < 0);
    return val;
}
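The non-obvious part is the rejection test `bits - val + (n-1) < 0`: it rejects `bits` exactly when it falls in the incomplete block at the top of the 31-bit range, detected via the (well-defined in Java) overflow of the signed addition. Ported literally to C or C++ that overflow would be undefined behaviour, so a sketch of the same logic with an explicit cutoff might look like this (`bits31()` is a hypothetical stand-in for a source of 31 uniform random bits; the power-of-two fast path is omitted):

#include <cstdint>

// Hypothetical source of 31 uniform random bits, i.e. a value in [0, 2^31 - 1].
uint32_t bits31();

// Sketch of the same rejection logic as nextInt(n), rewritten with an explicit
// cutoff so it stays well-defined in C++. Assumes n > 0.
int next_int(int n)
{
    const uint32_t range = (uint32_t)1 << 31;             // number of possible bits31() outputs
    const uint32_t limit = range - range % (uint32_t)n;   // start of the incomplete block at the top
    uint32_t bits;
    do {
        bits = bits31();
    } while (bits >= limit);                              // reject the biased tail
    return (int)(bits % (uint32_t)n);
}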

- Note that if `arcfour_random()` actually uses the real RC4 algorithm in its implementation, the output will definitely have some bias. Hopefully your library authors have switched to using a better CSPRNG behind the same interface. I recall one of the BSDs now actually uses the ChaCha20 algorithm to implement `arcfour_random()`. More on the RC4 output biases which render it useless for security or other critical applications such as video poker: http://blog.cryptographyengineering.com/2013/03/attack-of-week-rc4-is-kind-of-broken-in.html?m=1 – rmalayter Aug 09 '16 at 01:38
- @rmalayter On iOS and OS X, arc4random reads from /dev/random which is the highest quality entropy in the system. (The "arc4" in the name is historic and preserved for compatibility.) – Rob Napier Aug 09 '16 at 01:51
- @Rob_Napier good to know, but `/dev/random` has also used RC4 on some platforms in the past (Linux uses SHA-1 in counter mode). Unfortunately the man pages I found via search indicate that RC4 is still in use on various platforms that offer `arc4random` (though the actual code may be different). – rmalayter Aug 09 '16 at 02:36
- @JonMcClung It's a very good question, but the answer is (surprisingly) no. This is a uint32_t, so if `x` is greater than 2^31, `-x` is actually positive (since it's evaluated as an unsigned integer in that context). Oh the glory of unsigned… As an example, -2147483650 evaluated as a UInt32 is 2147483646, and -4294967290 is 6. – Rob Napier Mar 08 '19 at 21:46
- @RobNapier aha! Thanks for explaining. But, is a two's complement representation guaranteed? Even if it is for Objective-C, this question is marked C++/language-agnostic. – Jon McClung Mar 08 '19 at 22:07
- That code is C, and it's guaranteed to work in C. By extension, it also applies to Objective-C and C++. It wouldn't work in a language that doesn't handle unsigned integers identically to C, such as Swift. (I don't think Go has that unsigned behavior either, but I can't remember.) – Rob Napier Mar 09 '19 at 20:15
- @JonMcClung `-upper_bound % upper_bound` will indeed be 0 if `int` is wider than 32-bits. It should be `(u_int32_t)-upper_bound % upper_bound` (assuming `u_int32_t` is a BSD-ism for `uint32_t`). – Ian Abbott Aug 15 '19 at 17:13
Definition
Modulo Bias is the inherent bias in using modulo arithmetic to reduce an output set to a subset of the input set. In general, a bias exists whenever the mapping between the input and output set is not equally distributed, as in the case of using modulo arithmetic when the size of the output set is not a divisor of the size of the input set.
This bias is particularly hard to avoid in computing, where numbers are represented as strings of bits: 0s and 1s. Finding truly random sources of entropy is also extremely difficult, but that is beyond the scope of this discussion. For the remainder of this answer, assume that there exists an unlimited source of truly random bits.
Problem Example
Let's consider simulating a die roll (0 to 5) using these random bits. There are 6 possibilities, so we need enough bits to represent the number 6, which is 3 bits. Unfortunately, 3 random bits yields 8 possible outcomes:
000 = 0, 001 = 1, 010 = 2, 011 = 3
100 = 4, 101 = 5, 110 = 6, 111 = 7
We can reduce the size of the outcome set to exactly 6 by taking the value modulo 6; however, this presents the modulo bias problem: `110` yields a 0, and `111` yields a 1. This die is loaded.
Potential Solutions
Approach 0:
Rather than rely on random bits, in theory one could hire a small army to roll dice all day and record the results in a database, and then use each result only once. This is about as practical as it sounds, and more than likely would not yield truly random results anyway (pun intended).
Approach 1:
Instead of using the modulus, a naive but mathematically correct solution is to discard results that yield `110` and `111` and simply try again with 3 new bits. Unfortunately, this means that there is a 25% chance on each roll that a re-roll will be required, including each of the re-rolls themselves. This is clearly impractical for all but the most trivial of uses.
Approach 2:
Use more bits: instead of 3 bits, use 4. This yields 16 possible outcomes. Of course, re-rolling any time the result is greater than 5 makes things worse (10/16 = 62.5%) so that alone won't help.
Notice that 2 * 6 = 12 < 16, so we can safely take any outcome less than 12 and reduce that modulo 6 to evenly distribute the outcomes. The other 4 outcomes must be discarded, and then re-rolled as in the previous approach.
Sounds good at first, but let's check the math:
4 discarded results / 16 possibilities = 25%
In this case, 1 extra bit didn't help at all!
That result is unfortunate, but let's try again with 5 bits:
32 % 6 = 2 discarded results; and
2 discarded results / 32 possibilities = 6.25%
A definite improvement, but not good enough in many practical cases. The good news is, adding more bits will never increase the chances of needing to discard and re-roll. This holds not just for dice, but in all cases.
As demonstrated, however, adding 1 extra bit may not change anything. In fact, if we increase our roll to 6 bits, the probability remains 6.25%.
This raises 2 additional questions:
- If we add enough bits, is there a guarantee that the probability of a discard will diminish?
- How many bits are enough in the general case?
General Solution
Thankfully the answer to the first question is yes. The problem with 6 is that 2^x mod 6 flips between 2 and 4, which happen to be a factor of 2 apart, so that for an odd x > 1,
[2^x mod 6] / 2^x == [2^(x+1) mod 6] / 2^(x+1)
Thus 6 is an exception rather than the rule. It is possible to find larger moduli that yield consecutive powers of 2 in the same way, but eventually this must wrap around, and the probability of a discard will be reduced.
Without offering further proof, in general using double the number of bits required will provide a smaller, usually insignificant, chance of a discard.
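To see how the discard probability behaves as bits are added, you can simply tabulate (2^x mod 6) / 2^x for increasing x. A throwaway sketch:

#include <cstdio>
#include <cstdint>

int main()
{
    const uint64_t modulus = 6;   // the die example; try other moduli too
    for (int bits = 3; bits <= 12; ++bits) {
        uint64_t outcomes = (uint64_t)1 << bits;
        uint64_t discarded = outcomes % modulus;   // values that would force a re-roll
        printf("%2d bits: %llu of %llu discarded (%.4f%%)\n",
               bits,
               (unsigned long long)discarded,
               (unsigned long long)outcomes,
               100.0 * (double)discarded / (double)outcomes);
    }
}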
Proof of Concept
Here is an example program that uses OpenSSL's libcrypto to supply random bytes. When compiling, be sure to link to the library with `-lcrypto`, which most everyone should have available.
#include <iostream>
#include <assert.h>
#include <limits>
#include <cstdint>
#include <openssl/rand.h>

volatile uint32_t dummy;
uint64_t discardCount;

uint32_t uniformRandomUint32(uint32_t upperBound)
{
    assert(RAND_status() == 1);
    uint64_t randomPool;   // declared separately so RAND_bytes() fills it rather than replacing it
    uint64_t discard = (std::numeric_limits<uint64_t>::max() - upperBound) % upperBound;
    RAND_bytes((uint8_t*)(&randomPool), sizeof(randomPool));
    while (randomPool > (std::numeric_limits<uint64_t>::max() - discard)) {
        RAND_bytes((uint8_t*)(&randomPool), sizeof(randomPool));
        ++discardCount;
    }
    return randomPool % upperBound;
}

int main() {
    discardCount = 0;
    const uint32_t MODULUS = (1ul << 31) - 1;
    const uint32_t ROLLS = 10000000;
    for (uint32_t i = 0; i < ROLLS; ++i) {
        dummy = uniformRandomUint32(MODULUS);
    }
    std::cout << "Discard count = " << discardCount << std::endl;
}
I encourage playing with the `MODULUS` and `ROLLS` values to see how many re-rolls actually happen under most conditions. A sceptical person may also wish to save the computed values to a file and verify the distribution appears uniform.
- I really hope nobody has blindly copied your uniform random implementation. The `randomPool = RAND_bytes(...)` line will always result in `randomPool == 1` due to the assertion. This *always* results in a discard and a re-roll. I think you wanted to declare on a separate line. Consequently, this caused the RNG to return `1` for every iteration. – Qix - MONICA WAS MISTREATED Dec 22 '17 at 03:31
- To be clear, `randomPool` will always evaluate to `1` according to the OpenSSL [documentation for `RAND_bytes()`](https://www.openssl.org/docs/man1.0.2/crypto/RAND_bytes.html) since it will always succeed thanks to the `RAND_status()` assertion. – Qix - MONICA WAS MISTREATED Dec 22 '17 at 03:37
Mark's Solution (The accepted solution) is Nearly Perfect.
int x; do { x = rand(); } while (x >= (RAND_MAX - RAND_MAX % n)); x %= n;
However, it has a caveat which discards 1 valid set of outcomes in any scenario where `RAND_MAX` (`RM`) is 1 less than a multiple of `N` (where `N` = the number of possible valid outcomes).

ie, when the 'count of values discarded' (`D`) is equal to `N`, then they are actually a valid set (`V`), not an invalid set (`I`).

What causes this is that at some point Mark loses sight of the difference between `N` and `RAND_MAX`.

`N` is a set whose valid members are comprised only of positive integers, as it contains a count of responses that would be valid. (eg: Set `N` = {1, 2, 3, ... n })

`RAND_MAX`, however, is a set which (as defined for our purposes) includes any number of non-negative integers.

In its most generic form, what is defined here as `RAND_MAX` is the set of all valid outcomes, which could theoretically include negative numbers or non-numeric values.

Therefore `RAND_MAX` is better defined as the set of "possible responses".

However, `N` operates against the count of the values within the set of valid responses, so even as defined in our specific case, `RAND_MAX` will be a value one less than the total number it contains.

Using Mark's Solution, values are discarded when: X >= RM - RM % N
EG:

Ran Max Value (RM) = 255
Valid Outcome (N) = 4

When X >= 252, discarded values for X are: 252, 253, 254, 255

So, if Random Value Selected (X) = {252, 253, 254, 255}

Number of discarded values (I) = RM % N + 1 == N

IE:

I = RM % N + 1
I = 255 % 4 + 1
I = 3 + 1
I = 4

X >= ( RM - RM % N )
255 >= (255 - 255 % 4)
255 >= (255 - 3)
255 >= (252)

Discard Returns $True
As you can see in the example above, when the value of X (the random number we get from the initial function) is 252, 253, 254, or 255 we would discard it even though these four values comprise a valid set of returned values.
IE: When the count of the values Discarded (I) = N (The number of valid outcomes) then a Valid set of return values will be discarded by the original function.
If we describe the difference between the values N and RM as D, ie:
D = (RM - N)
Then as the value of D becomes smaller, the percentage of unneeded re-rolls due to this method increases at each natural multiple. (When RAND_MAX is NOT equal to a prime number this is of valid concern.)
EG:
RM=255 , N=2 Then: D = 253, Lost percentage = 0.78125%
RM=255 , N=4 Then: D = 251, Lost percentage = 1.5625%
RM=255 , N=8 Then: D = 247, Lost percentage = 3.125%
RM=255 , N=16 Then: D = 239, Lost percentage = 6.25%
RM=255 , N=32 Then: D = 223, Lost percentage = 12.5%
RM=255 , N=64 Then: D = 191, Lost percentage = 25%
RM=255 , N= 128 Then D = 127, Lost percentage = 50%
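Those percentages can be reproduced directly: with the original cutoff, the number of discarded values is RM % N + 1 out of RM + 1 possibilities. A quick sketch that regenerates the table:

#include <cstdio>

int main()
{
    const int RM = 255;   // the example's small RAND_MAX
    const int ns[] = {2, 4, 8, 16, 32, 64, 128};
    for (int n : ns) {
        int discarded = RM % n + 1;   // a whole block of n values is dropped in this scenario
        printf("N = %3d: %3d of %d values discarded (%.5f%%)\n",
               n, discarded, RM + 1, 100.0 * discarded / (RM + 1));
    }
}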
Since the percentage of re-rolls needed increases the closer N comes to RM, this can be of valid concern at many different values depending on the constraints of the system running the code and the values being looked for.
To negate this we can make a simple amendment, as shown here:

int x;
do {
    x = rand();
} while (x > (RAND_MAX - ( ( ( RAND_MAX % n ) + 1 ) % n) ) );
x %= n;
This provides a more general version of the formula which accounts for the additional peculiarities of using modulus to define your max values.
Examples of using a small value for RAND_MAX where RM + 1 is a multiple of N.
Mark's original Version:
RAND_MAX = 3, n = 2, Values in RAND_MAX = 0,1,2,3, Valid Sets = 0,1 and 2,3.
When X >= (RAND_MAX - ( RAND_MAX % n ) )
When X >= 2 the value will be discarded, even though the set is valid.
Generalized Version 1:
RAND_MAX = 3, n = 2, Values in RAND_MAX = 0,1,2,3, Valid Sets = 0,1 and 2,3.
When X > (RAND_MAX - ( ( RAND_MAX % n ) + 1 ) % n )
When X > 3 the value would be discarded, but this is not a value in the set RAND_MAX so there will be no discard.
Additionally, there is the case where N should be the count of all values in RAND_MAX; in this case, you could set N = RAND_MAX + 1, unless RAND_MAX = INT_MAX.
Loop-wise you could just use N = 1, and any value of X will be accepted, and put an IF statement in for your final multiplier. But perhaps you have code that may have a valid reason to return a 1 when the function is called with n = 1...
So it may be better to use 0, which would normally provide a Div 0 Error, when you wish to have n = RAND_MAX+1
Generalized Version 2:
int x;

if (n != 0) {
    do {
        x = rand();
    } while (x > (RAND_MAX - ( ( ( RAND_MAX % n ) + 1 ) % n) ) );
    x %= n;
} else {
    x = rand();
}
Both of these solutions resolve the issue with needlessly discarded valid results which will occur when RM + 1 is an exact multiple of n.
The second version also covers the edge case scenario when you need n to equal the total possible set of values contained in RAND_MAX.
The modified approach in both is the same and allows for a more general solution to the need of providing valid random numbers and minimizing discarded values.
To reiterate:
The Basic General Solution which extends Mark's example:
// Assumes:
//   RAND_MAX is a globally defined constant, returned from the environment.
//   int n;  // User input, or externally defined, number of valid choices.

int x;

do {
    x = rand();
} while (x > (RAND_MAX - ( ( ( RAND_MAX % n ) + 1 ) % n) ) );
x %= n;
The Extended General Solution which Allows one additional scenario of RAND_MAX+1 = n:
// Assumes:
//   RAND_MAX is a globally defined constant, returned from the environment.
//   int n;  // User input, or externally defined, number of valid choices.

int x;

if (n != 0) {
    do {
        x = rand();
    } while (x > (RAND_MAX - ( ( ( RAND_MAX % n ) + 1 ) % n) ) );
    x %= n;
} else {
    x = rand();
}
In some languages ( particularly interpreted languages ) doing the calculations of the compare-operation outside of the while condition may lead to faster results as this is a one-time calculation no matter how many re-tries are required. YMMV!
// Assumes:
//   RAND_MAX is a globally defined constant, returned from the environment.
//   int n;  // User input, or externally defined, number of valid choices.

int x;  // Resulting random number
int y;  // One-time calculation of the compare value for x

if (n != 0) {
    y = RAND_MAX - ( ( ( RAND_MAX % n ) + 1 ) % n );
    do {
        x = rand();
    } while (x > y);
    x %= n;
} else {
    x = rand();
}

- Isn't it safe to say that the problem with Mark's solution is that he treats RAND_MAX and n as being the same "unit of measure" when in fact they mean two different things? While n represents the resulting "number of possibilities", RAND_MAX only represents the max value of the original possibility, where RAND_MAX + 1 would be the original number of possibilities. I'm surprised he didn't get to your conclusion since he seemed to have acknowledged n and RAND_MAX were not the same thing with the equation: `RAND_MAX%n = n - 1` – Danilo Souza Morães Aug 11 '19 at 02:16
- @DaniloSouzaMorães Thank you Danilo, you have put the matter very succinctly. I went for demonstrating what he was doing along with the why and how of it, but don't think I was ever able to state WHAT he was doing wrong eloquently, as I get so wrapped up in the details of the logic on how and why there is an issue that I am not stating as clearly what is at issue. Do you mind if I amend my answer to use some of what you've written here as my own summary to the issue of what and where the accepted solution is doing what needs to be addressed near the top? – Ben Personick Oct 16 '19 at 18:14
- The last edit (2020) is IMO wrong, @BenPersonick. `y` is not used outside the `n != 0` branch and it makes no sense outside the branch due to division by zero (`... % n`). – Palec Jan 01 '22 at 01:14
- @Palec y stops the need to run the static calculation more than once per run of the code, as other solutions require it to run at every iteration, wasting CPU cycles. I am at New Year's Eve dinner, but that is an example of how to speed up the code. y must always be calculated once per run; storing y uses memory space, but it means the compare will likely be one cached memory access vs an actual CPU calculation, though it's possible the CPU compare will also be done entirely from the cache too, so there may be no difference, or which is faster may vary. YMMV – Ben Personick Jan 01 '22 at 01:31
- @BenPersonick, I understand why `y` is needed, i.e. that some compilers will not hoist it out of the loop and manual hoisting is needed. I just think that the definition of `y` should take place just before the do-while loop and no earlier. Think about when `n == 0`. Happy New Year! :-) – Palec Jan 01 '22 at 01:41
- @Palec Ah, no that's fine, I see how it's fair to put it inside the first compare; it doesn't need to be calculated in that edge case. But it was just an example of how it could be optimized a bit. – Ben Personick Jan 02 '22 at 03:30
There are two usual complaints with the use of modulo.

- One is valid for all generators. It is easier to see in a limit case. If your generator has a RAND_MAX which is 2 (that isn't compliant with the C standard) and you want only 0 or 1 as values, using modulo will generate 0 twice as often (when the generator generates 0 and 2) as it will generate 1 (when the generator generates 1). Note that this is true as soon as you don't drop values: whatever the mapping you are using from the generator values to the wanted ones, one will occur twice as often as the other.

- Some kinds of generators have their less significant bits less random than the others, at least for some of their parameters, but sadly those parameters have other interesting characteristics (such as being able to have RAND_MAX one less than a power of 2). The problem is well known, and for a long time library implementations have probably avoided the problem (for instance the sample rand() implementation in the C standard uses this kind of generator, but drops the 16 less significant bits), but some like to complain about that and you may have bad luck.
Using something like
int alea(int n) {
    assert(0 < n && n <= RAND_MAX);
    int partSize =
        n == RAND_MAX ? 1 : 1 + (RAND_MAX - n) / (n + 1);
    int maxUsefull = partSize * n + (partSize - 1);
    int draw;
    do {
        draw = rand();
    } while (draw > maxUsefull);
    return draw / partSize;
}
to generate a random number between 0 and n will avoid both problems (and it avoids overflow with RAND_MAX == INT_MAX)
BTW, C++11 introduced standard ways to do the reduction and other generators than rand().
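For completeness, a minimal sketch of that C++11 route; `std::uniform_int_distribution` handles the rejection internally, so there is no modulo bias to manage yourself:

#include <cstdio>
#include <random>

int main()
{
    std::random_device rd;                          // seed source (quality varies by platform)
    std::mt19937 gen(rd());                         // Mersenne Twister engine
    std::uniform_int_distribution<int> dist(0, 5);  // closed range [0, 5], i.e. 6 outcomes
    for (int i = 0; i < 10; ++i)
        printf("%d ", dist(gen));
    printf("\n");
}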

- `n == RAND_MAX ? 1 : (RAND_MAX-1)/(n+1)`: I understand the idea here is to first divide RAND_MAX into equal page size N, then return the deviation within N, but I cannot map the code to this precisely. – zinking Jun 15 '12 at 03:18
- The naive version should be (RAND_MAX+1)/(n+1) as there are RAND_MAX+1 values to divide into n+1 buckets. In order to avoid overflow when computing RAND_MAX+1, it can be transformed into 1+(RAND_MAX-n)/(n+1). In order to avoid overflow when computing n+1, the case n==RAND_MAX is first checked. – AProgrammer Jun 15 '12 at 06:42
- Plus, doing a divide seems to cost more even compared with regenerating numbers. – zinking Jun 15 '12 at 08:56
- Taking the modulo and dividing have the same cost. Some ISAs even provide just one instruction which always provides both. The cost of regenerating numbers will depend on n and RAND_MAX. If n is small with respect to RAND_MAX, it may cost a lot. And obviously you may decide that the biases aren't important for your application; I just give a way to avoid them. – AProgrammer Jun 15 '12 at 09:10
With a `RAND_MAX` value of `3` (in reality it should be much higher than that but the bias would still exist), it makes sense from these calculations that there is a bias:

1 % 2 = 1
2 % 2 = 0
3 % 2 = 1
random_between(1, 3) % 2 = more likely a 1

In this case, the `% 2` is what you shouldn't do when you want a random number between `0` and `1`. You could get a random number between `0` and `2` by doing `% 3` though, because in this case `RAND_MAX` is a multiple of `3`.
Another method
It is much simpler, but to add to the other answers, here is my solution to get a random number between `0` and `n - 1`, so `n` different possibilities, without bias.

- the number of bits (not bytes) needed to encode the number of possibilities is the number of bits of random data you'll need
- encode the number from random bits
- if this number is `>= n`, restart (no modulo).

Really random data is not easy to obtain, so why use more bits than needed.
Below is an example in Smalltalk, using a cache of bits from a pseudo-random number generator. I'm no security expert so use at your own risk.
next: n
    | bitSize r from to |
    n < 0 ifTrue: [^0 - (self next: 0 - n)].
    n = 0 ifTrue: [^nil].
    n = 1 ifTrue: [^0].
    cache isNil ifTrue: [cache := OrderedCollection new].
    cache size < (self randmax highBit) ifTrue: [
        Security.DSSRandom default next asByteArray do: [ :byte |
            (1 to: 8) do: [ :i | cache add: (byte bitAt: i)]
        ]
    ].
    r := 0.
    bitSize := n highBit.
    to := cache size.
    from := to - bitSize + 1.
    (from to: to) do: [ :i |
        r := r bitAt: i - from + 1 put: (cache at: i)
    ].
    cache removeFrom: from to: to.
    r >= n ifTrue: [^self next: n].
    ^r
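The same approach translates to C++ roughly as follows; `random_bit()` here is a hypothetical stand-in for whatever unbiased bit source you have (a cache like the one above, a CSPRNG, etc.):

#include <cstdint>

// Hypothetical source of one unbiased random bit (0 or 1).
int random_bit();

// Sketch: draw only as many bits as needed to cover n, and re-draw whenever
// the result is >= n (no modulo). Assumes 1 <= n <= 2^31.
uint32_t next_below(uint32_t n)
{
    int bits = 0;                        // number of bits needed to represent n - 1
    while (((uint32_t)1 << bits) < n)
        ++bits;
    for (;;) {
        uint32_t r = 0;
        for (int i = 0; i < bits; ++i)
            r = (r << 1) | (uint32_t)random_bit();
        if (r < n)                       // out-of-range values are simply re-drawn
            return r;
    }
}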

Modulo reduction is a commonly seen way to make a random integer generator avoid the worst case of running forever.
When the range of possible integers is unknown, however, there is no way in general to "fix" this worst case of running forever without introducing bias. It's not just modulo reduction (`rand() % n`, discussed in the accepted answer) that will introduce bias this way, but also the "multiply-and-shift" reduction of Daniel Lemire, or if you stop rejecting an outcome after a set number of iterations. (To be clear, this doesn't mean there is no way to fix the bias issues present in pseudorandom generators. For example, even though modulo and other reductions are biased in general, they will have no issues with bias if the range of possible integers is a power of 2 and if the random generator produces unbiased random bits or blocks of them.)

The following answer of mine discusses the relationship between running time and bias in random generators, assuming we have a "true" random generator that can produce unbiased and independent random bits. The answer doesn't even involve the `rand()` function in C because it has many issues. Perhaps the most serious here is the fact that the C standard does not explicitly specify a particular distribution for the numbers returned by `rand()`, not even a uniform distribution.
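That power-of-two case is the one situation where nothing ever needs to be rejected: if n == 2^k and the source bits are unbiased, masking off the low k bits is already uniform. A minimal sketch, with `random_u32()` standing in for a hypothetical unbiased 32-bit source:

#include <cstdint>

// Hypothetical generator returning 32 unbiased random bits.
uint32_t random_u32();

// Sketch: uniform value in [0, n) when n is a power of two (and n != 0).
uint32_t uniform_pow2(uint32_t n)
{
    return random_u32() & (n - 1);   // keep the low bits; no rejection loop needed
}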

- Aside from taking care of a shifted range, which should have no bearing on OP's question (and which IMO in all the answers here including this one only seems to serve to muddy the waters on what is being accomplished). That said, this code seems to just be addressing the same underlying cause of modulo bias itself, which is that RAND_MAX + 1 will always be a power of 2, and so when the set is NOT a power of 2 then you must discard the values falling into the bad set. This is addressed in my and the accepted answer, but you seem to think it is not. – Ben Personick Jan 06 '21 at 20:23
- @BenPersonick: My answer says there is no way to "fix" the worst case *of running forever* without introducing bias, not that there is no way to fix the bias issues present with pseudorandom generators. When the range of integers is unknown, the bias issue can only be solved, in general, through rejection sampling, such as techniques given in your answer or this one, and rejection sampling has an unbounded worst case running time. I will clarify this answer. – Peter O. Jan 07 '21 at 00:24
- Ah, I gotcha, it was not immediately clear to me that your point was to bring up the implicit issue all of our code presents. Although, practically speaking your chances of it running forever are fairly minute unless the underlying pseudorandom number generation has significant bias. Each round has a chance of being a discard, never actually reaching 50%. – Ben Personick Jan 07 '21 at 02:26
- Ie. `2^(N-1)-1` is the max discard (where `N` is the power of 2 that represents the set of outcomes of `RAND_MAX`, ie `2^N` is the count of the set of values that the random function may return while `RAND_MAX` is `2^N-1`). Thus for ease of review we will call the maximum chance of discard 1/2 each round. Could this go on forever? Yes, it is possible, but will it? It is exceedingly improbable. – Ben Personick Jan 07 '21 at 03:24
- @BenPersonick: Yes, rejection sampling can be implemented in constant *expected* time as you mention. – Peter O. Jan 07 '21 at 03:26
- So for our argument using 1/2 chances (aka 50% probability) we have (surprise) another power of 2. Your probability of having a discarded set `X` times is `1/(2^X)` (eg 1 discard `1/(2^1)` 50% -- 4 discards is `1/16` 6.25% -- 10 discards `1/1024` 0.00098% -- to be discarded 100 times in a row would be `1/(2^100)`, which is 7.88860905E−31 %). In fact you can add the percentages of each roll to see that 93.75% of attempts would discard no more than 4 times, 96.87% would discard no more than 5 times, 98.43% within 6 tries, and there is a 99.88% chance of no more than 10 discards. This is worst case. – Ben Personick Jan 07 '21 at 03:57
As the accepted answer indicates, "modulo bias" has its roots in the low value of `RAND_MAX`. He uses an extremely small value of `RAND_MAX` (10) to show that if RAND_MAX were 10, then if you tried to generate a number between 0 and 2 using %, the following outcomes would result:
rand() % 3 // if RAND_MAX were only 10, gives
output of rand() | rand()%3
0 | 0
1 | 1
2 | 2
3 | 0
4 | 1
5 | 2
6 | 0
7 | 1
8 | 2
9 | 0
So there are 4 outputs of 0's (4/10 chance) and only 3 outputs of 1 and 2 (3/10 chances each).
So it's biased. The lower numbers have a better chance of coming out.
But that only shows up so obviously when `RAND_MAX` is small. Or more specifically, when the number you are modding by is large compared to `RAND_MAX`.
A much better solution than looping (which is insanely inefficient and shouldn't even be suggested) is to use a PRNG with a much larger output range. The Mersenne Twister algorithm has a maximum output of 4,294,967,295. As such, doing `MersenneTwister::genrand_int32() % 10` will, for all intents and purposes, be equally distributed and the modulo bias effect will all but disappear.
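If you want to put a number on the bias that remains, the arithmetic is short: with a 32-bit output range and n = 10, a few residues get exactly one extra preimage out of 2^32. A sketch of that calculation:

#include <cstdio>
#include <cstdint>

int main()
{
    const uint64_t range = (uint64_t)1 << 32;   // a 32-bit generator has 2^32 equally likely outputs
    const uint64_t n = 10;
    uint64_t extra = range % n;                 // this many residues get one extra preimage
    printf("%llu residues occur %llu times, the other %llu occur %llu times\n",
           (unsigned long long)extra,
           (unsigned long long)(range / n + 1),
           (unsigned long long)(n - extra),
           (unsigned long long)(range / n));
    printf("relative over-representation = %.3e\n", 1.0 / (double)(range / n));
}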
- Yours is more efficient and it probably is true that if RAND_MAX is significantly bigger than the number you are modding by, however yours will still be biased. Granted these are all pseudo random number generators anyways and that in and of itself is a different topic, but if you assume a fully random number generator, your way still biases the lower values. – user1413793 Apr 16 '13 at 03:09
- Because the highest value is odd, `MT::genrand_int32()%2` picks 0 (50 + 2.3e-8)% of the time and 1 (50 - 2.3e-8)% of the time. Unless you're building a casino's RNG (which you probably would use a much larger range RNG for), any user is not going to notice an extra 2.3e-8% of the time. You're talking about numbers too small to matter here. – bobobobo Apr 16 '13 at 04:08
- Looping is the best solution. It is not "insanely inefficient"; requiring less than twice the iterations in the worst average case. Using a high `RAND_MAX` value will decrease the modulo bias, but not eliminate it. Looping will. – Jared Nielsen Jul 03 '13 at 16:22
- If `RAND_MAX` is sufficiently bigger than the number you are modding by, the number of times you need to regenerate the random number is vanishingly small and won't affect the efficiency. I say keep the looping, as long as you're testing against the largest multiple of `n` rather than just `n` as proposed by the accepted answer. – Mark Ransom Apr 08 '15 at 01:08
I want random doubles a lot for various software. I find the range more "random" if I use `((double)rand() / RAND_MAX)`. So I'm guessing if you multiply that by your number range you can get a less biased random number?
i.e. `((double)rand() / RAND_MAX) * 3`.
I read an answer about getting a random number out of 2: `isodd(rand())`?

- This does not really answer the question. If you have a different question, you can ask it by clicking [Ask Question](https://stackoverflow.com/questions/ask). To get notified when this question gets new answers, you can [follow this question](https://meta.stackexchange.com/q/345661). Once you have enough [reputation](https://stackoverflow.com/help/whats-reputation), you can also [add a bounty](https://stackoverflow.com/help/privileges/set-bounties) to draw more attention to this question. - [From Review](/review/late-answers/34827626) – YesThatIsMyName Aug 17 '23 at 11:52
I just wrote some code for Von Neumann's Unbiased Coin Flip Method, which should theoretically eliminate any bias in the random number generation process. More info can be found at http://en.wikipedia.org/wiki/Fair_coin
int unbiased_random_bit() {
    int x1, x2, prev;
    prev = 2;
    x1 = rand() % 2;
    x2 = rand() % 2;
    for (;; x1 = rand() % 2, x2 = rand() % 2)
    {
        if (x1 ^ x2)       // 01 -> 1, or 10 -> 0.
        {
            return x2;
        }
        else if (x1 & x2)
        {
            if (!prev)     // 0011
                return 1;
            else
                prev = 1;  // 1111 -> continue, bias unresolved
        }
        else
        {
            if (prev == 1) // 1100
                return 0;
            else           // 0000 -> continue, bias unresolved
                prev = 0;
        }
    }
}

- This doesn't address modulo bias. This process could be used to eliminate bias in a bit stream. However, to get from a bit stream to an even distribution from 0 to n where n is not one less than a power of two requires addressing modulo bias. Thus this solution cannot eliminate *any bias in the random number generation process.* – Rick Aug 05 '15 at 13:06
- @Rick hmm. The logical extension of Von Neumann's method to eliminating modulo bias when generating a random number between, say, 1 and 100, would be: A) call `rand() % 100` 100 times. B) if all the results are different, take the first one. C) otherwise, GOTO A. This will work, but with an expected number of iterations of about 10^42, you will have to be quite patient. And immortal. – Mark Amery Mar 27 '16 at 11:25
- @MarkAmery Indeed that should work. Looking over this algorithm though, it's not correctly implemented. The first else should be: `else if(prev==2) prev= x1; else { if(prev!=x1) return prev; prev=2;}` – Rick Mar 28 '16 at 12:58