5

Modulo bias is a problem that arises when naively using the modulo operation to get pseudorandom numbers smaller than a given "upper bound".

Therefore as a C programmer I am using a modified version of the arc4random_uniform() function to generate evenly distributed pseudorandom numbers.

The problem is I do not understand how the function works, mathematically.

This is the function's explanatory comment, followed by a link to the full source code:

/*
 * Calculate a uniformly distributed random number less than upper_bound
 * avoiding "modulo bias".
 *
 * Uniformity is achieved by generating new random numbers until the one
 * returned is outside the range [0, 2**32 % upper_bound).  This
 * guarantees the selected random number will be inside
 * [2**32 % upper_bound, 2**32) which maps back to [0, upper_bound)
 * after reduction modulo upper_bound.
 */

http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/lib/libc/crypt/arc4random_uniform.c?rev=1.1&content-type=text/x-cvsweb-markup

From the comment above we can define:

  • [2^32 % upper_bound, 2^32) - interval A
  • [0, upper_bound) - interval B

In order to work, the function relies on the fact that interval A maps to interval B.

My question is: mathematically, how come the numbers in interval A map uniformly to the ones in interval B? And is there a proof for this?

Log
  • 239
  • 1
  • 5
  • 1
    May I suggest this reading: http://ericlippert.com/2013/12/16/how-much-bias-is-introduced-by-the-remainder-technique/ – ouah Aug 27 '15 at 20:40
  • "Generating new random numbers until..." is a poor technique. I don't have the answer to hand, but it is better to scale the random number to the required range, than reject and waste time. Is this any use? http://stackoverflow.com/questions/10984974/why-do-people-say-there-is-modulo-bias-when-using-a-random-number-generator The whole idea of random numbers is fraught with difficulty, it is easy to confuse "random" with "evenly distributed". – Weather Vane Aug 27 '15 at 20:48
  • "... it is better to scale the random number to the required range, ..." This is in fact not possible :-) For instance, try sampling an integer uniformly from the set {1, 2, 3, 4, 5} using a single throw of a dice. – m7thon Aug 28 '15 at 00:29
  • @m7thon doesn't a die have 6 sides? – Weather Vane Aug 28 '15 at 00:48
  • @Weather yes, and ideally it lands on each of these with equal probability. – m7thon Aug 28 '15 at 00:56

1 Answers1

7

Sometimes it helps to start with an easily understood example, and then generalize from there. To keep things simple, let's imagine that arc4random returns a uint8_t instead of a uint32_t, so the output from arc4random is a number in the interval [0,256). And let's choose an upper_bound of 7.

Note that 7 does not divide evenly into 256

256 = 7 * 36 + 4

That means that naively using the modulo operation to get pseudorandom numbers smaller than 7 would result in the following probability distribution

37/256 for outcomes 0,1,2,3
36/256 for outcomes 4,5,6

That's what's known as modulo bias, outcomes 0,1,2,3 are more likely than outcomes 4,5,6.

To avoid modulo bias we could simply reject the values 252,253,254,255, and generate a new number until the result is in the interval [0,252). All the numbers in the interval [0,252) have equal probability (rejecting higher numbers doesn't effect the distribution of the lower numbers). And since 7 divides evenly into 252, the resulting probability distribution is uniform

 36/252 for outcomes 0,1,2,3,4,5,6,7

That's essentially what arc4random_uniform does, except that arc4random_uniform rejects numbers at the bottom of the range. Specifically, interval A would be

[2^8 % 7, 2^8) which is [4, 256)

After generating a number (call it N) in the interval [4,256) the final calculation is

outcome = N % 7

There are 252 numbers in the interval [4,256), and since 252 is a multiple of 7, every outcome on the interval [0,7) has equal probability.


That's how arc4random_uniform works, it rejects/retries on a small range of numbers, and the count of numbers in the remaining range is a multiple of the upper_bound. (Since the upper_bound is typically a small number compared to 2^32, the odds of having multiple retries for a single outcome is quite small.)

But do you really care about modulo bias? In most cases, the answer is, "No". Consider our example with an upper bound of 7. The probability distribution for the naive modulo implementation is

613566757 / 4294967296 for outcomes 0,1,2,3
613566756 / 4294967296 for outcomes 4,5,6

which is a modulo bias of less than 0.0000002%.

So there's your choice: either spend a minuscule amount of time on retries to get a perfect distribution, or accept a minuscule error in the probability distribution to avoid the retries.

user3386109
  • 34,287
  • 7
  • 49
  • 68
  • You can just compute `outcome = N % 7` for a number `N` from the interval `[4, 256)` in your example, no need to subtract 4. This is true in general. Subtracting before taking modulo just shifts the resulting random number, but does not change uniformity. – m7thon Aug 28 '15 at 00:19
  • @m7thon Yes, you are correct, of course. I've updated the answer, thanks! – user3386109 Aug 28 '15 at 01:01