Effective Java Item 47: Know and use your libraries - Flawed random integer method example

Question

In the example Josh gives of the flawed random method that generates a positive random number with a given upper bound n, I don't understand the two of the flaws he states.

The method from the book is:

private static final Random rnd = new Random();

//Common but deeply flawed
static int random(int n) {
    return Math.abs(rnd.nextInt()) % n;
}

He says that if n is a small power of 2, the sequence of random numbers that are generated will repeat itself after a short period of time. Why is this the case? The documentation for Random.nextInt() says Returns the next pseudorandom, uniformly distributed int value from this random number generator's sequence. So shouldn't it be that if n is a small integer then the sequence will repeat itself, why does this only apply to powers of 2?
Next he says that if n is not a power of 2, some numbers will be returned on average more frequently than others. Why does this occur, if Random.nextInt() generates random integers that are uniformly distributed? (He provides a code snippet which clearly demonstrates this but I don't understand why this is the case, and how this is related to n being a power of 2).

I'm amused that the author overlooked the biggest flaw: this code will sometimes return negative numbers! — Mooing Duck, Jan 05 '15 at 22:53
The docs for `Math.abs` states: "Note that if the argument is equal to the value of `Integer.MIN_VALUE`, the most negative representable int value, the result is that same value, which is negative." http://docs.oracle.com/javase/7/docs/api/java/lang/Math.html#abs(int) (I also confirmed that `rnd.nextInt()` can indeed return `Integer.MIN_VALUE`). A negative value modulo a positive `n` results in a negative value. — Mooing Duck, Jan 06 '15 at 00:01
@Mooing, actually he does state the flaw you pointed out, I just didn't ask about it — Derek Mok, Jan 06 '15 at 09:15

score 38 · Accepted Answer · edited Apr 13 '17 at 12:48

Question 1: if n is a small power of 2, the sequence of random numbers that are generated will repeat itself after a short period of time.

This is not a corollary of anything Josh is saying; rather, it is simply a known property of linear congruential generators. Wikipedia has the following to say:

A further problem of LCGs is that the lower-order bits of the generated sequence have a far shorter period than the sequence as a whole if m is set to a power of 2. In general, the n-th least significant digit in the base b representation of the output sequence, where b^k = m for some integer k, repeats with at most period bⁿ.

This is also noted in the Javadoc:

Linear congruential pseudo-random number generators such as the one implemented by this class are known to have short periods in the sequence of values of their low-order bits.

The other version of the function, Random.nextInt(int), works around this by using different bits in this case (emphasis mine):

The algorithm treats the case where n is a power of two specially: it returns the correct number of high-order bits from the underlying pseudo-random number generator.

This is a good reason to prefer Random.nextInt(int) over using Random.nextInt() and doing your own range transformation.

Question 2: Next he says that if n is not a power of 2, some numbers will be returned on average more frequently than others.

There are 2³² distinct numbers that can be returned by nextInt(). If you try to put them into n buckets by using % n, and n isn't a power of 2, some buckets will have more numbers than others. This means that some outcomes will occur more frequently than others even though the original distribution was uniform.

Let's look at this using small numbers. Let's say nextInt() returned four equiprobable outcomes, 0, 1, 2 and 3. Let's see what happens if we applied % 3 to them:

0 maps to 0
1 maps to 1
2 maps to 2
3 maps to 0

As you can see, the algorithm would return 0 twice as frequently as it would return each of 1 and 2.

This does not happen when n is a power of two, since one power of two is divisible by the other. Consider n=2:

0 maps to 0
1 maps to 1
2 maps to 0
3 maps to 1

Here, 0 and 1 occur with the same frequency.

Additional resources

Here are some additional -- if only tangentially relevant -- resources related to LCGs:

Spectral tests are statistical tests used to assess the quality of LCGs. Read more here and here.
A collection of classical pseudorandom number generators with linear structures has some pretty scatterplots (the generator used in Java is called DRAND48).
There is an interesting discussion on crypto.SE about predicting values from Java's generator.

I realize I'm three years late here, but just wanted to chime in that, while the effect of dividing 2^32 values amongst 3 bins will result in nearly-negligible differences between the bin sizes, it becomes far more noticeable if you increase the number of bins. For example, `3 * (Integer.MAX_VALUE / 4)` bins will result in ~1/3 of the bins will end up with double the amount of entries, on average. — Ironcache, Feb 20 '18 at 15:12

Dima · Answer 2 · 2015-01-05T12:26:23.240

5

1) When n is a power of 2, rnd % n is equivalent to selecting a few lower bits of the original. Lower bits of numbers generated by the type of generators used by java are known to be "less random" than the higher bits. It's just the property of the formula used for generating the numbers.

2) Imagine, that the largest possible value, returned by random() is 10, and n = 7. Now doing n % 7 maps numbers 7, 8, 9 and 10 into 0, 1, 2, 3 respectively. Therefore, if the original number is uniformly distributed, the result will be heavily biased towards the lower numbers, because they will appear twice as often as 4, 5 and 6. In this case, this does happen regardless of whether n is a power of two or not, but, if instead of 10 we chose, say, 15 (which is 2^4-1), then any n, that is a power of two would result in a uniform distribution, because there would be no "excess" numbers left at the end of the range to cause bias, because the total number of possible values would be exactly divisible by the number of possible remainders.

edited Jan 05 '15 at 12:26

answered Jan 05 '15 at 12:19

Dima

39,570
6
44
70

1

Personally I think the second claim is more or less complete tosh. The maximum value isn't 10, it's 2^32-1, so worst case (on average) you might get a +/-1 discrepancy in the number of items per bin. The number of times the "left-over" numbers can appear will be very small, e.g. if n = 100 then there's only a minute fraction of a %age chance that they'll even be picked. – Alnitak Jan 05 '15 at 12:27
Yes, I corrected the wording ... you caught me in the middle of editing it :) – Dima Jan 05 '15 at 12:27
2

@Alnitak, yes, for small `n` the discrepancy is not very noticeable. By if `n` is something like `2*Integer.MAX_INT/3` for example, you will get numbers in the lower half of the range appear twice as often as the others. – Dima Jan 05 '15 at 12:29
1

Yes, it would need to be in that order of magnitude for it to make any significant difference. That's a highly unusual use (IME) for an RNG, though. – Alnitak Jan 05 '15 at 13:38
@Alnitak, no, outside of college-level "toy programs", the large ranges for random numbers have actually more use than smaller ones. – Dima Jan 05 '15 at 14:24

Effective Java Item 47: Know and use your libraries - Flawed random integer method example

2 Answers2

Linked