Very fast uniform distribution random number generator

Question

As part of a Monte Carlo simulation, I have to roll a group of dice until certain values show up a certain amount of times. My code that does this calls upon a dice class which generates a random number between 1 and 6, and returns it. Originally the code looked like

public void roll() {
    value = (int)(Math.random()*6) + 1;
}

and it wasn't very fast. By exchanging Math.random() for

ThreadLocalRandom.current().nextInt(1, 7);

It ran a section in roughly 60% of the original time, which called this around about 250 million times. As part of the full simulation it will call upon this method billions of times at the very least, so is there any faster way to do this?

Your result gives you a hint: parallelization should extend to your whole algorithm, not just the die rolls. — duffymo, Nov 05 '14 at 02:26
I'm not sure I can run the algorithm in chunks, because the number of times it runs depends on the rolled values, it could terminate after calling the random function 6 times — spyr03, Nov 05 '14 at 02:39
What *is* the algorithm? As duffymo indicated, if you divide the work at the top level then you can have multiple threads crunch numbers in parallel. Without seeing the actual algorithm/problem we can only offer general advice and wild guesses. — DarthGizka, Nov 05 '14 at 09:18
I see, the algorithm is to take a group of dice of n size, roll them until a condition has occured, when it occurs count how many rolls it took, and repeat this up to 1,000,000 times, which I guess could be threaded. The results are graphed, so the more iterations it undergoes the more accurate the results are, and such the better the graph. the point of the graph is to get a rough idea what big O complexity it is. — spyr03, Nov 05 '14 at 15:53

DarthGizka · Accepted Answer · 2014-11-05T11:15:02.797

17

Pick a random generator that is as fast and as good as you need it to be, and that isn't slowed down to a tiny fraction of its normal speed by thread safety mechanisms. Then pick a method of generating the [1..6] integer distribution that is a fast and as precise as you need it to be.

The fastest simple generator that is of sufficiently high quality to beat standard tests for PRNGs like TestU01 (instead of failing systematically, like the Mersenne Twister) is Sebastiano Vigna's xorshift64*. I'm showing it as C code but Sebastiano has it in Java as well:

uint64_t xorshift64s (int64_t &x)
{
   x ^= x >> 12;
   x ^= x << 25;
   x ^= x >> 27;

   return x * 2685821657736338717ull;
}

Sebastiano Vigna's site has lots of useful info, links and benchmark results. Including papers, for the mathematically inclined.

At that high resolution you can simply use 1 + xorshift64s(state) % 6 and the bias will be immeasurably small. If that is not fast enough, implement the modulo division by multiplication with the inverse. If that is not fast enough - if you cannot afford two MULs per variate - then it gets tricky and you need to come back here. xorshift1024* (Java) plus some bit trickery for the variate would be an option.

Batching - generating an array full of numbers and processing that, then refilling the array and so on - can unlock some speed reserves. Needlessly wrapping things in classes achieves the opposite.

P.S.: if ThreadLocalRandom and xorshift* are not fast enough for your purposes even with batching then you might be going about things in the wrong way, or you might be doing it in the wrong language. Or both.

P.P.S.: in languages like Java (or C#, or Delphi), abstraction is not free, it has a cost. In Java you also have to reckon with things like mandatory gratuitous array bounds checking, unless you have a compiler that can eliminate those checks. Teasing high performance out of a Java program can get very involved... In C++ you get abstraction and performance for free.

edited Nov 05 '14 at 11:15

answered Nov 05 '14 at 06:59

DarthGizka

4,347
1
24
36

1

There are further optimzations: first, use the 64-bit Xorshift to fill a ring buffer with random bytes, but then grab the bytes one at a time when actually rolling a die (actually, you only need three random bits, but grabbing bytes eliminates a lot of shifting and masking). Then, use rejection sampling to get your 3-bit value squeezed down to 1..6. This will reject 25% of bytes, but it eliminates division and is still getting 6 rolls for each 64-bit pass of Xorshift. In C I can simulate *billions* of hands of blackjack in under a minute. – Lee Daniel Crocker Nov 05 '14 at 17:27
Right, that neatly solves the division problem. However, if we were talking C/asm level performance here you would trade it for a 25% chance at a mispredicted jump, which costs about five times as much as a MUL... I wish someone really needed something like 10^9 die rolls per second, so that we had an excuse for pushing the limits. :-) In any case your solution is very, very hard to beat. I like it. – DarthGizka Nov 05 '14 at 17:41
Dang thing doesn't let me edit my comment... I want to revise my assessment of Lee's solution to 'fanta-fricken-tastic and quite probably impossible to beat'. Credit where credit is due. – DarthGizka Nov 05 '14 at 17:56
Author asked for *uniform* random numbers. @DarthGizka, may be you can say something about generation such numbers with xorshift? – kelin Jun 02 '16 at 12:11
@kelin: The author asked about the performance of generating uniform *integer* numbers, which is what we've all been talking about. If your question is about generating of uniform *floating point* numbers then that's a different kettle of fish, and a comment is too short for a reasonable treatment (beyond saying that care needs to be taken because naive approaches tend to produce biassed results). You really should ask this as a separate question - provided it hasn't been asked already. – DarthGizka Jun 02 '16 at 14:14
@kelin: The question has been asked several times already; see [How do you generate a random double uniformly distributed between 0 and 1 from C++?](http://stackoverflow.com/a/26867455/4156577), for example, where my answer has links to relevant papers. – DarthGizka Jun 02 '16 at 14:57
@DarthGizka, thank you. Both integer and float numbers are interesting to me. – kelin Jun 02 '16 at 19:40

score 2 · Answer 2 · answered Nov 05 '14 at 17:38

Darth is correct that Xorshift* is probably the best generator to use. Use it to fill a ring buffer of bytes, then fetch the bytes one at a time to roll your dice, refilling the buffer when you've fetched enough. To get the actual die roll, avoid division and bias by using rejection sampling. The rest of the code then looks something like this (in C):

do {
    if (bp >= buffer + sizeof buffer) {
        // refill buffer with Xorshifts
    }
    v = *bp++ & 7;
} while (v > 5);
return v;

This will allow you to get on average 6 die rolls per 64-bit random value.

Very fast uniform distribution random number generator

2 Answers2

Linked