
I recently discovered that the `_mm_crc32_*` family of Intel intrinsics can be used to generate (pseudo-)random 32-bit numbers.

```c
#include <nmmintrin.h> /* needs CRC32C instruction from the SSE4.2 instruction set extension */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t rnd = 1; /* initialize with any seed != 0 */

    /* period length is 4,294,967,295 = 2^32 - 1 */
    while (1) {
#if 0   /* the version I used previously; worse than xorshift32 (fails more tests) */
        rnd = _mm_crc32_u8(rnd, rnd >> 3);
#else   /* this is faster and better than xorshift32 (fails fewer tests) */
        rnd = _mm_crc32_u32(rnd, rnd << 18);
#endif
        printf("%08X\n", rnd);
    }
}
```

This method is as fast as an LCG, and faster than xorshift32. Wikipedia says that because xorshift generators "fail a few statistical tests, they have been accused of being unreliable".

Now I am wondering whether the CRC32C method passes the various statistical tests that are applied to random number generators. So far I only verified that each bit, even the LSB, looks "random" by trying to compress the output with PAQ8 compressors (which failed to compress it). Can someone help me do better tests?

EDIT: Using tests from the suggested TestU01 suite, the method that I previously used turned out to be worse than xorshift32. I have updated the source code above in case anyone is interested in using the better version.

Christoph Feck
    What is "good"? Where are these numbers used? Do they need to be unpredictable by third parties? – tadman Apr 16 '18 at 22:51
    Compressibility is a very weak test of RNG quality, especially with any general-purpose compressor. Obviously the [Kolmogorov complexity](https://en.wikipedia.org/wiki/Kolmogorov_complexity) of the stream is just the starting value + a CRC algorithm or the code in the question. Anyone that knows how you generate your random numbers can generate all future ones given one 32-bit value. – Peter Cordes Apr 16 '18 at 23:04
  • I am interested in whether the presented generator is worse or better than xorshift32 in the statistical tests that are applied to such generators. A pseudo-random number generator is always predictable, and I have no problem with it being predictable. – Christoph Feck Apr 16 '18 at 23:04
  • A *good* PRNG has an internal state larger than the return value. One result isn't the full seed for future results. xorshift32 has the same weakness, though, and just uses shift + xor on a single 32-bit integer. So that's an interesting idea; CRC32c is also based on shifts and xor. Note that https://en.wikipedia.org/wiki/Xorshift is saying that even xorshift128 fails a few statistical tests. I assume xorshift32 is significantly worse. BTW, why `_mm_crc32_u8` and not `_mm_crc32_u32`? And why use `rnd >> 3` instead of a constant or something? – Peter Cordes Apr 16 '18 at 23:11
    I tested various combinations of shift direction, length, and _u8 vs. _u32. This one turned out to have the longest period, together with `_mm_crc32_u32(rnd, rnd << 18);` and `_mm_crc32_u32(rnd, rnd << 12);` – Christoph Feck Apr 16 '18 at 23:15
    There is a description of the tests used in this paper; basically it looks at distribution over time and range: http://www.iro.umontreal.ca/~lecuyer/myftp/papers/xorshift.pdf – Garr Godfrey Apr 16 '18 at 23:15
  • Regarding using a constant, this indeed was even faster, but only had half the period length (2^31). – Christoph Feck Apr 16 '18 at 23:17
  • On x86 with BMI2, each RNG step should only require `rorx edx, eax, 3` / `crc32 eax, dl`. On Skylake, that's 2 uops with total latency = 1 + 3 cycles. (Using a constant would make it just 3 cycles, no extra ALU uop in the critical path from CRC output back to CRC input.) Without BMI2 for copy+rotate `rorx`, `mov edx, eax` / `shr edx,3` is also only 1c latency on CPUs that have mov-elimination (Ivybridge+ and Ryzen). But it's 2 uops so there's more impact on surrounding code. *Real use cases will usually not bottleneck on the RNG's loop-carried dep chain, but on code using the results.* – Peter Cordes Apr 16 '18 at 23:19
  • Instruction latency / port usage from http://agner.org/optimize/ – Peter Cordes Apr 16 '18 at 23:23
  • @GarrGodfrey, the TestU01 test suite referenced from that paper looks useful, merci :) – Christoph Feck Apr 16 '18 at 23:44

3 Answers


This is an interesting question. Ultimately the only test that counts is "does this RNG yield correct results for the problem I'm working on?" What are you hoping to do with the RNG?

To avoid answering that question anew for every problem, various general tests have been devised. See for example the "Diehard" tests devised by George Marsaglia. A web search for "marsaglia random number generator tests" turns up several interesting links.

Marsaglia's work is a couple of decades old at this point; I don't know how much work has been done on this topic since then. My guess is that for non-cryptographic purposes, an RNG which passes the Diehard tests is probably sufficient.

Robert Dodier
    [TestU01](https://en.wikipedia.org/wiki/TestU01) is a more recent PRNG test suite. From what I understand, developers of non-cryptographic PRNGs tend to use both Diehard and TestU01. For cryptographic PRNGs there is the [NIST SP 800-22](https://csrc.nist.gov/projects/random-bit-generation/documentation-and-software) test suite. – njuffa Apr 17 '18 at 04:33
  • Thanks to all who answered. I accepted this answer, but all other comments were helpful as well. – Christoph Feck Apr 17 '18 at 21:01

There's a big difference between the requirements for a PRNG for a video game (especially single-player) and for a Monte Carlo simulation. Small biases can be a problem for scientific numerical computing, but generally not for a game, especially if numbers from the same PRNG are used in different ways.

There's a reason that different PRNGs with different speed / quality tradeoffs exist.


This one is very fast, especially if the seed / state stays in a register, taking only 2 or 3 uops on a modern Intel CPU. So it's fantastic if it can inline into a loop. Compared to anything else of the same speed, it's probably better quality. But compared to something only a little bit slower with larger state, it's probably pathetic if you care about statistical quality.


On x86 with BMI2, each RNG step should only require `rorx edx, eax, 3` / `crc32 eax, dl`. On Haswell/Skylake, that's 2 uops with total latency = 1 + 3 cycles for the loop-carried dependency. (http://agner.org/optimize/). Or 3 uops without BMI2, for `mov edx, eax` / `shr edx, 3` / `crc32 eax, dl`, but still only 4 cycles of latency on CPUs with zero-latency `mov` for GP registers: Ivybridge+ and Ryzen.

2 uops is negligible impact on surrounding code in the normal case where you do enough work with each PRNG result that the 4-cycle dependency chain isn't a bottleneck. (Or ~9 cycles if your compiler stores/reloads the PRNG state inside the loop instead of keeping it live in a register and sinking the store to the global until after the loop, costing 2 extra 1-uop instructions.)

On Ryzen, `crc32` is 3 uops with 3c total latency, so more impact on surrounding code, but the same one-per-4-clocks bottleneck if you're doing so little with the PRNG results that you bottleneck on the dependency chain.

I suspect you may have been benchmarking the loop-carried dependency chain bottleneck, not the impact on real surrounding code that does enough work to hide that latency. (Almost all relevant x86 CPUs are out-of-order execution.) Making an RNG even cheaper than xorshift128+, or even xorshift128, is likely to be of negligible benefit for most use-cases. xorshift128+ or xorshift128* is fast, and quite good quality for the speed.


If you want lots of PRNG results very quickly, consider using a SIMD xorshift128+ to run two or four generators in parallel (in different elements of XMM or YMM vectors). Especially if you can usefully use a `__m256i` vector of PRNG results. See AVX/SSE version of xorshift128+, and also this answer where I used it.


Returning the entire state as the RNG result is usually a bad thing, because it means that one value tells you exactly what the next one will be, i.e. 3 is always followed by 1897987234 (fake numbers), never by anything else. Most statistical quality tests should pick this up, but it might or might not be a problem for any given use-case.

Note that https://en.wikipedia.org/wiki/Xorshift says that even xorshift128 fails a few statistical tests. I assume xorshift32 is significantly worse. CRC32C is also based on XOR and shift (but with bit-reflection and polynomial arithmetic over GF(2)), so it's reasonable to think it might be similar or better in quality.

You say your choice of `crc32(rnd, rnd >> 3)` gives a period of 2^32, and that's the best you can do with a state that small. (Of course `rnd++` achieves the same period, so period isn't the only measure of quality.) It's likely at least as good as an LCG, but LCGs are not considered high quality, especially if the modulus is 2^32 (so you get it for free from fixed-width integer math).

I tested with a bitmap of previously seen values (NASM source), incrementing a counter until we reach a bitmap entry we've already seen. Indeed, all 2^32 values are seen before coming back to the original value. Since the period is exactly 0x100000000, that rules out any special values that are part of a shorter cycle.

Without memory access to the bitmap, Skylake does indeed run the loop at 4 cycles per iteration, as expected from the latency bottleneck. And `rorx`/`crc32` is just 2 total uops.

Peter Cordes
  • Nice analysis of actual cycles needed! Regarding your comments "it's reasonable to think it might be similar or better in quality [than xorshift32]" and "compared to anything else of the same speed, it's probably better quality", these are the questions I was hoping to be able to verify. – Christoph Feck Apr 17 '18 at 21:24
  • @ChristophFeck: Yeah, I know I'm not actually answering the question directly, I mostly wanted to post about the throughput vs. latency issue, to point out that your testing is probably not reflecting the real cost of using a PRNG as part of a larger program. A few extra cycles of latency will make a PRNG-only loop like 1.5x slower, but still have negligible impact on surrounding code if latency isn't the bottleneck. – Peter Cordes Apr 17 '18 at 22:05

One measure of the goodness of a PRNG is the length of its cycle. If that matters for your application, then a CRC-32 as you are using it would not be a good choice, since the cycle is only 2^32. One consequence is that if you use more samples than that, which doesn't take very long, your results will repeat. Another is that there is a correlation between successive CRC-32 values: there is only one possible value that can follow the current value.

Better PRNGs have exponentially longer cycles, and the values returned are smaller than the bits in the state, so that successive values do not have that correlation.

You don't need to use a CRC-32C instruction to be fast. Also you don't need to design your own PRNG, which is fraught with hidden perils. Best to leave that to the professionals. See this work for high-quality, small, and fast random number generators.

Mark Adler