There's a big difference in the requirements for a PRNG for a video game (especially single-player) vs. a Monte Carlo simulation. Small biases can be a problem for scientific numerical computing, but generally not for a game, especially if numbers from the same PRNG are used in different ways.
There's a reason that different PRNGs with different speed / quality tradeoffs exist.
This one is very fast, especially if the seed / state stays in a register, taking only 2 or 3 uops on a modern Intel CPU. So it's fantastic if it can inline into a loop. Compared to anything else of the same speed, it's probably better quality. But compared to something only a little bit slower with larger state, it's probably pathetic if you care about statistical quality.
On x86 with BMI2, each RNG step should only require rorx edx, eax, 3 / crc32 eax, dl. On Haswell/Skylake, that's 2 uops with total latency = 1 + 3 cycles for the loop-carried dependency (http://agner.org/optimize/). Or 3 uops without BMI2, for mov edx, eax / shr edx, 3 / crc32 eax, dl, but still only 4 cycles of latency on CPUs with zero-latency mov for GP registers: Ivybridge+ and Ryzen.
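As a rough C sketch of the step those instruction sequences implement (the names crc32c_u8 and crc_rng_next are mine, not from the question): it uses the _mm_crc32_u8 intrinsic when the compiler targets SSE4.2, and otherwise falls back to a bitwise software loop that I believe computes the same byte-wide CRC32C step. One thing to assume: since the CRC step is linear over GF(2), a state of 0 maps to 0, so the sketch expects a nonzero seed.

```c
#include <stdint.h>

#if defined(__SSE4_2__)
#include <nmmintrin.h>   /* _mm_crc32_u8 */
#endif

/* One byte-wide CRC32C accumulate step, like x86 `crc32 r32, r8`
 * (reflected polynomial 0x82F63B78, no initial or final inversion). */
static inline uint32_t crc32c_u8(uint32_t crc, uint8_t byte)
{
#if defined(__SSE4_2__)
    return _mm_crc32_u8(crc, byte);
#else
    /* Portable fallback: process the byte one bit at a time. */
    crc ^= byte;
    for (int i = 0; i < 8; i++)
        crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    return crc;
#endif
}

/* Hypothetical name: one PRNG step, rnd = crc32(rnd, rnd >> 3).
 * Only the low byte of (rnd >> 3) feeds the CRC, matching `crc32 eax, dl`.
 * The whole state is also the output. Seed with a nonzero value. */
static inline uint32_t crc_rng_next(uint32_t rnd)
{
    return crc32c_u8(rnd, (uint8_t)(rnd >> 3));
}
```

With -O2 and -msse4.2 -mbmi2 (or a suitable -march), I'd expect a compiler to turn crc_rng_next into something close to the rorx / crc32 pair, but check the generated asm rather than taking that on faith.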
2 uops has negligible impact on surrounding code in the normal case where you do enough work with each PRNG result that the 4-cycle dependency chain isn't a bottleneck. (Or ~9 cycles if your compiler stores/reloads the PRNG state inside the loop instead of keeping it live in a register and sinking the store to the global until after the loop, costing you 2 extra 1-uop instructions.)
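If the compiler doesn't hoist the state out of the loop on its own, you can do it by hand: copy the global into a local before the loop and store it back once afterwards. A minimal sketch, assuming a hypothetical global g_rng_state and a hypothetical helper fill_random_bytes; the step here is a portable software version of crc32(rnd, rnd >> 3), which you'd replace with the hardware instruction in real code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical global PRNG state; must be seeded nonzero. */
uint32_t g_rng_state = 1;

/* Portable stand-in for the step: rnd = crc32(rnd, (uint8_t)(rnd >> 3)). */
static inline uint32_t crc_rng_next(uint32_t rnd)
{
    uint32_t crc = rnd ^ (uint8_t)(rnd >> 3);
    for (int i = 0; i < 8; i++)
        crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    return crc;
}

/* Fill dst with n pseudo-random bytes. */
void fill_random_bytes(uint8_t *dst, size_t n)
{
    uint32_t rnd = g_rng_state;        /* load the state once */
    for (size_t i = 0; i < n; i++) {
        rnd = crc_rng_next(rnd);       /* dependency chain stays in a register */
        dst[i] = (uint8_t)(rnd >> 24); /* use the high byte of each result */
    }
    g_rng_state = rnd;                 /* store the state once, after the loop */
}
```

The byte stores through dst are the kind of thing that can force a reload of the global every iteration if the compiler can't prove they don't alias it; the local copy sidesteps that question.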
On Ryzen, crc32 is 3 uops with 3c total latency, so it has more impact on surrounding code, but the same one-per-4-clocks bottleneck if you're doing so little with the PRNG results that you bottleneck on that.
I suspect you may have been benchmarking the loop-carried dependency chain bottleneck, not the impact on real surrounding code that does enough work to hide that latency. (Almost all relevant x86 CPUs use out-of-order execution.) Making an RNG even cheaper than xorshift128+, or even xorshift128, is likely to be of negligible benefit for most use-cases. xorshift128+ or xorshift128* is fast, and quite good quality for the speed.
If you want lots of PRNG results very quickly, consider using a SIMD xorshift128+ to run two or four generators in parallel (in different elements of XMM or YMM vectors). Especially if you can usefully use a __m256i vector of PRNG results. See AVX/SSE version of xorshift128+, and also this answer where I used it.
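To illustrate the idea without intrinsics, here is a portable sketch of four independent xorshift128+ generators stored struct-of-arrays, so each loop iteration corresponds to one 64-bit vector element. This is not the code from the linked answers; the type and function names are hypothetical, and the shift constants (23, 18, 5) are the ones from the Wikipedia Xorshift article. A compiler may or may not auto-vectorize the loop; a hand-written AVX2 version would do the same operations on __m256i values.

```c
#include <stdint.h>

#define LANES 4

/* State for LANES independent xorshift128+ generators (struct-of-arrays).
 * Each lane needs its own seed, and no lane may be all-zero. */
typedef struct {
    uint64_t s0[LANES];
    uint64_t s1[LANES];
} xs128p_x4;

/* Scalar reference step for one generator. */
static inline uint64_t xs128p_next(uint64_t s[2])
{
    uint64_t t = s[0];
    const uint64_t u = s[1];
    s[0] = u;
    t ^= t << 23;
    t ^= t >> 18;
    t ^= u ^ (u >> 5);
    s[1] = t;
    return t + u;
}

/* Advance all lanes by one step and write one result per lane to out.
 * Every lane does the same shifts, XORs and add, with no cross-lane
 * dependency, which is what makes this a good fit for SIMD. */
static inline void xs128p_x4_next(xs128p_x4 *g, uint64_t out[LANES])
{
    for (int i = 0; i < LANES; i++) {
        uint64_t t = g->s0[i];
        const uint64_t u = g->s1[i];
        g->s0[i] = u;
        t ^= t << 23;
        t ^= t >> 18;
        t ^= u ^ (u >> 5);
        g->s1[i] = t;
        out[i] = t + u;
    }
}
```

Each call yields four 64-bit results, so the per-result latency cost of the dependency chain is divided across the lanes.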
Returning the entire state as the RNG result is usually a bad thing, because it means that one value tells you exactly what the next one will be. i.e. 3 is always followed by 1897987234 (fake numbers), never 3 followed by something else. Most statistical quality tests should pick this up, but this might or might not be a problem for any given use-case.
Note that https://en.wikipedia.org/wiki/Xorshift says that even xorshift128 fails a few statistical tests. I assume xorshift32 is significantly worse. CRC32c is also based on XOR and shift (but also with bit-reflect and a polynomial modulo over Galois Field(2)), so it's reasonable to think it might be similar or better in quality.
You say your choice of crc32(rnd, rnd>>3) gives a period of 2^32, and that's the best you can do with a state that small. (Of course rnd++ achieves the same period, so it's not the only measure of quality.) It's likely at least as good as an LCG, but those are not considered high quality, especially if the modulus is 2^32 (so you get it for free from fixed-width integer math).
I tested with a bitmap of previously seen values (NASM source), incrementing a counter until we reach a bitmap entry we've already seen. Indeed, all 2^32 values are seen before coming back to the original value. Since the period is exactly 0x100000000, that rules out any special values that are part of a shorter cycle.
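The bitmap test can be sketched in C roughly like this. It's not my NASM source, just the same idea under some assumptions: the function name, the function-pointer interface and the state_bits parameter are all hypothetical, added so the routine can be tried on a toy generator with a small state before spending 512 MiB on a full 32-bit run.

```c
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t (*step_fn)(uint32_t);

/* Count steps from `start` until the generator produces a value that has
 * already been seen. Uses one bit per possible state, so state_bits = 32
 * needs a 512 MiB bitmap. Every value returned by `step` must fit in
 * state_bits bits (state_bits <= 32). Returns 0 if allocation fails.
 * If `repeated` is non-NULL, it receives the first repeated value. */
static uint64_t steps_until_repeat(step_fn step, uint32_t start,
                                   unsigned state_bits, uint32_t *repeated)
{
    const uint64_t domain = 1ull << state_bits;
    uint8_t *seen = calloc((size_t)(domain / 8 + 1), 1);
    if (!seen)
        return 0;

    uint32_t x = start;
    uint64_t count = 0;
    while (!(seen[x >> 3] & (1u << (x & 7)))) {
        seen[x >> 3] |= (uint8_t)(1u << (x & 7));  /* mark x as seen */
        x = step(x);
        count++;
    }
    free(seen);
    if (repeated)
        *repeated = x;
    return count;
}

/* Toy 8-bit LCG for trying the routine out: x = (5*x + 1) mod 256.
 * By the Hull-Dobell conditions this has full period 256. */
static uint32_t lcg8_step(uint32_t x)
{
    return (5u * x + 1u) & 0xFFu;
}
```

If the first repeated value is the starting value, the start lies on a pure cycle and the returned count is that cycle's length; for the toy LCG, steps_until_repeat(lcg8_step, 0, 8, &r) should give 256 with r equal to 0.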
Without memory access to the bitmap, Skylake does indeed run the loop at 4 cycles per iteration, as expected from the latency bottleneck. And rorx/crc32 is just 2 total uops.