11

I am trying to make the fastest possible high quality RNG. Having read http://xorshift.di.unimi.it/ , xorshift128+ seems like a good option. The C code is

#include <stdint.h>
uint64_t s[ 2 ];

uint64_t next(void) { 
    uint64_t s1 = s[ 0 ];
    const uint64_t s0 = s[ 1 ];
    s[ 0 ] = s0;
    s1 ^= s1 << 23; // a
    return ( s[ 1 ] = ( s1 ^ s0 ^ ( s1 >> 17 ) ^ ( s0 >> 26 ) ) ) + s0; // b, c
}

I am not an SSE/AVX expert sadly but my CPU supports SSE4.1 / SSE4.2 / AVX / F16C / FMA3 / XOP instructions. How could you use these to speed up this code (assuming you want to make billions of such random numbers) and what is the expected limit to this speedup in practice?

Simd
  • 19,447
  • 42
  • 136
  • 271

2 Answers2

8

For anyone else who might reach this question, I think this C++ code implements correctly 4 xorshift128plus generators running in parallel, using AVX2:

__m256i xorshift128plus_avx2(__m256i &state0, __m256i &state1)
{
    __m256i s1 = state0;
    const __m256i s0 = state1;
    state0 = s0;
    s1 = _mm256_xor_si256(s1, _mm256_slli_epi64(s1, 23));
    state1 = _mm256_xor_si256(_mm256_xor_si256(_mm256_xor_si256(s1, s0),
                                               _mm256_srli_epi64(s1, 18)),
                              _mm256_srli_epi64(s0, 5));
    return _mm256_add_epi64(state1, s0);
}

The scalar implementation I used is:

u64 xorshift128plus(u64 &state0, u64 &state1)
{
    u64 s1 = state0;
    const u64 s0 = state1;
    state0 = s0;
    s1 ^= s1 << 23;                              // a
    state1 = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5); // b, c
    return state1 + s0;
}

Which is the same one in the xorshiftplus paper. Note that the right-shift constants from the original question do not correspond to the ones in the paper.

o9000
  • 1,626
  • 13
  • 15
  • 3
    You could use a different order of operations to get more instruction-level parallelism, in case the compiler doesn't know that `_mm256_xor` is associative. `s1^s0` can run in parallel with the shifts. Oops NVM, I misread and s1^s0 is actually in the innermost nesting. I find big nesting of intrinsics in a single expression pretty unreadable, and find it better to name temporary variables with descriptive names for more readability. So [gcc does make good code here](http://goo.gl/PJJRzQ). – Peter Cordes Mar 06 '16 at 03:34
  • Indeed, I was counting on the compiler to figure that out. BTW, that site is a gem! Bookmarked. – o9000 Mar 06 '16 at 11:59
  • The constants in the question are what's on wikipedia, and are the first entry in one of the tables in the paper. IDK which constants are the best choice, but the currently-published ones in http://xoroshiro.di.unimi.it/xorshift128plus.c do match this, not Wikipedia. – Peter Cordes Nov 18 '16 at 21:48
7

XorShift is indeed a good choice. It is so good, so fast and requires so little state that I'm surprised to see so little adoption. It should be the standard generator on all platforms. I have implemented it myself 8 years ago and even then it could generate 800MB/s of random bytes.

You cannot use vector instructions to speed up generating a single random number. There is too little instruction-level parallelism in those few instructions.

But you can easily speed up generating N numbers where N is the vector size of your target instruction set. Just run N generators in parallel. Keep state for N generators and generate N numbers at the same time.

If client code demands numbers one at a time you could keep a buffer of N (or more) numbers. If the buffer is empty you fill it using vector instructions. If the buffer is not empty you just return the next number.

usr
  • 168,620
  • 35
  • 240
  • 369
  • I think xorshift itself isn't good. However this "plus" version is which is what is interesting me. – Simd Jun 02 '14 at 19:50
  • @user2179021 it passes all but one of the die hard tests AFAIK. So it is pretty good. Anyway, did I answer your question? – usr Jun 02 '14 at 20:07
  • 1
    I was really looking for some AVX/SSE code or something specific about how much my code could potentially be sped up. – Simd Jun 02 '14 at 21:23
  • 4
    You can run two such generators in parallel using SSE - you would need to seed them differently of course, otherwise they would just be generating the same values for each. Use `_mm_srli_epi64` for the right shift, `_mm_xor_si128` for the XOR and `_mm_add_epi64` for the add. – Paul R Jun 02 '14 at 21:35
  • @PaulR Thank you. Would you expect AVX to be any faster? – Simd Jun 03 '14 at 07:45
  • 2
    You don't have the equivalent operations in AVX unfortunately - you would need AVX2, which is available on Haswell and later - that would enable you to run 4 x 64 bit RNGs in parallel rather than 2. – Paul R Jun 03 '14 at 07:52
  • Actually AVX1 would still save `movdqa` copy instructions for a 128b implementation. So the code would be a little smaller and probably run a little faster, since xorshift+ benefits from non-destructive shift operations that leave the original around. – Peter Cordes Dec 20 '15 at 22:34