I am trying to make the fastest possible high-quality RNG. After reading http://xorshift.di.unimi.it/ , I think xorshift128+ looks like a good option. The C code is:
#include <stdint.h>

uint64_t s[2];

uint64_t next(void) {
    uint64_t s1 = s[0];
    const uint64_t s0 = s[1];
    s[0] = s0;
    s1 ^= s1 << 23; // a
    return (s[1] = (s1 ^ s0 ^ (s1 >> 17) ^ (s0 >> 26))) + s0; // b, c
}
Sadly, I am not an SSE/AVX expert, but my CPU supports the SSE4.1, SSE4.2, AVX, F16C, FMA3 and XOP instructions. How could these be used to speed up this code (assuming I want to generate billions of such random numbers), and what speedup can realistically be expected in practice?
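To make the question concrete, here is a minimal sketch of the most obvious vectorisation I can think of: two independent xorshift128+ streams, one per 64-bit lane of an SSE2 register. The names `xs128p_x2`, `seed_x2`, `next_x2` and `next_scalar` are hypothetical and only for illustration, and I am assuming that interleaving two separately seeded streams is acceptable, since the result is not the same sequence as a single scalar generator.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* Scalar reference: same update as next() above, but with explicit state. */
static uint64_t next_scalar(uint64_t s[2]) {
    uint64_t s1 = s[0];
    const uint64_t s0 = s[1];
    s[0] = s0;
    s1 ^= s1 << 23;
    s[1] = s1 ^ s0 ^ (s1 >> 17) ^ (s0 >> 26);
    return s[1] + s0;
}

/* Hypothetical two-stream state: lane 0 holds stream a, lane 1 holds stream b. */
typedef struct {
    __m128i s0; /* s[0] of both streams */
    __m128i s1; /* s[1] of both streams */
} xs128p_x2;

/* Load two scalar states into the lanes; neither state may be all zero. */
static void seed_x2(xs128p_x2 *g, const uint64_t a[2], const uint64_t b[2]) {
    assert((a[0] | a[1]) != 0 && (b[0] | b[1]) != 0);
    g->s0 = _mm_set_epi64x((long long)b[0], (long long)a[0]);
    g->s1 = _mm_set_epi64x((long long)b[1], (long long)a[1]);
}

/* Advance both streams one step; out[0] is from stream a, out[1] from b. */
static void next_x2(xs128p_x2 *g, uint64_t out[2]) {
    __m128i s1 = g->s0;
    const __m128i s0 = g->s1;
    g->s0 = s0;
    s1 = _mm_xor_si128(s1, _mm_slli_epi64(s1, 23));
    g->s1 = _mm_xor_si128(
        _mm_xor_si128(s1, s0),
        _mm_xor_si128(_mm_srli_epi64(s1, 17), _mm_srli_epi64(s0, 26)));
    _mm_storeu_si128((__m128i *)out, _mm_add_epi64(g->s1, s0));
}
```

This only uses SSE2 operations. As far as I understand, the 256-bit integer shifts and adds that would allow four lanes need AVX2, which is not in the list above, so I do not know whether a layout like this can go beyond two outputs per step on my CPU or whether it beats the scalar loop at all.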