Generate random numbers in a given range with AVX2, faster than SVML _mm256_rem_epu32 remainder?

Question

I'm currently trying to implement an XOR_SHIFT Random Number Generator using AVX2, it's actually quite easy and very fast. However I need to be able to specify a range. This usually requires modulo.

This is a major problem for me for 2 reasons:

Adding the _mm256_rem_epu32() / _mm256_rem_epi32() SVML function to my code takes the run time of my loop from around 270ms to 1.8 seconds. Ouch!
SVML is only available on MSVC and Intel Compilers

Are the any significantly faster ways to do modulo using AVX2?

Non Vector Code:

 std::srand(std::time(nullptr));
 std::mt19937_64 e(std::rand());

 uint32_t seed = static_cast<uint32_t>(e());

 for (; i != end; ++i)
 {
      seed ^= (seed << 13u);
      seed ^= (seed >> 7u);
      seed ^= (seed << 17u);

      arr[i] = static_cast<T>(low + (seed % ((up + 1u) - low)));
 }//End for

Vectorized:

  constexpr uint32_t thirteen = 13u;
  constexpr uint32_t seven = 7u;
  constexpr uint32_t seventeen = 17u;

  const __m256i _one = _mm256_set1_epi32(1);
  const __m256i _lower = _mm256_set1_epi32(static_cast<uint32_t>(low));
  const __m256i _upper = _mm256_set1_epi32(static_cast<uint32_t>(up));
                           
  __m256i _temp = _mm256_setzero_si256();
  __m256i _res = _mm256_setzero_si256();
                                                            
  __m256i _seed = _mm256_set_epi32(
       static_cast<uint32_t>(e()),
       static_cast<uint32_t>(e()),
       static_cast<uint32_t>(e()),
       static_cast<uint32_t>(e()),
       static_cast<uint32_t>(e()),
       static_cast<uint32_t>(e()),
       static_cast<uint32_t>(e()),
       static_cast<uint32_t>(e())
  );

  for (; (i + 8uz) < end; ++i)
  {
       //Generate Random Numbers
       _temp = _mm256_slli_epi32(_seed, thirteen);
       _seed = _mm256_xor_si256(_seed, _temp);

       _temp = _mm256_srai_epi32(_seed, seven);
       _seed = _mm256_xor_si256(_seed, _temp);

       _temp = _mm256_slli_epi32(_seed, seventeen);
       _seed = _mm256_xor_si256(_seed, _temp);

       //Narrow
       _temp = _mm256_add_epi32(_upper, _one);
       _temp = _mm256_sub_epi32(_temp, _lower);
       _temp = _mm256_rem_epu32(_seed, _temp); //Comment this line out for a massive speed up but incorrect results
       _res = _mm256_add_epi32(_lower, _temp);                                        

       _mm256_store_si256((__m256i*) &arr[i], _res);
  }//End for

For a compile-time-constant range, use a multiplicative inverse for division. For example, like I did in [What's the fastest way to generate a 1 GB text file containing random digits?](https://unix.stackexchange.com/a/324520) for chopping up entropy from AVX2 xorshift+ output into base-10 digits, using GNU C vector extensions to let the compiler generate that actual multiply constant and immediate-shift. (You could then of course port that asm back to intrinsics for use with other compilers.) In my case, 16-bit chunks was useful, and that's an element size x86 has SIMD multiply for. — Peter Cordes, Jan 02 '22 at 18:06
Almost a duplicate of [Vectorized Ranged Random number generation across all types](https://stackoverflow.com/q/53583550) which mentions libdivide. For your original title, just about doing modulo without talking about random numbers, it would be a duplicate of [SSE integer division?](https://stackoverflow.com/q/16822757) which mentions libdivide and VCL (https://github.com/vectorclass/version2/blob/fee0601edd3c99845f4b7eeb697cff0385c686cb/vectori128.h#L6240 - VCL has multiplicative-inverse in a way that works with constant-propagation and hoisting out of loops). — Peter Cordes, Jan 02 '22 at 18:15
https://stackoverflow.com/questions/30790184/perform-integer-division-using-multiplication The paper by Montgomery as commented by njuffa shows the generic algorithm (or handful of algorithms, since some moduli do not require all the post corrections steps) to divide by reciprocal multiplication. Thus, it's best to make a generator which returns the optimal sequence... — Aki Suihkonen, Jan 02 '22 at 18:15
Also, you know you can do `__m256i range = _mm256_set1_epi32(arg2 - arg1 + 1)`, right? No need to encourage silly compilers (*cough* MSVC) to actually do those loop-invariant operations inside the loop, or in SIMD registers at all. — Peter Cordes, Jan 02 '22 at 18:21
Could provide a more detailed explanation of how I might use these methods here. As far as I can tell all your suggestions require compile-time or run-time constants _seed is neither. Also Agner Fog's Vector class only allows dividing by single integer constants, not an entire vector. — dave_thenerd, Jan 02 '22 at 18:48
Instead of `rand() % range` you usually get similarly well distributed numbers using `(rand() * range) / RAND_MAX` (it really depends a lot on your RNG). This is also not completely trivial with AVX2 for 32bit integers, but it should be much faster than calculating a remainder. — chtz, Jan 02 '22 at 20:24
@chtz Thanks but I want to have upper and lower bounds. (rand() * range) / RAND_MAX only works if rand() returns a number which when multiplied by range is less than the upper bounds. This also requires a floating point divide which is awkward in AVX when you're working with integers. — dave_thenerd, Jan 02 '22 at 21:43
Multiply-high also works for range reduction and is less awkward. E: eg `mulhi(rand, range)` would put the result in `[0 .. range)` if `rand` is a random 32bit number. Intuition being that `rand` is effectively a Q32 number between 0 and 1 (exclusive upper bound) — harold, Jan 02 '22 at 21:55
@dave_thenerd: Do you need a different range for each vector element? Your current code uses the same divisor for all, so I don't see why that VCL design would be a problem. Re: being a compile-time constant or not; if you only have a few different ranges you need, and each use-case passes the same constant, you can make it a template parameter so it *is* a compile-time constant after inlining. But if that's not viable, then yeah you need to sort it out at run-time, hopefully reusing work across a whole loop. — Peter Cordes, Jan 02 '22 at 22:53
@dave_thenerd: re: having a non-zero lower bound. Obviously you do that separately from the range part, just an add at the end. (Unless you're converting integers to FP, in which case maybe fold that in to a final FMA, e.g. `fma(rand() * (high-low), 1./RAND_MAX, low)`. Of course with `float` you only have 24 bits of mantissa precision, so a range that's wider than 2^24 would end up with only even numbers. As chtz said, it's non-trivial, although using FP isn't the only approach. Exact integer division should also be possible with `vpmuludq` for multiplicative inverses.) — Peter Cordes, Jan 02 '22 at 22:59

Soonts · Accepted Answer · 2022-01-03T12:47:30.020

If you range is smaller than ~16.7 million, and you don’t need cryptography-grade quality of the distribution, an easy and relatively fast method of narrowing these random numbers is FP32 math.

Here’s an example, untested. The function below takes integer vector with random bits, and converts these bits into integer numbers in [ 0 .. range - 1 ] interval.

// Ideally, make sure this function is inlined,
// by applying __forceinline for vc++ or __attribute__((always_inline)) for gcc/clang
inline __m256i narrowRandom( __m256i bits, int range )
{
    assert( range > 1 );

    // Convert random bits into FP32 number in [ 1 .. 2 ) interval
    const __m256i mantissaMask = _mm256_set1_epi32( 0x7FFFFF );
    const __m256i mantissa = _mm256_and_si256( bits, mantissaMask );
    const __m256 one = _mm256_set1_ps( 1 );
    __m256 val = _mm256_or_ps( _mm256_castsi256_ps( mantissa ), one );

    // Scale the number from [ 1 .. 2 ) into [ 0 .. range ),
    // the formula is ( val * range ) - range
    const __m256 rf = _mm256_set1_ps( (float)range );
    val = _mm256_fmsub_ps( val, rf, rf );

    // Convert to integers
    // The instruction below always truncates towards 0 regardless on MXCSR register.
    // If you want ranges like [ -10 .. +10 ], use _mm256_add_epi32 afterwards
    return _mm256_cvttps_epi32( val );
}

When inlined, it should compile into 4 instructions, vpand, vorps, vfmsub132ps, vcvttps2dq Probably an order of magnitude faster than _mm256_rem_epu32 in your example.

I've looked at some other solutions and yours is by far the simplest and the fastest. I pray to God, Lisa Su, and Pat Gelsinger that one day we will get actual vector integer divide and modulo instructions in hardware. 'cause this is honestly a ridiculous capability to lack in my opinion. (Vector fmod would also be pretty nice...) — dave_thenerd, Jan 11 '22 at 05:32
How would I implement this for 64-Bit Integers? Just changing everything to use double precision (including setting mantissaMask = 0x7FFFFFFFFFFFFFFF) results in some numbers overflowing to or near UINT64_T::MAX. i.e. I want range 1-100 and get results like: 4,377,922,996,578,428,928 — dave_thenerd, Jan 13 '22 at 20:53
@dave_thenerd No idea what have you changed, but if you want random 64-bit numbers in [1 .. 100] range, use the same FP32 code twice as narrow (i.e. operating on 16-byte SSE vectors), plus `_mm256_cvtepi32_epi64` to upcast in the end. — Soonts, Jan 13 '22 at 21:05

Generate random numbers in a given range with AVX2, faster than SVML _mm256_rem_epu32 remainder?

1 Answers1