9

Consider a randomly generated __m256i vector. Is there a faster precise way to convert it into an __m256 vector of floats between 0 (inclusive) and 1 (exclusive) than division by float(1ull<<32)?

Here's what I have tried so far, where iRand is the input and ans is the output:

const __m256 fRand = _mm256_cvtepi32_ps(iRand);
const __m256 normalized = _mm256_div_ps(fRand, _mm256_set1_ps(float(1ull<<32)));
const __m256 ans = _mm256_add_ps(normalized, _mm256_set1_ps(0.5f));
Serge Rogatch
  • 6
    Multiply by `0x1p-31f`. Division is generally slower or more resource-consuming than multiplication. – Eric Postpischil Feb 25 '19 at 15:46
  • 6
    Also, since you are immediately adding, look into the SIMD fused multiply-add instructions. – Eric Postpischil Feb 25 '19 at 15:52
  • 1
    @EricPostpischil , thanks a lot. I was thinking more of some bit twiddling magic, but your simple suggestions should still make it much faster than my original version. – Serge Rogatch Feb 25 '19 at 15:55
  • 2
    I use bit twiddling when it is appropriate. Sometimes when there are fewer than 24 bits, you can patch them into the significand field of a `float` and use floating-point arithmetic to finish the job. But you are apparently supporting 31 bits (and a sign), so they have to be rounded. The convert instructions are designed for that, so you are not likely to do better. – Eric Postpischil Feb 25 '19 at 16:20
  • GCC 5 or later will compile your code to a convert and a FMA: https://godbolt.org/z/JadwrG. Clang also replaces the division by a multiplication, but requires `-ffast-math` (or some part of that) to fuse the addition and multiplication. – chtz Feb 25 '19 at 16:48
  • 1
    I don't think rounding is appropriate here, since it needs to be 1 exclusively... – Antti Haapala -- Слава Україні Feb 25 '19 at 18:58
  • Also, how about subnormal numbers? – Antti Haapala -- Слава Україні Feb 25 '19 at 19:01
  • @AnttiHaapala: Yes, good point, vcvtdq2ps will probably round up when converting `0x7fffffff` to `float`. FMA doesn't have intermediate rounding, but the convert is a problem for inputs outside the +- 2^24 range where `float` can exactly represent every integer. – Peter Cordes Feb 25 '19 at 20:53
  • Do you actually assume signed or unsigned integers as input? And could you clarify "precise"? Must `1ul` (or `0x80000001`) result in `2^-32`? And should `0xffffffff` (or `0x7fffffff`) result in the nearest `float` to `n*2^-32` (i.e., `1.0f`) or the largest `float` smaller than that? – chtz Feb 25 '19 at 21:06
  • @chtz, the integer random number generator makes unsigned 64-bit integers that span the whole vector of `__m256i`. So my code assumes they are signed because `_mm256_cvtepi32_ps()` works with signed integers, but there are no more reasons to treat them as signed. – Serge Rogatch Feb 26 '19 at 05:55
  • @SergeRogatch That does not answer what you mean by "precise", though. Do you actually just want random floats between `[0, 1)` (e.g., do you care if `0x0` gets mapped to `0.5f` or `0.0f`)? And is the "exclusive" important? – chtz Feb 26 '19 at 09:11
  • 1
    @chtz, exclusivity is important, but inclusivity is not. So it can be `(0;1)` range. I just need random floats - I don't care how they are mapped to integers. However, by precision I mean that there should be as many different float numbers as possible, and they should be able to reach as closely to 0 as possible. Because I later use this in Box-Muller transformation for generating normally-distributed numbers. – Serge Rogatch Feb 26 '19 at 09:46
  • If you want to calculate `sqrt(-2log(x))` for the Box-Muller transformation, you actually need `x>0`, but `x=1` would be ok. And there might be faster ways to calculate that directly from 32bits, without first creating a `float`. Same for calculating `sin` and `cos` of the second variable. – chtz Feb 26 '19 at 10:02
  • 2
    @chtz you just calculate `sqrt(-2log(1-x))` – Severin Pappadeux Feb 27 '19 at 04:10
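
For context, here is a scalar sketch of the Box-Muller step mentioned in the comments above (an illustration only, not part of the thread; the function and variable names are hypothetical):

#include <math.h>

// With u1 in [0, 1), the argument 1 - u1 is in (0, 1], so logf never sees zero;
// u2 in [0, 1) supplies the angle.
static float boxMullerNormal( float u1, float u2 )
{
    const float r     = sqrtf( -2.0f * logf( 1.0f - u1 ) );
    const float theta = 6.283185307f * u2;   // 2 * pi * u2
    return r * cosf( theta );                // the matching sinf() would give a second variate
}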

2 Answers

9

The version below should be faster than your initial version that uses _mm256_div_ps.

vdivps is quite slow; e.g. on my Haswell Xeon it’s 18-21 cycles latency and 14 cycles throughput. Newer CPUs do better, BTW: 11/5 cycles (latency/throughput) on Skylake, 10/6 on Ryzen.

As said in the comments, the performance is fixable by replacing the division with a multiplication, and it improves further with FMA. The problem with that approach is the quality of the distribution: if you try to force these numbers into your output interval with a rounding mode or clipping, you’ll introduce peaks in the probability distribution of the output numbers.
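
For reference, a minimal sketch of that convert + FMA variant (not the method of this answer; the function name is mine). It keeps the rounding caveat discussed in the comments: inputs near INT32_MAX round up in the convert, so the output can be exactly 1.0f.

__m256 __vectorcall randomFloatsCvtFma( __m256i iRand )
{
    // Signed int32 lanes in [-2^31, 2^31) -> floats, then f * 2^-32 + 0.5 in one FMA.
    const __m256 f = _mm256_cvtepi32_ps( iRand );
    return _mm256_fmadd_ps( f, _mm256_set1_ps( 0x1p-32f ), _mm256_set1_ps( 0.5f ) );
}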

My implementation is not ideal either: it doesn’t output all possible values in the output interval and skips many representable floats, especially near 0. But at least the distribution is very even.

__m256 __vectorcall randomFloats( __m256i randomBits )
{
    // Convert to random float bits
    __m256 result = _mm256_castsi256_ps( randomBits );

    // Zero out exponent bits, leave random bits in mantissa.
    // BTW since the mask value is constexpr, we don't actually need AVX2 instructions for this, it's just easier to code with set1_epi32.
    const __m256 mantissaMask = _mm256_castsi256_ps( _mm256_set1_epi32( 0x007FFFFF ) );
    result = _mm256_and_ps( result, mantissaMask );

    // Set sign + exponent bits to that of 1.0, which is sign=0, exponent=2^0.
    const __m256 one = _mm256_set1_ps( 1.0f );
    result = _mm256_or_ps( result, one );

    // Subtract 1.0. The above algorithm generates floats in range [1..2).
    // Can't use bit tricks to generate floats in [0..1) because it would cause them to be distributed very unevenly.
    return _mm256_sub_ps( result, one );
}
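
For clarity, here is a scalar equivalent of the same bit trick (a sketch, not from the original answer; the function name is hypothetical):

#include <stdint.h>
#include <string.h>

// Keep 23 random bits as the mantissa, force sign/exponent to those of 1.0f
// (0x3F800000), which yields a float in [1, 2), then subtract 1.0f to get [0, 1).
static float randomFloatScalar( uint32_t randomBits )
{
    const uint32_t bits = ( randomBits & 0x007FFFFFu ) | 0x3F800000u;
    float f;
    memcpy( &f, &bits, sizeof( f ) );   // type-pun without violating aliasing rules
    return f - 1.0f;
}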

Update: if you want better precision, use the following version. But it’s no longer “the fastest”.

__m256 __vectorcall randomFloats_32( __m256i randomBits )
{
    // Convert to random float bits
    __m256 result = _mm256_castsi256_ps( randomBits );
    // Zero out exponent bits, leave random bits in mantissa.
    const __m256 mantissaMask = _mm256_castsi256_ps( _mm256_set1_epi32( 0x007FFFFF ) );
    result = _mm256_and_ps( result, mantissaMask );
    // Set sign + exponent bits to that of 1.0, which is sign=0, exponent = 2^0.
    const __m256 one = _mm256_set1_ps( 1.0f );
    result = _mm256_or_ps( result, one );
    // Subtract 1.0. The above algorithm generates floats in range [1..2).
    result = _mm256_sub_ps( result, one );

    // Use 9 unused random bits to add extra randomness to the lower bits of the values.
    // This increases precision to 2^-32, however most floats in the range can't store that many bits, fmadd will only add them for small enough values.

    // If you want uniformly distributed floats with 2^-24 precision, replace the second argument in the following line with _mm256_set1_epi32( 0x80000000 ).
    // In this case you don't need to set rounding mode bits in MXCSR.
    __m256i extraBits = _mm256_and_si256( randomBits, _mm256_castps_si256( mantissaMask ) );
    extraBits = _mm256_srli_epi32( extraBits, 9 );
    __m256 extra = _mm256_castsi256_ps( extraBits );
    extra = _mm256_or_ps( extra, one );
    extra = _mm256_sub_ps( extra, one );
    _MM_SET_ROUNDING_MODE( _MM_ROUND_DOWN );
    constexpr float mul = 0x1p-23f; // The initial part of the algorithm has generated uniform distribution with the step 2^-23.
    return _mm256_fmadd_ps( extra, _mm256_set1_ps( mul ), result );
}
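
One caller-side note (an assumption, not something stated above): _MM_SET_ROUNDING_MODE changes MXCSR for the whole thread, so it may be worth saving and restoring the previous mode around the call, e.g.:

// Sketch: save the current rounding mode, call the routine, then restore it
// so subsequent floating-point code is unaffected.
const unsigned int prevMode = _MM_GET_ROUNDING_MODE();
const __m256 floats = randomFloats_32( randomBits );
_MM_SET_ROUNDING_MODE( prevMode );
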
Soonts
  • 1
    Probably not faster than Eric's suggestion of cvt + FMA on Intel CPUs, but I think avoids the correctness problem that Antti pointed out for rounding of large integers outside the +-2^24 range in the convert. This might also be faster on AMD CPUs where there's only one FMA unit. Or even on Intel if the surrounding code bottlenecks on FMA/mul/add and leaves port 5 idle. – Peter Cordes Feb 25 '19 at 20:56
  • 1
    @PeterCordes Right, by “faster” I meant compared to the original OP’s code with div_ps. While it doesn’t output all possible floats in the output interval, the distribution is good with this method. If you’ll fix the cvtepi32_ps issue with `_MM_ROUND_TOWARD_ZERO` in `MXCSR`, the distribution will be slightly skewed towards 0.5. – Soonts Feb 25 '19 at 22:23
  • 1
    Oh right, good point about distribution. This is uniformly distributed over the range of real numbers, rather than over representable float bit-patterns. It would be a good idea to say some of that in the answer, and also say *why* it's faster (that `div` is quite slow), and for bonus points say something vs. FMA. Because there are lots of interesting issues here beyond speed. – Peter Cordes Feb 25 '19 at 22:32
  • 1
    From what the OP clarified in the comments, I guess a convert+FMA would be better, as it generates values closer to `0.0f` (and wastes fewer bits of the random integer). And generating `1.0f` should not be a problem for Box-Muller (exactly `0.0f` is however). – chtz Feb 26 '19 at 19:59
  • @PeterCordes Yes, a lot of interesting issues. I posted another version of conversion routine, would welcome your comments – Severin Pappadeux Feb 26 '19 at 20:01
3

First, no division: replace it with multiplication. While @Soonts' answer might be good enough for you, I would note that, because it maps to the [1...2) interval, it produces uniform dyadic rationals of the form k/2^23, which is half of what could be generated. I prefer the method from S. Vigna (at the bottom of his page), with all dyadic rationals of the form k/2^24 being equally likely.

Code, VC++2019, x64, Win10, Intel i7 Skylake

#include <cstdint>
#include <cstdio>
#include <random>

#include "immintrin.h"

auto p256_dec_u32(__m256i in) -> void {
    alignas(alignof(__m256i)) uint32_t v[8];
    _mm256_store_si256((__m256i*)v, in);
    printf("v8_u32: %u %u %u %u %u %u %u %u\n", v[0], v[1], v[2], v[3], v[4], v[5], v[6], v[7]);
}

auto p256_dec_f32(__m256 in) -> void {
    alignas(alignof(__m256)) float v[8];
    _mm256_store_ps(v, in);
    printf("v8_float: %e %e %e %e %e %e %e %e\n", v[0], v[1], v[2], v[3], v[4], v[5], v[6], v[7]);
}

auto main() -> int {
    const float c = 0x1.0p-24f; // or (1.0f / (uint32_t(1) << 24));

    const int N = 1000000;

    std::mt19937 rng{ 987654321ULL };

    __m256 sum = _mm256_set1_ps(0.0f);

    for (int k = 0; k != N; ++k) {
        alignas(alignof(__m256i)) uint32_t rnd[8] = { rng(), rng(), rng(), rng(), rng(), rng(), rng(), rng() };

        __m256i r = _mm256_load_si256((__m256i*)rnd);
        // Keep the top 24 bits of each lane (exactly representable in a float mantissa),
        // convert to float, and scale by 2^-24 to get a value in [0, 1).
        __m256  q = _mm256_mul_ps(_mm256_cvtepi32_ps(_mm256_srli_epi32(r, 8)), _mm256_set1_ps(c));

        sum = _mm256_add_ps(sum, q);
    }

    sum = _mm256_div_ps(sum, _mm256_set1_ps((float)N)); // computing average

    p256_dec_f32(sum);

    return 0;
}

with output

5.002970e-01 4.997833e-01 4.996118e-01 5.004955e-01 5.002163e-01 4.997193e-01 4.996586e-01 5.001499e-01
Severin Pappadeux
  • 1
    This is single-precision. `k/2^-52` would be for `__m256d`. – Peter Cordes Feb 26 '19 at 20:04
  • Why shift 8 bits to the right? I'd understand shifting by one bit, to remove the sign bit, but you unnecessarily remove 7 random bits (which would be useful to generate numbers between 0 and 2^-24). – chtz Feb 26 '19 at 20:04
  • 2
    @chtz Yes, this method produces only values of the form k/2^24, but equally distributed; some random input is lost. In order to get all possible floats, you have to do something like http://mumble.net/~campbell/2014/04/28/uniform-random-float, and the very bottom of S. Vigna's page – Severin Pappadeux Feb 26 '19 at 20:13