
I have an array of signed shorts that I want to divide by 2048 and get an array of floats as a result.

I found SSE: convert short integer to float, which shows how to convert unsigned shorts to floats, but I also need to handle signed shorts.

The code below works, but only for positive shorts.

// We want to divide some signed short by 2048 and get a float.
const auto floatScale = _mm256_set1_ps(2048);

short* shortsInput = /* values from somewhere */;
float* floatsOutput = /* initialized */;

__m128i* m128iInput = (__m128i*)&shortsInput[0];

// Converts the short vectors to 2 float vectors. This works, but only for positive shorts.
__m128i m128iLow = _mm_unpacklo_epi16(m128iInput[0], _mm_setzero_si128());
__m128i m128iHigh = _mm_unpackhi_epi16(m128iInput[0], _mm_setzero_si128());
__m128 m128Low = _mm_cvtepi32_ps(m128iLow);
__m128 m128High = _mm_cvtepi32_ps(m128iHigh);

// Puts the 2 __m128 vectors into 1 __m256.
__m256 singleComplete = _mm256_castps128_ps256(m128Low);
singleComplete = _mm256_insertf128_ps(singleComplete, m128High, 1);

// Finally do the math
__m256 scaledVect = _mm256_div_ps(singleComplete, floatScale);

// and puts the result where needed.
_mm256_storeu_ps(&floatsOutput[0], scaledVect);

How can I convert my signed shorts to floats? Or maybe there's a better way to tackle this problem?


EDIT: I compared the different answers against a non-SIMD algorithm, running each one 10M times over a 2048-element array, on an AMD Ryzen 7 2700 at ~3.2GHz. I'm using Visual Studio 15.7.3 with mostly the default config:

/permissive- /Yu"stdafx.h" /GS /GL /W3 /Gy /Zc:wchar_t /Zi /Gm- /O2 /sdl 
/Fd"x64\Release\vc141.pdb" /Zc:inline /fp:precise /D "NDEBUG" /D "_CONSOLE"
/D "_UNICODE" /D "UNICODE" /errorReport:prompt /WX- /Zc:forScope
/arch:AVX2 /Gd /Oi /MD /openmp /FC /Fa"x64\Release\" /EHsc /nologo
/Fo"x64\Release\" /Fp"x64\Release\test.pch" /diagnostics:classic 

Note that I'm very new to SIMD and haven't used C++ in ages. Here's what I get (I reran each test separately rather than one after the other, and got better results that way):

  • No SIMD: 7300ms
  • wim's answer: 2300ms
  • chtz's SSE2 answer: 1650ms
  • chtz's AVX2 answer: 2100ms

So I get a nice speedup by using SIMD, and chtz's SSE2 answer, though more verbose and harder to understand, is faster. (At least when compiled with AVX enabled, so it avoids extra register-copy instructions by using 3-operand VEX-coded instructions. On Intel CPUs, the AVX2 versions should be significantly faster than the 128-bit version.)

Here's my test code:

#include <immintrin.h>
#include <chrono>
#include <cstdio>

const int size = 2048;
const int loopSize = (int)1e7;

float* noSimd(short* shortsInput) {
    float* floatsOutput = new float[size];

    auto startTime = std::chrono::high_resolution_clock::now();

    for (int i = 0; i < loopSize; i++) {
        for (int j = 0; j < size; j++) {
            floatsOutput[j] = shortsInput[j] / 2048.0f;
        }
    }

    auto stopTime = std::chrono::high_resolution_clock::now();
    long long totalTime = (stopTime - startTime).count();

    printf("%lld noSimd\n", totalTime);

    return floatsOutput;
}

float* wimMethod(short* shortsInput) {
    const auto floatScale = _mm256_set1_ps(1.0f / 2048.0f);
    float* floatsOutput = new float[size];

    auto startTime = std::chrono::high_resolution_clock::now();

    for (int i = 0; i < loopSize; i++) {
        for (int j = 0; j < size; j += 8) {
            __m128i short_vec = _mm_loadu_si128((__m128i*)&shortsInput[j]);
            __m256i int_vec = _mm256_cvtepi16_epi32(short_vec);
            __m256  singleComplete = _mm256_cvtepi32_ps(int_vec);

            // Finally do the math
            __m256 scaledVect = _mm256_mul_ps(singleComplete, floatScale);

            // and puts the result where needed.
            _mm256_storeu_ps(&floatsOutput[j], scaledVect);
        }
    }

    auto stopTime = std::chrono::high_resolution_clock::now();
    long long totalTime = (stopTime - startTime).count();

    printf("%lld wimMethod\n", totalTime);

    return floatsOutput;
}

float* chtzMethodSSE2(short* shortsInput) {
    // Constants used in the loop below (same values as in chtz's answer):
    const __m128i const0x8000 = _mm_set1_epi16((short)0x8000);
    const __m128i const0x4580 = _mm_set1_epi16(0x4580);
    const __m128  constFloat = _mm_set_ps1(float((1 << 23) + (1 << 15)) / 2048.f);

    float* floatsOutput = new float[size];

    auto startTime = std::chrono::high_resolution_clock::now();

    for (int i = 0; i < loopSize; i++) {
        for (int j = 0; j < size; j += 8) {
            // get input:
            __m128i val = _mm_loadu_si128((__m128i*)&shortsInput[j]);
            // add 0x8000 to wrap to unsigned short domain:
            val = _mm_add_epi16(val, const0x8000);
            // interleave with upper part of float(1<<23)/2048.f:
            __m128i lo = _mm_unpacklo_epi16(val, const0x4580);
            __m128i hi = _mm_unpackhi_epi16(val, const0x4580);
            // interpret as float and subtract float((1<<23) + (0x8000))/2048.f
            __m128 lo_f = _mm_sub_ps(_mm_castsi128_ps(lo), constFloat);
            __m128 hi_f = _mm_sub_ps(_mm_castsi128_ps(hi), constFloat);
            // store:
            _mm_storeu_ps(&floatsOutput[j], lo_f);
            _mm_storeu_ps(&floatsOutput[j] + 4, hi_f);
        }
    }

    auto stopTime = std::chrono::high_resolution_clock::now();
    long long totalTime = (stopTime - startTime).count();

    printf("%lld chtzMethod\n", totalTime);

    return floatsOutput;
}

float* chtzMethodAVX2(short* shortsInput) {
    const auto floatScale = _mm256_set1_ps(1.0f / 2048.0f);
    float* floatsOutput = new float[size];

    auto startTime = std::chrono::high_resolution_clock::now();

    for (int i = 0; i < loopSize; i++) {
        for (int j = 0; j < size; j += 8) {

            // get input:
            __m128i val = _mm_loadu_si128((__m128i*)&shortsInput[j]);
            // interleave with 0x0000
            __m256i val_unpacked = _mm256_cvtepu16_epi32(val);

            // 0x4580'8000
            const __m256 magic = _mm256_set1_ps(float((1 << 23) + (1 << 15)) / 2048.f);
            const __m256i magic_i = _mm256_castps_si256(magic);

            /// convert by xor-ing and subtracting magic value:
            // VPXOR avoids port5 bottlenecks on Intel CPUs before SKL
            __m256 val_f = _mm256_castsi256_ps(_mm256_xor_si256(val_unpacked, magic_i));
            __m256 converted = _mm256_sub_ps(val_f, magic);
            // store:
            _mm256_storeu_ps(&floatsOutput[j], converted);
        }
    }

    auto stopTime = std::chrono::high_resolution_clock::now();
    long long totalTime = (stopTime - startTime).count();

    printf("%lld chtzMethod2\n", totalTime);

    return floatsOutput;
}
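
A minimal driver for these functions could look like this (a sketch for completeness; in practice I timed each method in a separate run, and the input values here are arbitrary):

int main() {
    short* shortsInput = new short[size];
    for (int i = 0; i < size; i++)
        shortsInput[i] = (short)(i - size / 2);   // mix of negative and positive test values

    delete[] noSimd(shortsInput);
    delete[] wimMethod(shortsInput);
    delete[] chtzMethodSSE2(shortsInput);
    delete[] chtzMethodAVX2(shortsInput);

    delete[] shortsInput;
    return 0;
}
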
asked by user276648; edited by Peter Cordes

  • Try to write a loop and see what the compiler vectorizes it to. Then you only need to check the intrinsics that correspond to those instructions. – Marc Glisse May 30 '18 at 06:47
  • You can use the unsigned version that you linked to - you just need to change the unpacking so that it does sign extension when converting from short to int, rather than putting 0 in the high bytes. – Paul R May 30 '18 at 07:25
  • You can use the `_mm256_cvtepi16_epi32 (__m128i a)` intrinsic to convert to signed int. Then `_mm256_cvtepi32_ps (__m256i a)` to convert to float. – wim May 30 '18 at 07:37
  • Instead of `scaledVect = _mm256_div_ps(singleComplete, floatScale);`, I would use `floatScale = _mm256_set1_ps(1.0f/2048.0f);` and `scaledVect = _mm256_mul_ps(singleComplete, floatScale);`, which is much faster. – wim May 30 '18 at 07:39
  • @wim: I wonder if any compilers would do that for you? `1.0f/2048.0f` is exactly representable as a `float`, so the transformation is legal even without `-ffast-math`. – Peter Cordes May 30 '18 at 07:51
  • `m128iInput[0]` only works if your source data is aligned. Otherwise it will fault. And BTW, `_mm256_cvtepi32_ps` is in AVX1, so your AVX1 code could have shuffled together the integer vectors to feed one 256-bit conversion, instead of shuffling together the results of two 128-bit conversions. – Peter Cordes May 30 '18 at 07:52
  • @PeterCordes With respect to the `div`: not always. Godbolt [link](https://godbolt.org/g/nEA3SA). GCC uses `vmulps` if the scale factor is `2048.0f`, but `vdivps` if the scale factor is `2048.1f`! – wim May 30 '18 at 08:05
  • @wim: ah, so ICC takes intrinsics more literally, more like assembly language. That could be good or bad. Re: gcc: of course it doesn't take a reciprocal of `2048.1f` without `-ffast-math`, that would be illegal. With that option, it does optimize to a multiply by the nearest float to `1/2048.1f`, even though the exact value isn't exactly representable. – Peter Cordes May 30 '18 at 08:15
  • Out of curiosity: did you use an AMD CPU for the testing? What type of CPU exactly (brand+model)? – wim Jun 01 '18 at 11:17
  • And what compiler / options, and test-loop code? Did it use an indexed addressing mode for `vpmovsxwd` so it can't micro-fuse the load? – Peter Cordes Jun 02 '18 at 02:02
  • And was your output buffer 32-byte aligned? Seems weird that the SSE2 version would be faster than Wim's, except maybe on AMD, though. – Peter Cordes Jun 02 '18 at 02:24
  • @wim: Probably an AMD CPU. I benchmarked on Skylake, and both versions were essentially equal with a similar amount of loop unrolling. (And should be on Haswell as well, after I fixed chtz's to use VPXOR instead of VXORPS). But Ryzen has only one 128-bit execution unit for `int->float`, so `vpcvtdq2ps ymm` has one per 2 clock throughput and might be a bottleneck. Or maybe not, 256-bit stores are also one per 2 clock, if you're storing the result instead of using it on the fly. On ~4.4GHz Skylake, 1M iterations of 2048 elements takes ~81ms. (I added benchmarks to the answers :) – Peter Cordes Jun 02 '18 at 06:32
  • @user27...: If you get a chance, I'm curious how fast Ryzen runs the AVX2 version of chtz's code. It might possibly be slower than the SSE2 version on Ryzen (because it's 128-bit `vpaddd` before 2 shuffle uops vs. 256-bit `vpxor` after probably at least 2 shuffle uops), but on Skylake the AVX2 version should be about twice as fast as SSE2. (So keep that in mind if you care about AVX2 CPUs in general, not just AMD). – Peter Cordes Jun 04 '18 at 01:49
  • @PeterCordes: I updated the post and chtz's AVX2 version is indeed a bit slower than his SSE2 version. Note that I redid all the tests several times, but running only one method each time, to avoid possible side effects (memory or other), and I get better results like that. chtz's SSE2 version is still faster than the others. – user276648 Jun 04 '18 at 02:53
  • When you say one at a time, do you mean without any warm-up to give the CPU time to ramp up to full speed? 90ms is pretty short. You should probably bump your repeat count up by a factor of 10, and make sure the times scale linearly. If they don't, your measurement setup included some warm-up effects. (On Linux it works very well to put a microbenchmark in a static executable that does literally nothing else besides an `exit` system call, and profile the whole thing with performance counters. But if you just want time, a C++ high-rez timer built-in to the program works.) – Peter Cordes Jun 04 '18 at 02:56
  • @PeterCordes: you're right, 90ms started being quite low. I redid the test on 10M to minimize the warm-up penalty. I didn't change the code otherwise. – user276648 Jun 04 '18 at 03:25
  • Thanks for the detailed testing. The SKL benchmarks I added to the answers were done with 10M iters, getting a total time of ~817 ms at ~4.4GHz. What clock speed was your Ryzen running at? It shouldn't be possible for Ryzen to store more than 16 bytes per cycle (http://agner.org/optimize/ store throughput limits), but even if your Ryzen was running at 5GHz then that would be 4.82 floats per clock, or 19.3 bytes per clock. Are you sure you got your loop counters right, and didn't convert every other 16-byte block or something? My actual test code is in Godbolt links in the answers. – Peter Cordes Jun 04 '18 at 03:37
  • Max boost clock for [a Ryzen 7 2700](https://www.amd.com/en/products/cpu/amd-ryzen-7-2700) is 4.1GHz at stock speeds. And BTW, that store-throughput limit is to L1d cache. It's also an execution-port bottleneck: only one port for stores, and it's only 16 bytes wide. So either Agner's testing was flawed (possible), Ryzen is magic, or your compiler + your test code aren't doing as much work as you claim. :/ Oh wait, you used `/openmp`? Is your compiler multithreading this? But then why are you *only* getting ~5 floats per clock, instead of like 4 per core per clock? – Peter Cordes Jun 04 '18 at 03:42
  • @PeterCordes: you're right again! I won't be able to trick you into buying a Ryzen it seems...I mistakenly ran those tests with `size=1024` instead of 2048. I updated the timings again and got twice the amount of time. My Ryzen seems to be running at 3.2GHz. I'm not using OpenMP for that specific piece of code, as it seems I get poorer performances. – user276648 Jun 04 '18 at 04:02
  • 3.2 is the stock clock speed. I would have thought it would boost up to 4.1 if there's only one core active, even if it's doing a lot of SIMD work. But maybe not, IDK. Anyway, I checked MSVC output for your code (https://godbolt.org/g/f4uXKf), and it doesn't do anything weird. (It also doesn't unroll at all, so you might be losing speed to loop overhead here. `gcc` without `-funroll-loops` only runs 10M * 2048 in ~1200ms on SKL. But Ryzen has a wider front-end so it might not hurt as much.) Anyway, those results look sane now. – Peter Cordes Jun 04 '18 at 04:06
  • Ryzen has very good ALU throughput for a lot of things, especially shuffles, but it doesn't have Intel's L1d throughput. Converting on the fly to feed a loop that wants `float` input could certainly make sense, and give you much better computational intensity (ALU vs. memory). – Peter Cordes Jun 04 '18 at 04:13
  • @PeterCordes: 3.2 is indeed the stock clock speed, I didn't check if it was boosted or not. Thanks for the advice on converting on the fly, I'll try it in my real code. – user276648 Jun 04 '18 at 05:15
  • @user276648 Initially, I was a bit surprised by your performance results (128-bit SSE faster than 256-bit AVX), which brought me to the thought that the benchmarks were executed with an AMD CPU. Thanks for clarifying your benchmark results. – wim Jun 05 '18 at 12:33

2 Answers


You can replace the standard approach of converting epi16 -> epi32 -> float and then multiplying by 1.f/2048.f with manually composing the float.

This works because the divisor is a power of 2, so manually composing the float just means a different exponent.
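
To see the arithmetic behind this, here is a minimal scalar sketch of the same bit trick (plain C++, no intrinsics; the function name is just for illustration):

#include <cstdint>
#include <cstring>
#include <cstdio>

// One signed short -> float, divided by 2048, via the magic-number trick.
float convert_one(int16_t s) {
    const uint32_t magic_bits = 0x45808000u;   // bit pattern of float((1<<23) + (1<<15))/2048.f == 4112.0f
    float magic;
    std::memcpy(&magic, &magic_bits, sizeof magic);

    uint32_t u = (uint16_t)s ^ 0x8000u;        // flip the sign bit: u == s + 32768, in 0..65535
    uint32_t bits = 0x45800000u | u;           // bit pattern of float ((1<<23) + u)/2048.f == 4096 + u/2048.f
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f - magic;                          // (4096 + u/2048) - 4112 == (u - 32768)/2048 == s/2048
}

int main() {
    for (int s : { -32768, -1, 0, 1, 2048, 32767 })
        printf("%6d -> %f (expected %f)\n", s, convert_one((int16_t)s), s / 2048.0f);
}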

Thanks to @PeterCordes, here is an optimized AVX2 version of this idea, using XOR to set the upper bytes of the 32-bit float at the same time as flipping the sign bit of the integer value. FP SUB turns those low bits of the mantissa into the correct FP value:

// get input:
__m128i val = _mm_loadu_si128((__m128i*)input);
// interleave with 0x0000
__m256i val_unpacked = _mm256_cvtepu16_epi32(val);

// 0x4580'8000
const __m256 magic = _mm256_set1_ps(float((1<<23) + (1<<15))/2048.f);
const __m256i magic_i = _mm256_castps_si256(magic);

/// convert by xor-ing and subtracting magic value:
// VPXOR avoids port5 bottlenecks on Intel CPUs before SKL
__m256 val_f = _mm256_castsi256_ps(_mm256_xor_si256(val_unpacked, magic_i));
__m256 converted = _mm256_sub_ps(val_f, magic);
// store:
_mm256_storeu_ps(output, converted);

See it on the Godbolt compiler explorer with gcc and clang; on Skylake i7-6700k, a 2048 element loop that's hot in cache takes ~360 clock cycles, the same speed (to within measurement error) as @wim's version that does the standard sign-extend/convert/multiply (with a similar amount of loop unrolling). Tested by @PeterCordes with Linux perf. But on Ryzen, this could be significantly faster, because we avoid _mm256_cvtepi32_ps (Ryzen has 1 per 2 clock throughput for vcvtdq2ps ymm: http://agner.org/optimize/.)

The xor of 0x8000 with the lower half is equivalent to adding/subtracting 0x8000, since overflow/carry is ignored. And coincidentally, this allows using the same magic constant for XOR-ing and subtracting.

Strangely, gcc and clang prefer to replace the subtraction with an addition of -magic, which does not re-use the constant ... They prefer add because it's commutative, but in this case there's no benefit because they're not using it with a memory operand.


Here's an SSE2 version that does the signed/unsigned flip separately from setting the high 2 bytes of a 32-bit FP bit-pattern.

We're using one _mm_xor_si128 (or _mm_add_epi16), two _mm_unpackXX_epi16 and two _mm_sub_ps for 8 values (the _mm_castsi128_ps are no-ops, and the _mm_set constants would be cached in registers):

// get input:
__m128i val = _mm_loadu_si128((__m128i*)input);
// add 0x8000 to wrap to unsigned short domain:
// val = _mm_add_epi16(val, _mm_set1_epi16(0x8000));
val = _mm_xor_si128(val, _mm_set1_epi16(0x8000));  // PXOR runs on more ports, avoids competing with FP add/sub or unpack on Sandybridge/Haswell.

// interleave with upper part of float(1<<23)/2048.f:
__m128i lo = _mm_unpacklo_epi16(val, _mm_set1_epi16(0x4580));
__m128i hi = _mm_unpackhi_epi16(val, _mm_set1_epi16(0x4580));
// interpret as float and subtract float((1<<23) + (0x8000))/2048.f
__m128 lo_f = _mm_sub_ps(_mm_castsi128_ps(lo), _mm_set_ps1(float((1<<23) + (1<<15))/2048.f));
__m128 hi_f = _mm_sub_ps(_mm_castsi128_ps(hi), _mm_set_ps1(float((1<<23) + (1<<15))/2048.f));
// store:
_mm_storeu_ps(output, lo_f);
_mm_storeu_ps(output+4, hi_f);

Usage demonstration: https://ideone.com/b8BfJd

If your input had been unsigned short, the _mm_add_epi16 would not be necessary (and the 1<<15 in the _mm_sub_ps would need to be removed, of course). Then you'd have Marat's answer on SSE: convert short integer to float.
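
For reference, a sketch of that unsigned variant (same structure as the code above, untested):

// 8x uint16_t -> 8x float, divided by 2048; no sign flip needed:
__m128i val = _mm_loadu_si128((__m128i*)input);
// interleave with the upper half of float(1<<23)/2048.f:
__m128i lo = _mm_unpacklo_epi16(val, _mm_set1_epi16(0x4580));
__m128i hi = _mm_unpackhi_epi16(val, _mm_set1_epi16(0x4580));
// interpret as float and subtract float(1<<23)/2048.f:
__m128 lo_f = _mm_sub_ps(_mm_castsi128_ps(lo), _mm_set_ps1(float(1<<23)/2048.f));
__m128 hi_f = _mm_sub_ps(_mm_castsi128_ps(hi), _mm_set_ps1(float(1<<23)/2048.f));
// store:
_mm_storeu_ps(output, lo_f);
_mm_storeu_ps(output+4, hi_f);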

This can easily be ported to AVX2 with twice as many conversions per iteration, but care has to be taken regarding the order of output elements (thanks @wim for pointing this out).
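
A sketch of how that AVX2 port could look, with @wim's suggested _mm256_permute4x64_epi64 applied up front so the in-lane unpacks produce output in the natural order (not benchmarked):

// 16 shorts per iteration:
__m256i val = _mm256_loadu_si256((__m256i*)input);                  // s0..s15
val = _mm256_xor_si256(val, _mm256_set1_epi16((short)0x8000));      // signed -> biased-unsigned domain
val = _mm256_permute4x64_epi64(val, _MM_SHUFFLE(3,1,2,0));          // reorder 64-bit chunks to q0 q2 | q1 q3
__m256i lo = _mm256_unpacklo_epi16(val, _mm256_set1_epi16(0x4580)); // float bit-patterns for s0..s7
__m256i hi = _mm256_unpackhi_epi16(val, _mm256_set1_epi16(0x4580)); // float bit-patterns for s8..s15
const __m256 magic = _mm256_set1_ps(float((1<<23) + (1<<15))/2048.f);
__m256 lo_f = _mm256_sub_ps(_mm256_castsi256_ps(lo), magic);
__m256 hi_f = _mm256_sub_ps(_mm256_castsi256_ps(hi), magic);
_mm256_storeu_ps(output, lo_f);
_mm256_storeu_ps(output + 8, hi_f);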


Also, for a pure SSE solution, one could simply use _mm_cvtpi16_ps, but that's an Intel library function. There's no single instruction that does this.

// cast input pointer:
__m64* input64 = (__m64*)input;
// convert and scale:
__m128 lo_f = _mm_mul_ps(_mm_cvtpi16_ps(input64[0]), _mm_set_ps1(1.f/2048.f));
__m128 hi_f = _mm_mul_ps(_mm_cvtpi16_ps(input64[1]), _mm_set_ps1(1.f/2048.f));

I did not benchmark any of these solutions (nor check theoretical throughputs or latencies).

answered by chtz
  • Interesting! I have been thinking about porting this to AVX2. In principle that solution would be more efficient than my answer, but the short values in the output vectors are not in the 'natural' order, right? (Because the unpacking doesn't cross the 128-bit lanes.) – wim May 30 '18 at 08:41
  • @wim That's true, I actually forgot about this when claiming it could be ported easily. – chtz May 30 '18 at 09:17
  • When porting to AVX2, you can get the 'right' output order if you permute the input with `_mm256_permute4x64_epi64`. – wim May 30 '18 at 10:09
  • @wim and chtz: I'd suggest `vpmovzxwd ymm, [mem128]` for a lane-crossing unpack, and `vpxor ymm, set1(0x4580'8000)` to set the upper half to the constant, and to flip the high bit of the low half all in one operation. (Note that signed->unsigned with ADD is equivalent to XOR, because the only possible carry is discarded. So carryless add (XOR) is equivalent, and thus we can use it after unpacking instead of having to mask off a carry from addition. Or we could still use `add_epi16` with `set1_epi32(0x4580'8000)` because `0+x=x` in the upper half. But VPXOR runs on more ports on pre-SKL.) – Peter Cordes Jun 01 '18 at 01:18
  • `VXORPS` will produce an output in the FP domain without causing any domain-crossing penalty for reading input from integer, but don't do that because Intel before SKL will only run it on port 5, conflicting with `vpmovzx`. (And on SKL, the lack of domain-crossing depends on which port it runs on.) – Peter Cordes Jun 01 '18 at 01:22
  • @PeterCordes Good idea. `vpmovzxwd`+`vpxor`+`vsub` is one instruction less than my solution. – wim Jun 01 '18 at 10:24
  • @wim: no it isn't. Yours is also a memory-source `vpmovsxwd ymm, [mem128]` (which can micro-fuse into 1 uop), then `vcvtdq2ps ymm,ymm` / `vmulps ymm`. There's no separate intrinsic for memory-source `vpmovzx`, but when the memory source is at least 128 bits wide compilers will reliably fold the load. (Unlike `vpmovzxbd ymm, [mem64]`, where various ways of expressing a `movq` load fail to optimize away with some compilers. Stupid intrinsics.) But it is better than chtz's current answer, which would take 8 uops per 2 YMM vectors, vs. mine taking 6 on SKL. (With a port5 bottleneck.) – Peter Cordes Jun 01 '18 at 10:38
  • @chtz: **I strongly recommend *against* `_mm256_xor_ps`**, like I said in the middle of a previous comment. That will cut throughput in half on Haswell vs. `_mm256_xor_si256`, if the compiler actually uses `vxorps`. It will compete for port 5 against `vpmovzx` which is already your bottleneck. – Peter Cordes Jun 02 '18 at 01:28
  • @PeterCordes Apologies for wrongfully representing your suggestion, and thanks for fixing it! – chtz Jun 02 '18 at 16:48

With AVX2 it is not necessary to convert the high and low pieces separately:

const auto floatScale = _mm256_set1_ps(1.0f/2048.0f);

short* shortsInput = /* values from somewhere */;
float* floatsOutput = /* initialized */;

__m128i short_vec = _mm_loadu_si128((__m128i*)shortsInput);
__m256i int_vec =  _mm256_cvtepi16_epi32 (short_vec);
__m256  singleComplete = _mm256_cvtepi32_ps (int_vec);

// Finally do the math
__m256 scaledVect = _mm256_mul_ps(singleComplete, floatScale);

// and puts the result where needed.
_mm256_storeu_ps(floatsOutput, scaledVect);

This compiles nicely on the Godbolt compiler explorer, and with input/output hot in L1d cache and aligned input/output arrays, converts an array of 2048 elements in ~360 clock cycles on Skylake i7-6700k (tested in a repeat loop). That's ~0.18 cycles per element, or ~5.7 conversions per clock cycle. Or ~1.4 cycles per vector, including the store. It's mostly bottlenecked on front-end throughput (3.75 fused-domain uops per clock), even with clang's loop unrolling, because a conversion is 5 uops.

Note that vpmovsxwd ymm, [mem] can't micro-fuse into a single uop even with a simple addressing mode on Haswell/Skylake, so in this case it's good that recent gcc/clang transform pointer-increments into indexed addressing with a single loop counter. With most memory-source vector instructions (like vpmovsxwd xmm, [mem]), that would have cost an extra uop: Micro fusion and addressing modes.

With one load and one store, it's ok that the stores can't run on Haswell/Skylake's port7 store AGU, which only handles non-indexed addressing modes.

Loop unrolling is needed for max throughput on Intel CPUs (if there are no memory bottlenecks), because the load + convert + store is already 4 uops. Same as with @chtz's answer.
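
For example, a manually 2x-unrolled version of the conversion loop might look like this (a sketch using the variable names from the benchmark code in the question; size is assumed to be a multiple of 16, and compilers may already unroll for you):

for (int j = 0; j < size; j += 16) {
    __m128i s0 = _mm_loadu_si128((__m128i*)&shortsInput[j]);
    __m128i s1 = _mm_loadu_si128((__m128i*)&shortsInput[j + 8]);
    __m256 f0 = _mm256_mul_ps(_mm256_cvtepi32_ps(_mm256_cvtepi16_epi32(s0)), floatScale);
    __m256 f1 = _mm256_mul_ps(_mm256_cvtepi32_ps(_mm256_cvtepi16_epi32(s1)), floatScale);
    _mm256_storeu_ps(&floatsOutput[j], f0);
    _mm256_storeu_ps(&floatsOutput[j + 8], f1);
}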

Ideally use the vector result for further computation right away, if you only need to read the float values a couple times. It's only 3 instructions (but does have some latency for out-of-order exec to hide). Redoing the conversion when needed might be better than having a bigger cache footprint to store the twice-as-large float[] result in memory; it depends on your use-case and the hardware.
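
As an illustration of using the converted vectors on the fly, this hypothetical loop sums the scaled values instead of storing them (again using the names from the question's benchmark code):

__m256 acc = _mm256_setzero_ps();
for (int j = 0; j < size; j += 8) {
    __m128i short_vec = _mm_loadu_si128((__m128i*)&shortsInput[j]);
    __m256  f = _mm256_mul_ps(_mm256_cvtepi32_ps(_mm256_cvtepi16_epi32(short_vec)), floatScale);
    acc = _mm256_add_ps(acc, f);           // consume the floats here instead of storing them
}
float tmp[8];
_mm256_storeu_ps(tmp, acc);                // simple horizontal sum of the accumulator
float total = tmp[0]+tmp[1]+tmp[2]+tmp[3]+tmp[4]+tmp[5]+tmp[6]+tmp[7];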

answered by wim; edited by Peter Cordes
  • @PeterCordes Thanks for editing my answer and adding performance info to my (and chtz's) answer! – wim Jun 05 '18 at 12:19