3

I would like to make some vector computation faster, and I believe that SIMD instructions for float comparison and manipulation could help. Here is the operation:

void func(const double* left, const double* right, double* res, const size_t size, const double th, const double drop) {
    for (size_t i = 0; i < size; ++i) {
        res[i] = right[i] >= th ? left[i] : (left[i] - drop);
    }
}

Mainly, it subtracts drop from the left value when the corresponding right value is below the threshold, and leaves it unchanged otherwise.

The size is around 128-256 (not that big), but the computation is called heavily.

I tried to start with loop unrolling, but did not gain much performance; maybe some compiler options are needed.

Could you please suggest some improvements to the code for faster computation?

Sindbag
  • 331
  • 3
  • 15
  • 1
    Strict optimizations of this kind are architecture-specific. Maybe you want to add some details about your platform? – BiagioF Jun 19 '19 at 14:32
  • @BiagioFesta yes, of course, I believe our cluster is based on Xeon and AVX2 is available, but I am not sure about AVX512 (I assume it is not allowed) – Sindbag Jun 19 '19 at 14:40
  • 3
    Compiling this with clang gives some decent auto-vectorized SIMD code: https://godbolt.org/z/lnwLPc (without the `__restrict` there is some additional code which first checks if the memory areas overlap) – chtz Jun 19 '19 at 14:47
  • 3
    If you are ok with calculating `left[i] - 0.0` in the true-case, you could replace the `vblendvpd` by `vandpd` of `drop` with the result of `!(right[i]>=th)` and subtract that result from `left[i]`: https://godbolt.org/z/-LIbSY – chtz Jun 19 '19 at 14:56
  • 2
    Can the whole thing be done with actual `float` instead of `double`? That would be a non-local change.. – harold Jun 19 '19 at 18:05
  • 2
    Updated my answer with a version that uses @chtz's suggestion: it auto-vectorizes even with GCC (without `-ffast-math`), and auto-vectorizes more efficiently for clang. – Peter Cordes Jun 20 '19 at 20:31

3 Answers

8

Clang already auto-vectorizes this pretty much the way Soonts suggested doing it manually. Use __restrict on your pointers so it doesn't need a fallback version that works when some of the arrays overlap. Without __restrict it still auto-vectorizes, but the overlap check bloats the function.

Unfortunately gcc only auto-vectorizes this with -ffast-math. It turns out that only -fno-trapping-math is actually required: that's generally safe, especially if you aren't using fenv access to unmask any FP exceptions (feenableexcept) or looking at MXCSR sticky FP exception flags (fetestexcept).

With that option, GCC too will use (v)blendvpd with -march=nehalem or -march=znver1. See it on Godbolt.

Also, your C function is broken: th and drop are scalar double, but you declare them as const double *.


AVX512F would let you do a !(right[i] >= thresh) compare and use the resulting mask for a merge-masked subtract.

Elements where the predicate was true will get left[i] - drop; other elements will keep their left[i] value, because you merge into a vector of left values.

Unfortunately GCC with -march=skylake-avx512 uses a normal vsubpd and then a separate vmovapd zmm2{k1}, zmm5 to blend, which is obviously a missed optimization. The blend destination is already one of the inputs to the SUB.

Using AVX512VL for 256-bit vectors (in case the rest of your program can't efficiently use 512-bit, so you don't suffer reduced turbo clock speeds):

__m256d left = ...;    // load 4 doubles from left[]
__m256d right = ...;   // load 4 doubles from right[]
__mmask8 cmp = _mm256_cmp_pd_mask(right, _mm256_set1_pd(th), _CMP_NGE_UQ);
__m256d res = _mm256_mask_sub_pd(left, cmp, left, _mm256_set1_pd(drop));

So (besides the loads and store) it's 2 instructions with AVX512F / VL.
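
For illustration, here's a minimal untested sketch of the whole loop built around those two instructions, assuming AVX512F + AVX512VL and non-overlapping buffers; the function name func_avx512vl, the unaligned loads/stores, and the scalar tail are illustrative choices, not part of the original answer:

#include <immintrin.h>
#include <stddef.h>

void func_avx512vl(const double *__restrict left, const double *__restrict right,
                   double *__restrict res, const size_t size,
                   const double th, const double drop)
{
    const __m256d thv   = _mm256_set1_pd(th);
    const __m256d dropv = _mm256_set1_pd(drop);
    size_t i = 0;
    for (; i + 4 <= size; i += 4) {
        __m256d l = _mm256_loadu_pd(left + i);
        __m256d r = _mm256_loadu_pd(right + i);
        // mask bit set where !(right[i] >= th), i.e. the elements we want to drop
        __mmask8 k = _mm256_cmp_pd_mask(r, thv, _CMP_NGE_UQ);
        // merge-masked subtract: masked elements get l - drop, the rest keep l
        _mm256_storeu_pd(res + i, _mm256_mask_sub_pd(l, k, l, dropv));
    }
    for (; i < size; ++i)   // scalar tail for sizes that aren't a multiple of 4
        res[i] = right[i] >= th ? left[i] : (left[i] - drop);
}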


If you don't need the specific NaN behaviour of your version, GCC can auto-vectorize this too.

And it's more efficient with all compilers because you just need an AND, not a variable-blend. So it's significantly better with just SSE2, and also better on most CPUs even when they do support SSE4.1 blendvpd, because that instruction isn't as efficient.

You can subtract 0.0 or drop from left[i] based on the compare result.

Producing 0.0 or a constant based on a compare result is extremely efficient: just an andps instruction. (The bit-pattern for 0.0 is all-zeros, and SIMD compares produce vectors of all-1 or all-0 bits. So AND keeps the old value or zeros it.)

We can also add -drop instead of subtracting drop. This costs an extra negation on input, but with AVX allows a memory-source operand for vaddpd. GCC chooses to use an indexed addressing mode so that doesn't actually help reduce the front-end uop count on Intel CPUs, though; it will "unlaminate". But even with -ffast-math, gcc doesn't do this optimization on its own to allow folding a load. (It wouldn't be worth doing separate pointer increments unless we unroll the loop, though.)

void func3(const double *__restrict left, const double *__restrict right, double *__restrict res,
  const size_t size, const double th, const double drop)
{
    for (size_t i = 0; i < size; ++i) {
        double add = right[i] >= th ? 0.0 : -drop;
        res[i] = left[i] + add;
    }
}

GCC 9.1's inner loop (compiled with -O3 -march=skylake, but without -ffast-math) from the Godbolt link above:

# func3 main loop
# gcc -O3 -march=skylake       (without fast-math)
.L33:
    vcmplepd        ymm2, ymm4, YMMWORD PTR [rsi+rax]
    vandnpd ymm2, ymm2, ymm3
    vaddpd  ymm2, ymm2, YMMWORD PTR [rdi+rax]
    vmovupd YMMWORD PTR [rdx+rax], ymm2
    add     rax, 32
    cmp     r8, rax
    jne     .L33

Or the plain SSE2 version (below) has an inner loop that's basically the same, just with left - zero_or_drop instead of left + zero_or_minus_drop, so unless you can promise the compiler 16-byte alignment or you're making an AVX version, negating drop is just extra overhead.

Negating drop takes a constant from memory (to XOR the sign bit), and that's the only static constant this function needs, so that tradeoff is worth considering for your case where the loop doesn't run a huge number of times. (Unless th or drop are also compile-time constants after inlining, and are getting loaded anyway. Or especially if -drop can be computed at compile time. Or if you can get your program to work with a negative drop.)

Another difference between adding and subtracting is that subtracting doesn't destroy the sign of zero. -0.0 - 0.0 = -0.0, +0.0 - 0.0 = +0.0. In case that matters.

# gcc9.1 -O3
.L26:
    movupd  xmm5, XMMWORD PTR [rsi+rax]
    movapd  xmm2, xmm4                    # duplicate  th
    movupd  xmm6, XMMWORD PTR [rdi+rax]
    cmplepd xmm2, xmm5                    # destroy the copy of th
    andnpd  xmm2, xmm3                    # _mm_andnot_pd
    addpd   xmm2, xmm6                    # _mm_add_pd
    movups  XMMWORD PTR [rdx+rax], xmm2
    add     rax, 16
    cmp     r8, rax
    jne     .L26

GCC uses unaligned loads, so (without AVX) it can't fold a memory source operand into cmppd or subpd.
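
If you'd rather write this manually, here is a minimal untested SSE2 intrinsics sketch of the same compare / ANDNOT / subtract idea (chtz's suggestion), computing left - (drop or 0.0); the function name and the scalar tail are illustrative, not taken from the compiler output above:

#include <emmintrin.h>   // SSE2
#include <stddef.h>

void func_sse2_and(const double *left, const double *right, double *res,
                   size_t size, double th, double drop)
{
    const __m128d thv   = _mm_set1_pd(th);
    const __m128d dropv = _mm_set1_pd(drop);
    size_t i = 0;
    for (; i + 2 <= size; i += 2) {
        __m128d l = _mm_loadu_pd(left + i);
        __m128d r = _mm_loadu_pd(right + i);
        __m128d keep = _mm_cmpge_pd(r, thv);         // all-ones where right >= th
        __m128d sub  = _mm_andnot_pd(keep, dropv);   // drop where the compare failed, else 0.0
        _mm_storeu_pd(res + i, _mm_sub_pd(l, sub));  // left - (drop or 0.0)
    }
    for (; i < size; ++i)                            // scalar tail for odd sizes
        res[i] = right[i] >= th ? left[i] : (left[i] - drop);
}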

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • This line of assembly: `vcmplepd ymm3, ymm4, YMMWORD PTR [rsi+rax]`. Won’t this instruction crash at runtime unless the caller code aligns these buffers with AVX in mind? My version uses `_mm_loadu_pd` for compatibility with non-SIMD-aware layouts, such as `std::vector` – Soonts Jun 20 '19 at 08:22
  • 3
    @Soonts: VEX-encoded instructions don't require alignment of their memory-source operands, except for instructions that specifically *do* have alignment requirements (like `vmovaps` or `vmovntps`). This is one of the many advantages of VEX encoding, even for 128-bit instructions. If you compile your version with `-march=sandybridge`, you'll see that the compiler can fold `_mm_loadu_pd` into a memory operand for a compare. (If it can find a predicate that fits with `right[]` as the 2nd source; but AVX does add a bunch of new comparison predicates for `vcmppd`) – Peter Cordes Jun 20 '19 at 08:32
  • Interesting, thanks. A couple of years ago I fixed a bug where I forgot to align one thing, and in 32-bit builds _mm_and_si128 sometimes crashed accessing the value. I was building C++ with VS2015, not sure about the instruction encoding. – Soonts Jun 20 '19 at 08:52
  • @Soonts: you must have been building without `/arch:AVX`, where only a `_mm_load_si128()` can fold into a memory operand for legacy SSE2 `pand`. The VEX encoding, `vpand`, will never fault for alignment. – Peter Cordes Jun 20 '19 at 18:19
  • 3
    `-fno-trapping-math` seems sufficient for gcc to vectorize. – Marc Glisse Jun 20 '19 at 19:41
  • 1
    @MarcGlisse: That's a missed optimization for AVX512, then: My understanding is that a masked subtract will not even compute the subtract in the masked elements, so they won't raise FP exceptions (or set exception bits in MXCSR). So we get FP fault suppression as well as memory fault suppression for memory sources. (Of course this would require GCC to actually do a masked subtract, instead of a separate blend.) – Peter Cordes Jun 20 '19 at 19:44
  • 1
    @Marc: Hoisting `double ldrop = left[i]-drop;` out of the ternary means the C always computes it, like branchless asm wants to. GCC still won't auto-vec unless we use `-fno-trapping-math` https://godbolt.org/z/1K_1L2. The non-vectorized scalar version uses a branch, so it *doesn't* cause an FP exception if `right[i]>th`, for `left[i]` or `drop` being NaN. Is that a bug? (I'm also not clear on whether trapping-math considers the *number* of FP exceptions a visible side-effect (i.e. unmasked and actually trapping) or if it's just whether or not the sticky exception bits in MXCSR ever get set.) – Peter Cordes Jun 20 '19 at 19:58
  • 6
    `-ftrapping-math` and `-frounding-math` are completely broken, they never worked. Expecting them to make consistent, sensible choices is rather optimistic. – Marc Glisse Jun 20 '19 at 20:44
  • 3
    @MarcGlisse: Has anyone proposed turning them off by default? If they don't work then they're just creating missed optimizations for no(?) benefit. – Peter Cordes Jun 20 '19 at 21:08
  • 3
    rounding is off by default, and for trapping it is PR54192. – Marc Glisse Jun 20 '19 at 23:15
  • 2
    Another (in most cases irrelevant) effect of adding instead of subtracting: `-0.0 + 0.0 == +0.0`, whereas `-0.0 - 0.0 == -0.0`, i.e., `-0.0` inside `left` will get converted to `+0.0`. – chtz Jun 21 '19 at 11:11
4

Here you go (untested); I’ve tried to explain in the comments what the instructions do.

#include <assert.h>
#include <stddef.h>
#include <smmintrin.h>  // SSE 4.1, for _mm_blendv_pd

void func_sse41( const double* left, const double* right, double* res,
    const size_t size, double th, double drop )
{
    // Verify the size is even.
    // If it's not, you'll need extra code at the end to process the last value the old way.
    assert( 0 == ( size % 2 ) );

    // Broadcast the scalar values into 2 registers.
    const __m128d threshold = _mm_set1_pd( th );
    const __m128d dropVec = _mm_set1_pd( drop );

    for( size_t i = 0; i < size; i += 2 )
    {
        // Load 4 double values into registers, 2 from right, 2 from left
        const __m128d r = _mm_loadu_pd( right + i );
        const __m128d l = _mm_loadu_pd( left + i );
        // Compare ( r >= threshold ) for 2 values at once
        const __m128d comp = _mm_cmpge_pd( r, threshold );
        // Compute ( left[ i ] - drop ), for 2 values at once
        const __m128d dropped = _mm_sub_pd( l, dropVec );
        // Select left where the comparison passed, ( left - drop ) where it failed.
        // This is the only instruction here that requires SSE 4.1.
        const __m128d result = _mm_blendv_pd( dropped, l, comp );
        // Store the 2 result values
        _mm_storeu_pd( res + i, result );
    }
}

The code will crash with an “invalid instruction” runtime error if the CPU doesn’t have SSE 4.1. For best results, detect support with CPUID so you can fail gracefully. I think now in 2019 it’s quite reasonable to assume it’s supported: Intel has shipped it since 2008, AMD since 2011, and the Steam survey says “96.3%”. If you want to support older CPUs, it’s possible to emulate _mm_blendv_pd with 3 other instructions: _mm_and_pd, _mm_andnot_pd, _mm_or_pd (see the sketch below).
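
For example, a minimal SSE2-only emulation of that blend could look like this (assuming the selector comes from a compare, so each lane is all-ones or all-zeros; the helper name is just illustrative):

#include <emmintrin.h>  // SSE2

// Select b where the mask lane is all-ones, a where it is all-zeros.
static inline __m128d blendv_pd_sse2( __m128d a, __m128d b, __m128d mask )
{
    return _mm_or_pd( _mm_and_pd( mask, b ), _mm_andnot_pd( mask, a ) );
}

With masks produced by _mm_cmpge_pd this gives the same result as _mm_blendv_pd, which looks only at the sign bit of each lane.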

If you can guarantee the data is aligned, replacing the loads with _mm_load_pd will be slightly faster: _mm_cmpge_pd compiles into CMPPD https://www.felixcloutier.com/x86/cmppd which can take one of its arguments directly from RAM.

Potentially, you can get a further 2x improvement by writing an AVX version (see the sketch below). But I hope even the SSE version is faster than your code: it handles 2 values per iteration and doesn’t have conditions inside the loop. If you’re unlucky, AVX will be slower: many CPUs need some time to power up their AVX units, which takes many thousands of cycles, and until powered up, AVX code runs very slowly.
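
As a rough untested sketch of such an AVX version (again assuming size is a multiple of 4; the function name is just illustrative), processing 4 doubles per iteration:

#include <assert.h>
#include <stddef.h>
#include <immintrin.h>  // AVX

void func_avx( const double* left, const double* right, double* res,
    const size_t size, double th, double drop )
{
    assert( 0 == ( size % 4 ) );  // otherwise, process the tail the old way

    const __m256d threshold = _mm256_set1_pd( th );
    const __m256d dropVec = _mm256_set1_pd( drop );

    for( size_t i = 0; i < size; i += 4 )
    {
        const __m256d r = _mm256_loadu_pd( right + i );
        const __m256d l = _mm256_loadu_pd( left + i );
        // Compare ( r >= threshold ) for 4 values at once
        const __m256d comp = _mm256_cmp_pd( r, threshold, _CMP_GE_OQ );
        // ( left - drop ) for 4 values at once
        const __m256d dropped = _mm256_sub_pd( l, dropVec );
        // Keep left where the compare passed, take ( left - drop ) where it failed
        _mm256_storeu_pd( res + i, _mm256_blendv_pd( dropped, l, comp ) );
    }
}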

Soonts
  • 20,079
  • 9
  • 57
  • 130
  • 1
    chtz's comment suggested avoiding `blendvpd` in favour of unconditional subtract of an `andps` result. That's both more efficient and only requires SSE2, with the only possible difference being with NaN inputs. – Peter Cordes Jun 20 '19 at 06:17
  • During the AVX "warm up" time, Agner Fog reports something like 1/4 performance on Intel. That's still roughly equal to scalar, or better if the scalar code was inefficient. And the AVX upper lane stays powered on / warmed up for milliseconds, so that can only bite if you use 256-bit vectors very infrequently, and thus it can't be a throughput problem. (But possibly latency if you need to respond quickly to a rare event.) – Peter Cordes Jun 20 '19 at 06:24
  • @PeterCordes About the subtraction trick, I’m not sure it's faster: `blendpd` has 1 cycle latency and a throughput of 2-3 instructions/cycle; `andpd` also has 1 cycle latency, but before Skylake the throughput was only 1 instruction/cycle, which may suggest it was only executed by a single EU. But indeed, it's definitely more compatible, and faster than the 3 bitwise operations workaround I’ve mentioned. – Soonts Jun 20 '19 at 07:24
  • 2
    `blendpd` is not the same thing as `blendvpd` though – harold Jun 20 '19 at 07:42
  • 1
    @Soonts: on Ryzen, variable blends are no more expensive than immediate blends. But on Intel they are, except for Skylake with the non-VEX legacy SSE encoding (implicit blend-control source in XMM0). On all other Intel CPUs, variable blends are extra uops or even limited to the shuffle port. Have a look at [Agner Fog's instruction tables](https://agner.org/optimize/) outside of the Skylake and Ryzen tabs. (Although those *are* the most relevant microarchitectures going forward: Ryzen and SKL (and KBL / CFL) are more and more widespread. But that doesn't help for AVX `vblendvpd`.) – Peter Cordes Jun 20 '19 at 07:45
2

You can use GCC's and Clang's vector extensions to implement a ternary select function (see https://stackoverflow.com/a/48538557/2542702).

#include <stddef.h>
#include <inttypes.h>

#if defined(__clang__)
typedef  double double4 __attribute__ ((ext_vector_type(4)));
typedef int64_t   long4 __attribute__ ((ext_vector_type(4)));
#else
typedef  double double4 __attribute__ ((vector_size (sizeof(double)*4)));
typedef int64_t   long4 __attribute__ ((vector_size (sizeof(int64_t)*4)));
#endif

double4 select(long4 s, double4 a, double4 b) {
  double4 c;
  #if defined(__GNUC__) && !defined(__INTEL_COMPILER) && !defined(__clang__)
  c = s ? a : b;
  #else
  for(int i=0; i<4; i++) c[i] = s[i] ? a[i] : b[i];
  #endif
  return c;
}

void func(double* left, double* right, double* res, size_t size, double th, double drop) {
  size_t i;
  // Vectorized loop: 4 doubles per iteration.
  for (i = 0; i<(size&-4); i+=4) {
    double4 leftv = *(double4*)&left[i];
    double4 rightv = *(double4*)&right[i];
    *(double4*)&res[i] = select(rightv >= th, leftv, leftv - drop);
  }
  // Scalar cleanup for the remaining 0-3 elements.
  for(;i<size; i++) res[i] = right[i] >= th ? left[i] : (left[i] - drop);
}

https://godbolt.org/z/h4OKMl
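
A quick usage sketch (assuming the definitions above are in the same file; the sizes and values here are made up, and the arrays are over-aligned because the casts to double4 assume 32-byte alignment):

#include <stdio.h>

int main(void) {
  _Alignas(32) double l[256], r[256], out[256];
  for (int i = 0; i < 256; i++) { l[i] = i; r[i] = (i % 2) ? 2.0 : 0.0; }
  func(l, r, out, 256, 1.0, 0.5);      // th = 1.0, drop = 0.5
  printf("%f %f\n", out[0], out[1]);   // out[0] dropped (-0.5), out[1] kept (1.0)
  return 0;
}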

Z boson
  • 32,619
  • 11
  • 123
  • 226