30

Consider the following float loop, compiled with -O3 -mavx2 -mfma:

for (auto i = 0; i < a.size(); ++i) {
    a[i] = (b[i] > c[i]) ? (b[i] * c[i]) : 0;
}

Clang does a perfect job of vectorizing it. It uses 256-bit ymm registers and understands the difference between vblendps and vandps, picking whichever gives the best performance.

.LBB0_7:
        vcmpltps        ymm2, ymm1, ymm0
        vmulps  ymm0, ymm0, ymm1
        vandps  ymm0, ymm2, ymm0

GCC, however, does much worse. For some reason it never does better than 128-bit SSE-width registers (-mprefer-vector-width=256 doesn't change anything).

.L6:
        vcomiss xmm0, xmm1
        vmulss  xmm0, xmm0, xmm1
        vmovss  DWORD PTR [rcx+rax*4], xmm0

If I replace it with a plain array (as in the guidelines), GCC does vectorize it to AVX ymm registers.

int a[256], b[256], c[256];
auto foo (int *a, int *b, int *c) {
  int i;
  for (i=0; i<256; i++){
    a[i] =  (b[i] > c[i]) ? (b[i] * c[i]) : 0;
  }
}

However, I couldn't find a way to do the same with a variable-length std::vector. What sort of hint does GCC need to vectorize a std::vector loop to AVX?

Source on Godbolt with gcc 13.1 and clang 14.0.0

Peter Cordes
Vladislav Kogan
    BTW the SSE code wasn't really using 128-bit vectors as such, it's scalar code (with the `ss` suffix standing for 'scalar, single precision'). If it was actually *vectorized* with SSE, the suffixes would be `ps`. – harold Jul 14 '23 at 11:17
    Also, that's not SSE, it's using the AVX encodings (`vcomiss` scalar compare, not legacy-SSE `comiss`) with 128-bit XMM registers, because you enabled AVX. – Peter Cordes Jul 15 '23 at 15:43

3 Answers

35

It's not std::vector that's the problem, it's float and GCC's usually-bad default of -ftrapping-math that is supposed to treat FP exceptions as a visible side-effect, but doesn't always correctly do that, and misses some optimizations that would be safe.

In this case, there is a conditional FP multiply in the source, so strict exception behaviour avoids possibly raising an overflow, underflow, inexact, or other exception in case the compare was false.

GCC does that correctly in this case using scalar code: ...ss is Scalar Single, using the bottom element of 128-bit XMM registers, not vectorized at all. The asm you show isn't GCC's actual output: GCC's real code loads both elements with vmovss, then branches on a vcomiss result before doing vmulss, so the multiply doesn't happen if b[i] > c[i] isn't true. So unlike your "GCC" asm, GCC's actual asm does, I think, correctly implement -ftrapping-math.

Notice that your example which does auto-vectorize uses int * args, not float*. If you change it to float* and use the same compiler options, it doesn't auto-vectorize either, even with float *__restrict a (https://godbolt.org/z/nPzsf377b).

@273K's answer shows that AVX-512 lets float auto-vectorize even with -ftrapping-math, since AVX-512 masking (ymm2{k1}{z}) suppresses FP exceptions for masked elements, not raising FP exceptions from any FP multiplies that don't happen in the C++ abstract machine.


gcc -O3 -mavx2 -mfma -fno-trapping-math auto-vectorizes all 3 functions (Godbolt)

void foo (float *__restrict a, float *__restrict b, float *__restrict c) {
  for (int i=0; i<256; i++){
    a[i] =  (b[i] > c[i]) ? (b[i] * c[i]) : 0;
  }
}
foo(float*, float*, float*):
        xor     eax, eax
.L143:
        vmovups ymm2, YMMWORD PTR [rsi+rax]
        vmovups ymm3, YMMWORD PTR [rdx+rax]
        vmulps  ymm1, ymm2, YMMWORD PTR [rdx+rax]
        vcmpltps        ymm0, ymm3, ymm2
        vandps  ymm0, ymm0, ymm1
        vmovups YMMWORD PTR [rdi+rax], ymm0
        add     rax, 32
        cmp     rax, 1024
        jne     .L143
        vzeroupper
        ret

BTW, I'd recommend -march=x86-64-v3 for an AVX2+FMA feature-level. That also includes BMI1+BMI2 and stuff. It still just uses -mtune=generic I think, but could hopefully in future ignore tuning things that only matter for CPUs that don't have AVX2+FMA+BMI2.

The std::vector functions are bulkier since we didn't use float *__restrict a = avec.data(); or similar to promise non-overlap of the data pointed-to by the std::vector control blocks (and the size isn't known to be a multiple of the vector width), but the non-cleanup loops for the no-overlap case are vectorized with the same vmulps / vcmpltps / vandps.


Tweaking the source to make the multiply unconditional? No

If the multiply in the C source happens regardless of the condition, then GCC would be allowed to vectorize it the efficient way without AVX-512 masking.

// still scalar asm with GCC -ftrapping-math which is a bug
void foo (float *__restrict a, float *__restrict b, float *__restrict c) {
  for (int i=0; i<256; i++){
    float prod = b[i] * c[i];
    a[i] =  (b[i] > c[i]) ? prod : 0;
  }
}

But unfortunately GCC -O3 -march=x86-64-v3 (Godbolt with and without the default -ftrapping-math) still makes scalar asm that only conditionally multiplies!

This is a bug in -ftrapping-math. Not only is it too conservative, missing the chance to auto-vectorize: It's actually buggy, not raising FP exceptions for some multiplies the abstract machine (or a debug build) actually performs. Crap behaviour like this is why -ftrapping-math is unreliable and probably shouldn't be on by default.


@Ovinus Real's answer points out GCC -ftrapping-math could still have auto-vectorized the original source by masking both inputs instead of the output. 0.0 * 0.0 never raises any FP exceptions, so it's basically emulating AVX-512 zero-masking.

This would be more expensive and have more latency for out-of-order exec to hide, but is still much better than scalar especially when AVX1 is available, especially for small to medium arrays that are hot in some level of cache.

(If writing with intrinsics, just mask the output to zero unless you actually want to check the FP environment for exception flags after the loop.)

Doing this in scalar source doesn't lead GCC into making asm like that: GCC compiles this to the same branchy scalar asm unless you use -fno-trapping-math. At least that's not a bug this time, just a missed optimization: this doesn't do b[i]*c[i] when the compare is false.

// doesn't help, still scalar asm with GCC -ftrapping-math
void bar (float *__restrict a, float *__restrict b, float *__restrict c) {
  for (int i=0; i<256; i++){
    float bi = b[i];
    float ci = c[i];
    if (! (bi > ci)) {
        bi = ci = 0;
    }
    a[i] = bi * ci;
  }
}
Peter Cordes
20

GCC by default compiles for older CPU architectures.

Setting -march=native enables using 256-bit ymm registers.

.L7:
        vmovups ymm1, YMMWORD PTR [rsi+rax]
        vmovups ymm0, YMMWORD PTR [rdx+rax]
        vcmpps  k1, ymm1, ymm0, 14
        vmulps  ymm2{k1}{z}, ymm1, ymm0
        vmovups YMMWORD PTR [rcx+rax], ymm2

Setting -march=x86-64-v4 enables using 512-bit zmm registers.

.L7:
        vmovups zmm2, ZMMWORD PTR [rsi+rax]
        vcmpps  k1, zmm2, ZMMWORD PTR [rdx+rax], 14
        vmulps  zmm0{k1}{z}, zmm2, ZMMWORD PTR [rdx+rax]
        vmovups ZMMWORD PTR [rcx+rax], zmm0
273K
  • Thanks. Yes, I tested with -mavx512f (both of your variants implicitly enable it) before asking the question. It's still strange that GCC gives either scalar/SSE or AVX-512F assembly with nothing in between. For example, -march=skylake or -march=x86-64-v3 won't use AVX/AVX2 despite including them. – Vladislav Kogan Jul 14 '23 at 00:16
    Yes, agree, it's strange, GCC makes one big step forward without intermediate smaller steps. – 273K Jul 14 '23 at 00:18
    @VladislavKogan: AVX-512 masking suppresses FP exceptions from masked elements, letting GCC make vectorized asm that respects `-ftrapping-math` (which is on by default). That's why it can vectorize with AVX-512 but not earlier extensions if you don't turn off `-ftrapping-math`. BTW, `-march=native` allowing 256-bit vectorization only applies on CPUs with AVX-512, like Ice Lake and Zen 4. (On most CPUs the default is `-mprefer-vector-width=256`, but apparently `-march=x86-64-v4` prefers vector-width=512.) – Peter Cordes Jul 14 '23 at 01:54
1

Assuming -ftrapping-math, another option is to zero the ignored inputs before multiplying them (untested):

for (size_t i = 0; i < size; i += 4) {
    __m128 x = _mm_loadu_ps(a + i);
    __m128 y = _mm_loadu_ps(b + i);
    __m128 cmp = _mm_cmplt_ps(x, y);

    x = _mm_and_ps(x, cmp);   // zero both inputs where the compare is false
    y = _mm_and_ps(y, cmp);

    _mm_storeu_ps(a + i, _mm_mul_ps(x, y));
}

This of course translates to larger widths.

Both inputs must be zeroed, because +0.0 * x is -0.0 if x < 0. On some processors this will probably have the same throughput as other solutions of the same vector width. This same method will work for addition, subtraction, and square root. Division will require a divisor other than zero.

Even under -fno-trapping-math, this solution might be slightly superior to one masking after the multiplication, because it avoids penalties associated with ignored inputs that require microcoded multiplication. But I'm not sure whether the throughput can be the same as a version which zeroes after the multiplication.

Ovinus Real
    You mean asm like what that might compile to? Obviously if you're writing intrinsics by hand, you would just do it efficiently with the compare in parallel with multiply, and one `and` or `andn`, unless you actually were going to check the FP environment afterwards. What I might do is write the source to make the multiply unconditional, like `tmp = b[i]*c[i];` and then use it. This could be vectorized easily, but this is one of those missed-optimization cases where GCC `-ftrapping-math` sucks and we get scalar code: https://godbolt.org/z/zMrvrh8EG – Peter Cordes Jul 15 '23 at 23:44
    (And if the number of FP exceptions raised is supposed to be the same as the source, `-ftrapping-math` doesn't preserve that when it still vectorizes an unconditional `a[i] = b[i] * c[i]`. It makes no sense that that vectorizes but not the conditional selecting between `prod` and `0`, especially when it is willing to use `vcmpps` with AVX-512. Good example of `-ftrapping-math` being inconsistent and bad.) – Peter Cordes Jul 15 '23 at 23:47
    Anyway, yes, if GCC did want to vectorize while respecting `-ftrapping-math`, it could do it this way. `0 * 0` doesn't raise any FP exceptions, I'm pretty sure not even denormal or underflow even though `0.0` is encoded with an exponent field of all-zero. – Peter Cordes Jul 15 '23 at 23:50
  • Ah; given how GCC vectorized it with `vcmpps` I assumed that -ftrapping-math only required that some exception be raised at some point, not that each exception would have to be raised in turn. Which is odd, of course. – Ovinus Real Jul 16 '23 at 19:07
  • I'm not sure what `-ftrapping-math` is *intended* to do if FP exceptions are unmasked. The "trap" name implies they should actually trap, not just set sticky bits, but perhaps that name just sounded better than `-fmath-exceptions` or something. As you say, in practice the total number of instructions that raise FP exceptions is *not* preserved by `-ftrapping-math` when it auto-vectorizes something without any conditional behaviour. https://gcc.gnu.org/wiki/FloatingPointMath implies that it's supposed to honour "exception flags and traps" when `-fsignaling-nans` is also enabled (same asm here) – Peter Cordes Jul 16 '23 at 20:08
  • So I guess an exception handler that increments a counter and repairs the situation is *not* supported. (That couldn't easily be written in pure C anyway; a signal handler wouldn't know what main-thread registers or memory had the operands that lead to a fault even if you had `volatile double` vars (since it might not use a memory source operand, and couldn't on a RISC), and it would be target-specific.) So I assume it intends to preserve the fact that 0 vs. a non-0 number of FP exceptions happen in a block or function, but doesn't always succeed at that. – Peter Cordes Jul 16 '23 at 20:12