It's not `std::vector` that's the problem, it's `float`, and GCC's usually-bad default of `-ftrapping-math`, which is supposed to treat FP exceptions as a visible side-effect but doesn't always do that correctly, and misses some optimizations that would be safe.

In this case, there's a conditional FP multiply in the source, so strict exception behaviour avoids possibly raising an overflow, underflow, inexact, or other exception in case the compare was false.

GCC handles that correctly here, using scalar code: `...ss` is Scalar Single, using the bottom element of 128-bit XMM registers, not vectorized at all. Your asm isn't GCC's actual output: it loads both elements with `vmovss`, then branches on a `vcomiss` result before `vmulss`, so the multiply doesn't happen if `b[i] > c[i]` isn't true. So unlike your "GCC" asm, GCC's actual asm does (I think) correctly implement `-ftrapping-math`.
Notice that your example which does auto-vectorize uses `int *` args, not `float *`. If you change it to `float *` and use the same compiler options, it doesn't auto-vectorize either, even with `float *__restrict a` (https://godbolt.org/z/nPzsf377b).
@273K's answer shows that AVX-512 lets `float` auto-vectorize even with `-ftrapping-math`, since AVX-512 masking (`ymm2{k1}{z}`) suppresses FP exceptions for masked elements, not raising FP exceptions from any FP multiplies that don't happen in the C++ abstract machine.
`gcc -O3 -mavx2 -mfma -fno-trapping-math` auto-vectorizes all 3 functions (Godbolt):

```cpp
void foo (float *__restrict a, float *__restrict b, float *__restrict c) {
    for (int i=0; i<256; i++){
        a[i] = (b[i] > c[i]) ? (b[i] * c[i]) : 0;
    }
}
```
```asm
foo(float*, float*, float*):
        xor     eax, eax
.L143:
        vmovups ymm2, YMMWORD PTR [rsi+rax]
        vmovups ymm3, YMMWORD PTR [rdx+rax]
        vmulps  ymm1, ymm2, YMMWORD PTR [rdx+rax]
        vcmpltps ymm0, ymm3, ymm2
        vandps  ymm0, ymm0, ymm1
        vmovups YMMWORD PTR [rdi+rax], ymm0
        add     rax, 32
        cmp     rax, 1024
        jne     .L143
        vzeroupper
        ret
```
BTW, I'd recommend `-march=x86-64-v3` for an AVX2+FMA feature-level. That also includes BMI1+BMI2 and other extensions. It still just uses `-mtune=generic` I think, but could hopefully in the future ignore tuning things that only matter for CPUs that don't have AVX2+FMA+BMI2.
The `std::vector` functions are bulkier since we didn't use `float *__restrict a = avec.data();` or similar to promise non-overlap of the data pointed to by the `std::vector` control blocks (and the size isn't known to be a multiple of the vector width), but the non-cleanup loops for the no-overlap case are vectorized with the same `vmulps` / `vcmpltps` / `vandps`.
**Tweaking the source to make the multiply unconditional? No**
If the multiply in the C source happens regardless of the condition, then GCC would be allowed to vectorize it the efficient way without AVX-512 masking.
```cpp
// still scalar asm with GCC -ftrapping-math, which is a bug
void foo (float *__restrict a, float *__restrict b, float *__restrict c) {
    for (int i=0; i<256; i++){
        float prod = b[i] * c[i];
        a[i] = (b[i] > c[i]) ? prod : 0;
    }
}
```
But unfortunately GCC `-O3 -march=x86-64-v3` (Godbolt, with and without the default `-ftrapping-math`) still makes scalar asm that only conditionally multiplies!

This is a bug in `-ftrapping-math`. Not only is it too conservative, missing the chance to auto-vectorize: it's actually buggy, not raising FP exceptions for some multiplies the abstract machine (or a debug build) actually performs. Crap behaviour like this is why `-ftrapping-math` is unreliable and probably shouldn't be on by default.
@Ovinus Real's answer points out that GCC `-ftrapping-math` could still have auto-vectorized the original source by masking both inputs instead of the output. `0.0 * 0.0` never raises any FP exceptions, so it's basically emulating AVX-512 zero-masking.

This would be more expensive and have more latency for out-of-order exec to hide, but is still much better than scalar, especially when AVX1 is available, and especially for small to medium arrays that are hot in some level of cache.

(If writing with intrinsics, just mask the output to zero unless you actually want to check the FP environment for exception flags after the loop.)
Doing this in scalar source doesn't lead GCC into making asm like that: GCC compiles this to the same branchy scalar asm unless you use `-fno-trapping-math`. At least that's not a bug this time, just a missed optimization: this version doesn't do `b[i]*c[i]` when the compare is false.
```cpp
// doesn't help, still scalar asm with GCC -ftrapping-math
void bar (float *__restrict a, float *__restrict b, float *__restrict c) {
    for (int i=0; i<256; i++){
        float bi = b[i];
        float ci = c[i];
        if (! (bi > ci)) {
            bi = ci = 0;
        }
        a[i] = bi * ci;
    }
}
```