
I'm using Intel SSE/AVX/FMA intrinsics to get perfectly inlined SSE/AVX instructions for some math functions.

Given the following code:

#include <cmath>
#include <immintrin.h>

auto std_fma(float x, float y, float z)
{
    return std::fma(x, y, z);
}

float _fma(float x, float y, float z)
{
    _mm_store_ss(&x,
        _mm_fmadd_ss(_mm_load_ss(&x), _mm_load_ss(&y), _mm_load_ss(&z))
    );

    return x;
}

float _sqrt(float x)
{
    _mm_store_ss(&x,
        _mm_sqrt_ss(_mm_load_ss(&x))
    );

    return x;
}

The assembly generated by clang 3.9 with -march=x86-64 -mfma -O3:

std_fma(float, float, float):                          # @std_fma(float, float, float)
        vfmadd213ss     xmm0, xmm1, xmm2
        ret

_fma(float, float, float):                             # @_fma(float, float, float)
        vxorps  xmm3, xmm3, xmm3
        vmovss  xmm0, xmm3, xmm0        # xmm0 = xmm0[0],xmm3[1,2,3]
        vmovss  xmm1, xmm3, xmm1        # xmm1 = xmm1[0],xmm3[1,2,3]
        vmovss  xmm2, xmm3, xmm2        # xmm2 = xmm2[0],xmm3[1,2,3]
        vfmadd213ss     xmm0, xmm1, xmm2
        ret

_sqrt(float):                              # @_sqrt(float)
        vsqrtss xmm0, xmm0, xmm0
        ret

While the generated code for _sqrt is fine, there are unnecessary vxorps (which zeroes the completely unused xmm3 register) and vmovss instructions in _fma compared to std_fma (which relies on the compiler intrinsic std::fma).

The assembly generated by GCC 6.2 with -march=x86-64 -mfma -O3:

std_fma(float, float, float):
        vfmadd132ss     xmm0, xmm2, xmm1
        ret
_fma(float, float, float):
        vinsertps       xmm1, xmm1, xmm1, 0xe
        vinsertps       xmm2, xmm2, xmm2, 0xe
        vinsertps       xmm0, xmm0, xmm0, 0xe
        vfmadd132ss     xmm0, xmm2, xmm1
        ret
_sqrt(float):
        vinsertps       xmm0, xmm0, xmm0, 0xe
        vsqrtss xmm0, xmm0, xmm0
        ret

Here there are a lot of unnecessary vinsertps instructions; GCC even emits one in _sqrt.

Working example: https://godbolt.org/g/q1BQym

The default x64 calling convention passes floating-point function arguments in XMM registers, so those vmovss and vinsertps instructions should be eliminated. Why do the mentioned compilers still emit them? Is it possible to get rid of them without inline assembly?

I also tried using _mm_cvtss_f32 instead of _mm_store_ss, and multiple calling conventions, but nothing changed.
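
For reference, here is what the _mm_cvtss_f32 variant looks like (a sketch of that attempt, with my own naming; the generated code shows the same redundant zeroing):

float _fma_cvt(float x, float y, float z)
{
    // _mm_cvtss_f32 only extracts the lowest element, but _mm_load_ss
    // is still specified to zero the upper three elements, so the same
    // extra instructions show up.
    return _mm_cvtss_f32(
        _mm_fmadd_ss(_mm_load_ss(&x), _mm_load_ss(&y), _mm_load_ss(&z))
    );
}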

  • The result of the intrinsic `_mm_load_ss` is a 128-bit vector with the 32-bit floating-point value in the first element and zeroes in the other three elements. That's what the unnecessary instructions are doing: setting the other three elements to zero. The compilers aren't smart enough to detect that those elements are never used and ultimately discarded when the function returns, but they're doing what you asked them to do. It seems you already have the perfect solution for the FMA case however. – Ross Ridge Nov 04 '16 at 07:24
  • This is really bad; the compilers should know that, since I use `*_ss` intrinsics. – plasmacel Nov 04 '16 at 07:26
  • AFAIK, the only solution is not to do that (and I think this is a duplicate of http://stackoverflow.com/questions/39318496/how-to-merge-a-scalar-into-a-vector-without-the-compiler-wasting-an-instruction). Clang in some cases sees that upper elements are unused and can avoid touching them (see that linked question). You can get the compiler to use FMA when applicable for scalar code using an option (not just `-mfma` or `-ffast-math`) but I forget what and don't have time to look it up right now. Since `std::fma` inlines perfectly, just use it. – Peter Cordes Nov 04 '16 at 08:07
  • It's a shame that if you want perfectly inlined and optimized intrinsics, you have to write them yourself in inline assembly (which is not allowed in x64 VC++). – plasmacel Nov 04 '16 at 08:12
  • Yes, agreed that Intel's intrinsics are not perfectly designed. Of course, writing inline asm can defeat optimizations like CSE and constant propagation, so it's potentially much worse. See https://gcc.gnu.org/wiki/DontUseInlineAsm (and the section I just added to [my recent collatz-conjecture asm answer](http://stackoverflow.com/questions/40354978/why-is-this-c-code-faster-than-my-hand-written-assembly-for-testing-the-collat/40355466#40355466)). MSVC-style inline asm is terrible for wrapping single instructions: the inputs/outputs have to go through a store/reload round trip (~5c latency). – Peter Cordes Nov 04 '16 at 08:15
  • @PeterCordes Does such a short inline asm, `asm("vsqrtss %0, %0, %0" :"+x"(x));`, defeat optimization? – plasmacel Nov 04 '16 at 08:19
  • What exactly is the issue here? Did you want to use an intrinsic for something the compiler can't just generate for you? Like a SIMD-integer instruction to mess with the bits of a `float`? – Peter Cordes Nov 04 '16 at 08:19
  • @PeterCordes I want some functions that guarantee across different compilers that the corresponding SSE/AVX instruction will be emitted, instead of a call to a `std::` mumbo-jumbo function. – plasmacel Nov 04 '16 at 08:21
  • @plasmacel: yes, absolutely. If `x` is a compile-time constant after inlining, `sqrt(x)` is evaluated at compile time. But with inline asm, the compiler will emit `movss xmm0, .LC0` / `vsqrtss xmm0, xmm0, xmm0` or something. It can't fold the load into a memory operand, because you used an `"x"` constraint, but maybe it would be better not to do that in some cases because of the false dependency on the upper elements of the destination... All things the compiler will consider when emitting it itself. It also can't transform it into `vrsqrtss` + Newton iterations with `-ffast-math`. – Peter Cordes Nov 04 '16 at 08:22
  • Also, the compiler loses out on the information that the result is non-negative (at least if the compiler knows the input isn't NaN). This could let it optimize something else. It all depends on what this code is inlining into. For this specific case, probably the main potential downside is no constant propagation. IIRC, since it's not `volatile`, gcc can treat it as a pure function that only depends on its inputs, and CSE it if you evaluate sqrt of the same input multiple times. – Peter Cordes Nov 04 '16 at 08:25
  • ["Don't use inline asm"](https://gcc.gnu.org/wiki/DontUseInlineAsm) isn't an absolute rule, it's just something you should be aware of and consider carefully before using. If the alternatives don't meet your requirements, and you don't end up with this inlining into places where it can't optimize, then go right ahead. Oh BTW, I would use `asm("vsqrtss %1, %1, %0" :"=x"(result): "x"(input));` (assuming AT&T syntax operand order) so the compiler can take advantage of the non-destructive behaviour of AVX instructions to avoid a MOVAPS if it still needs the original value of `x` afterwards. – Peter Cordes Nov 04 '16 at 08:29
  • @PeterCordes Doesn't `asm("vsqrtss %1, %1, %0" :"+x"(x));` do the same as `asm("vsqrtss %1, %1, %0" :"=x"(x) : "x"(x));`? – plasmacel Nov 04 '16 at 08:34
  • But why do you think `std::fma` is "mumbo jumbo"? One of its reasons for existing is to expose this hardware functionality to programmers. Did you ever see `std::fma` fail to inline? http://en.cppreference.com/w/cpp/numeric/math/fma says it follows the standard error-handling behaviour, so possibly it's required to set `errno` on NaN, which would stop it from simply inlining, but your results show clang and gcc inlining it without `-ffast-math`, and they follow the rules for `sqrtf()`. (They actually inline `sqrtss` and branch on the result being NaN to an actual function call.) – Peter Cordes Nov 04 '16 at 08:35
  • `asm("vsqrtss %1, %1, %0" :"+x"(x));` won't compile. The version with all operands having the same number means that all three `%0` operands will always expand to the same thing, not allowing the compiler to choose a different output register. – Peter Cordes Nov 04 '16 at 08:37
  • @PeterCordes In some compilers (like VS2015) `std::fma` is implemented as a non-intrinsic function instead of a single FMA instruction, which is terribly slow. Even `std::sqrt` in GCC 6.2 is much more than a single `vsqrtss`. – plasmacel Nov 04 '16 at 08:38
  • I already described exactly how `std::sqrtf` compiles in my previous comment! It's extra instructions, but at least they're off the critical path for latency. Anyway, thanks for confirming that some compilers suck at `std::fma`, so it doesn't solve this problem for all compilers. – Peter Cordes Nov 04 '16 at 08:46
  • Turns out using `%1` with a single `+x` operand does compile (IDK how or why), but it doesn't help. Here's [a godbolt link](https://godbolt.org/g/VtNMLL) that demonstrates exactly what I'm talking about, where my way saves a VMOVAPS instruction. Also note that if you don't enable `-mavx`, gcc is mixing AVX-128 and SSE instructions (which is actually safe, unless the CPU was already in state C). Also, this code still won't run on non-AVX CPUs, because you forced usage of the VEX-coded version with the `v` version of the asm mnemonic. – Peter Cordes Nov 04 '16 at 08:48
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/127358/discussion-between-plasmacel-and-peter-cordes). – plasmacel Nov 04 '16 at 08:50
  • @PeterCordes I made an answer. – plasmacel Nov 05 '16 at 22:24
  • @plasmacel: nice. Just got back from a trip to see my brother's dinner theatre show, so I was away from SO for the past day, and have some catching up to do. – Peter Cordes Nov 05 '16 at 22:34

1 Answer


I'm writing this answer based on the comments, some discussion, and my own experience.

As Ross Ridge pointed out in the comments, the compiler is not smart enough to recognize that only the lowest floating-point element of the XMM register is used, so it zeroes out the other three elements with those vxorps/vinsertps instructions. This is completely unnecessary, but what can you do?
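
To make that concrete, here is the semantics the intrinsic guarantees (load_scalar is just an illustrative name):

#include <immintrin.h>

// _mm_load_ss is specified to return { *p, 0, 0, 0 }: the upper three
// elements are zeroed, which is exactly what the extra
// vxorps/vmovss/vinsertps instructions implement.
__m128 load_scalar(const float* p)
{
    return _mm_load_ss(p); // = { *p, 0.0f, 0.0f, 0.0f }
}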

Note that clang 3.9 does a much better job than GCC 6.2 (or the current snapshot of 7.0) at generating assembly for Intel intrinsics, since it only fails at _mm_fmadd_ss in my example. I tested more intrinsics as well, and in most cases clang did a perfect job of emitting single instructions.

What can you do

You can use the standard <cmath> functions and hope that they are defined as compiler intrinsics whenever a proper CPU instruction is available.
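
For example (the same pattern as std_fma in the question; the names are mine):

#include <cmath>

// With the flags discussed below, these compile to single
// vfmadd213ss / vsqrtss instructions on FMA/AVX targets.
float fma_std(float x, float y, float z) { return std::fma(x, y, z); }
float sqrt_std(float x)                  { return std::sqrt(x); }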

This is not enough

Compilers like GCC implement these functions with special handling of NaNs and infinities. So in addition to the intrinsic, they may emit comparisons, branches, and possibly errno handling.
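
For example, based on Peter's comments, GCC's inlined std::sqrt without the flags below behaves roughly like this sketch (an illustration, not actual GCC output or library source):

#include <cmath>
#include <immintrin.h>

// Rough sketch: inline the hardware sqrt, then branch to the
// out-of-line library function only when the result is NaN but the
// input wasn't, so errno still gets set on domain errors.
float sqrt_like_gcc(float x)
{
    float r = _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(x))); // vsqrtss
    if (r != r && x == x)    // negative input: domain error
        return std::sqrt(x); // stands for the out-of-line libm call
    return r;
}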

The compiler flags -fno-math-errno and -fno-trapping-math help GCC and clang eliminate the additional floating-point special-case and errno handling, so they can emit single instructions where possible: https://godbolt.org/g/LZJyaB.

You can achieve the same with -ffast-math, since it includes the above flags, but it also enables much more than that (like unsafe math optimizations), which is probably not desired.

Unfortunately this is not a portable solution. It works in most cases (see the godbolt link), but still, you depend on the implementation.

What more can you do

You can still use inline assembly, which is also not portable, much trickier, and leaves many more things to consider. In spite of that, for such simple one-line instructions it can be okay.

Things to consider:

1st GCC/clang and Visual Studio use different syntax for inline assembly, and Visual Studio doesn't allow it in x64 mode.

2nd You need to emit VEX-encoded instructions (3-operand variants, e.g. vsqrtss xmm0, xmm1, xmm2) for AVX targets, and non-VEX-encoded variants (2-operand variants, e.g. sqrtss xmm0, xmm1) for pre-AVX CPUs. VEX-encoded instructions are 3-operand instructions, so they offer more freedom for the compiler to optimize. To take advantage of that, the register input/output parameters must be set properly. So something like the following does the job:

#   if __AVX__
    asm("vsqrtss %1, %1, %0" :"=x"(x) : "x"(x));
#   else
    asm("sqrtss %1, %0" :"=x"(x) : "x"(x));
#   endif
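
Wrapped into a function, this might look like the following (a minimal sketch; my_sqrt is just my naming):

inline float my_sqrt(float x)
{
#   if __AVX__
    // VEX-encoded, 3-operand form: separate input and output operands
    // let the compiler pick a different destination register.
    asm("vsqrtss %1, %1, %0" :"=x"(x) : "x"(x));
#   else
    // Legacy SSE, 2-operand form.
    asm("sqrtss %1, %0" :"=x"(x) : "x"(x));
#   endif
    return x;
}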

But the following is a bad technique for VEX:

asm("vsqrtss %1, %1, %0" :"+x"(x));

It can yield an unnecessary move instruction; check https://godbolt.org/g/VtNMLL.

3rd As Peter Cordes pointed out, you can lose common subexpression elimination (CSE) and constant folding (constant propagation) with inline assembly functions. However, if the inline asm is not declared as volatile, the compiler can treat it as a pure function which depends only on its inputs and still perform common subexpression elimination, which is great.
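
For example (using the my_sqrt sketch above; compare the two on godbolt):

#include <cmath>

// std::sqrt of a non-negative compile-time constant folds to a
// constant; the inline asm version cannot be folded, so it executes
// vsqrtss at run time.
float folds()         { return std::sqrt(2.0f); }
float does_not_fold() { return my_sqrt(2.0f); }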

As Peter said:

"Don't use inline asm" isn't an absolute rule, it's just something you should be aware of and consider carefully before using. If the alternatives don't meet your requirements, and you don't end up with this inlining into places where it can't optimize, then go right ahead.
