
Using gcc 7.2 at godbolt.org, I can see that the following code is translated into quite optimal assembly: I see one load, one addition, and one store.

#include <immintrin.h>

__attribute__((always_inline)) inline double foo(double x, double y)
{
    return x+y;
}

void usefoo(double x, double *y, double *z)
{
    *z = foo(x, *y);
}

which results in:

usefoo(double, double*, double*):
   addsd xmm0, QWORD PTR [rdi]
   movsd QWORD PTR [rsi], xmm0
   ret

However, if I try to achieve the same using intrinsics, with the code below, some overhead is added. In particular, what is the point of the instruction `movq xmm0, xmm0`?

#include <immintrin.h>

__attribute__((always_inline)) inline double foo(double x, double y)
{
    return _mm_cvtsd_f64(_mm_add_sd(__m128d{x}, __m128d{y}));
}

void usefoo(double x, double *y, double *z)
{
    *z = foo(x, *y);
}

which results in:

usefoo(double, double*, double*):
  movq xmm1, QWORD PTR [rdi]
  movq xmm0, xmm0
  addsd xmm0, xmm1
  movlpd QWORD PTR [rsi], xmm0
  ret

How can I achieve, with scalar intrinsics, code equivalent to what the compiler would otherwise generate?

If you wonder why I may want to do that, think about replacing `+` with `<=`: if I write `x <= y`, the compiler converts the result to a `bool`, while the intrinsic keeps it as a double-sized bitmask. Hence, for my use case, writing `x <= y` is not an option. However, `+` was simple enough to illustrate the question.
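For instance, a minimal sketch of that comparison case (names like `le_mask` are arbitrary), using `_mm_cmple_sd` so the result stays a 64-bit all-ones/all-zeros mask rather than a `bool`:

#include <immintrin.h>

// Low lane of the result: 0xFFFFFFFFFFFFFFFF if x <= y, else 0,
// reinterpreted as a double instead of being converted to a bool.
static inline double le_mask(double x, double y)
{
    return _mm_cvtsd_f64(_mm_cmple_sd(__m128d{x}, __m128d{y}));
}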

Fabio
  • Using a union that way leads to undefined behavior – Ben Voigt Jan 02 '18 at 05:32
  • Could you please elaborate? Perhaps I have just been lucky, but I have been using that in VS and gcc for over 10 years and never experienced a problem. The only difference is that I typically use `double d[2]`. Not sure about gcc, but VS explicitly defines such unions in header files. What would you suggest otherwise? – Fabio Jan 02 '18 at 05:42
  • It is undefined behavior, according to the rules of the C++ language, to read from a union member other than the one last written to. There's an exception carved out allowing access to "common initial subsequence" when the union members are struct-typed, which permission doesn't apply here. If you want to pull a `double` out of a `__m128d` value, I suggest you use the intrinsic which exactly does that. – Ben Voigt Jan 02 '18 at 05:45
  • Not 100% sure but I think that `_mm_cvtsd_f64()` is the one you are looking for. – Ben Voigt Jan 02 '18 at 05:47
  • @BenVoigt: GNU C defines the behaviour of union type-punning in C++ (and C89), as an extension to the ISO standards (where it's only defined in C99/C11). Not that it's a good idea in this case... IDK if MSVC defines the behaviour of union type-punning, but I wouldn't be surprised if it's explicitly/intentionally supported there, because a lot of code doesn't use `memcpy` for type punning. – Peter Cordes Jan 02 '18 at 05:49
  • @Peter: Perhaps, although I can't find where in the G++ documentation that is guaranteed. But it seems like a bad idea to rely on vendor-specific behavior when the Intel intrinsic API provides a clean way to do it, and the program already has a dependency on Intel intrinsics. – Ben Voigt Jan 02 '18 at 05:55
  • 1
    In the past I observed that the use `_mm_cvtsd_f64` was resulting in the generation of two separate instructions, the extraction and the store. Now I can see that gcc 7.2 it does not make a difference as the optimizer can see it through. So I edited the question accordingly using `_mm_cvtsd_f64`, which has the nice benefit to simplify the source code. The assembler code generated does not change. – Fabio Jan 02 '18 at 06:02
  • 2
    @BenVoigt: It's documented here: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#Type%2Dpunning. That documentation isn't 100% clear that it applies to C++, rather than just C90, but [John Regehr agrees that GNU C explicitly supports union type-punning in C++.](https://blog.regehr.org/archives/959). Anyway, I totally agree that a union is a terrible idea compared to `_mm_cvtsd_f64`. The only sensible use here might be to work around the lack of an intrinsic for `double` ->`__m128d` leaving the upper half undefined instead of zeroed. https://stackoverflow.com/q/39318496/224132 – Peter Cordes Jan 02 '18 at 06:11
  • Just as a side-note: You are aware that most modern compilers are quite good at translating scalar code to vector code by themselves, so using various intrinsics often makes the code worse, right? – Mats Petersson Jan 02 '18 at 06:53
  • @MatsPetersson: Yes, I am. Still, it does not work that well when it comes to conditionals. Example: `x[i]>y[i] ? x[i]-y[i] : 0`. Using intrinsics you can ensure this is compiled into a branch-free, fully vectorized loop, using the bitmasks returned by SIMD comparison operations and sacrificing short-circuit boolean evaluation (see the sketch after these comments). I am not sure if there is a way to instruct the compiler's vectorizer to do so. – Fabio Jan 02 '18 at 07:29
  • Ah, that wasn't something I was aware of. Have you tried using Clang - I'm not making any promises, but I'd have thought it would manage that. – Mats Petersson Jan 02 '18 at 07:33
  • @MatsPetersson Auto-vectorization is generally still inferior to intrinsics in the hands of anybody who is even slightly competent about what they're doing. The fact is that compiler vectorization has a long way to go and is severely hindered by the lack of "high-level" information that only the programmer knows. – Mysticial Jan 02 '18 at 17:19
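As an illustration of the branch-free pattern mentioned in the comment above (a sketch only; the helper name `sub_or_zero` and the assumption that `n` is a multiple of 2 are mine):

#include <immintrin.h>
#include <stddef.h>

// Computes out[i] = x[i] > y[i] ? x[i] - y[i] : 0, two doubles per
// iteration, with no branches: the comparison mask selects either the
// difference or zero.
void sub_or_zero(const double *x, const double *y, double *out, size_t n)
{
    for (size_t i = 0; i < n; i += 2) {
        __m128d vx   = _mm_loadu_pd(x + i);
        __m128d vy   = _mm_loadu_pd(y + i);
        __m128d mask = _mm_cmpgt_pd(vx, vy);            // all-ones where x > y
        __m128d diff = _mm_sub_pd(vx, vy);
        _mm_storeu_pd(out + i, _mm_and_pd(mask, diff)); // zero where mask is 0
    }
}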

1 Answer


The "extraneous" `movq` is clearing the second element of the `__m128d`, as you requested with the list-initialization `__m128d{x}`. From the description of the `movq` instruction:

When the source operand is an XMM register, the low quadword is moved; when the destination operand is an XMM register, the quadword is stored to the low quadword of the register, and the high quadword is cleared to all 0s.

Remember that when fewer initializers are supplied than there are members, all remaining members are value-initialized (to zero).

I would expect a higher level of optimization to see that the second element is never used, and remove the extraneous instruction. On the other hand, even though unused, the second value cannot be allowed to trap during the addition operation, and clearing it explicitly may be the safest way to ensure it does not.
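For comparison, a sketch (not part of the original answer) of the same function written with `_mm_set_sd`, which likewise zeroes the upper lane, so equivalent code is expected:

#include <immintrin.h>

// _mm_set_sd(x) is equivalent to __m128d{x, 0.0}: the upper element is
// cleared, which is exactly what the movq xmm0, xmm0 above is doing.
static inline double foo_set_sd(double x, double y)
{
    return _mm_cvtsd_f64(_mm_add_sd(_mm_set_sd(x), _mm_set_sd(y)));
}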

Ben Voigt
  • `Remember that when fewer initializers are supplied than there are members, all remaining members are value-initialized (to zero)`. That is exactly the part I am struggling to avoid, and it is the part I do not like about `_mm_load_sd`. Is there a way to do that? In the end, if I am about to use an `_sd` operation, why would I care about the content of the other half of the register? In fact, if the compiler has to put a double into a `__m128d` to do a scalar operation, it leaves the other half undefined. But not when using intrinsics. – Fabio Jan 02 '18 at 06:07
  • @Fabio: Did you see the question Peter linked (which he asked himself long ago)? It seems to address the part we now know is the crux of your concern. – Ben Voigt Jan 02 '18 at 06:27
  • Yes it does. Spot on! Thank you all. – Fabio Jan 02 '18 at 06:30
  • I was working on an answer to this with some more details, but for now I'm just going to leave a comment with a [Godbolt link with some experiments to see what kind of a mess compilers make with different ways of getting scalars into `__m128`](https://godbolt.org/g/qFShdu), e.g. union vs. intrinsic. (Some manage to optimize away the zero-extension to a full register). Writing to part of a union and then reading a wider member seems to be really bad for gcc. – Peter Cordes Mar 11 '18 at 06:30