Using gcc 7.2 at godbolt.org I can see the following code is translated into quite optimal assembly: 1 load, 1 addition and 1 store.
#include <immintrin.h>
__attribute__((always_inline)) inline double foo(double x, double y)
{
    return x + y;
}
void usefoo(double x, double *y, double *z)
{
    *z = foo(x, *y);
}
which results in:
usefoo(double, double*, double*):
        addsd   xmm0, QWORD PTR [rdi]
        movsd   QWORD PTR [rsi], xmm0
        ret
However, if I try to achieve the same using intrinsics, with the code below, I can see some overhead is added. In particular, what is the purpose of the instruction movq xmm0, xmm0?
#include <immintrin.h>
__attribute__((always_inline)) inline double foo(double x, double y)
{
    return _mm_cvtsd_f64(_mm_add_sd(__m128d{x}, __m128d{y}));
}
void usefoo(double x, double *y, double *z)
{
    *z = foo(x, *y);
}
which results in:
usefoo(double, double*, double*):
        movq    xmm1, QWORD PTR [rdi]
        movq    xmm0, xmm0
        addsd   xmm0, xmm1
        movlpd  QWORD PTR [rsi], xmm0
        ret
How can I achieve, with scalar intrinsics, code equivalent to what the compiler would otherwise generate?
If you wonder why I may want to do that, think about replacing + with <=: if I write x < y, the compiler converts the result to a bool, while the intrinsic keeps it as a 64-bit double bitmask. Hence for my use case writing x < y is not an option; + was just simple enough to illustrate the question.