Almost every _ss and _ps intrinsic / instruction has a double version with a _sd or _pd suffix (Scalar Double or Packed Double).
For example, search for (double in Intel's intrinsic finder to find intrinsic functions that take a double as the first arg. Or just figure out what optimal asm would be, then look up the intrinsics for those instructions in the insn ref manual. Except that it doesn't list all the intrinsics for movsd, so searching for an instruction name in the intrinsics finder often works better.
re: header files: always just include <immintrin.h>. It includes all Intel SSE/AVX intrinsics.
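As a minimal illustration of the naming scheme and the single umbrella header (a generic sketch, not code from the question; the wrapper names are made up):

#include <immintrin.h>   // one header covers all Intel SSE/AVX intrinsics

// The same operation exists in all four flavours; only the suffix changes.
__m128  add_ps_demo(__m128 a, __m128 b)   { return _mm_add_ps(a, b); }  // packed single
__m128d add_pd_demo(__m128d a, __m128d b) { return _mm_add_pd(a, b); }  // packed double
__m128  add_ss_demo(__m128 a, __m128 b)   { return _mm_add_ss(a, b); }  // scalar single
__m128d add_sd_demo(__m128d a, __m128d b) { return _mm_add_sd(a, b); }  // scalar double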
See also ways to put a float into a vector, and the sse tag wiki for links about how to shuffle vectors (i.e. the tables of shuffle instructions in Agner Fog's optimizing assembly guide).
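As one generic shuffle example (just an illustration, not something from the tag wiki), broadcasting element 0 of a vector is a single shufps:

#include <immintrin.h>

// Replicate element 0 into all four lanes: { v0, v0, v0, v0 }
__m128 broadcast_low(__m128 v) {
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));   // one shufps
}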
(see below for a godbolt link to some interesting compiler output)
re: your sequence
Only use _mm_move_ss (or sd) if you actually want to merge two vectors.

You don't show how m is defined. Your use of a as the variable name for both the float and the vector implies that the only useful information in the vector is the float arg. The variable-name clash of course means it doesn't compile.
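If merging two vectors really is what you want, a correct use of _mm_move_sd looks something like this (a sketch with invented names):

#include <immintrin.h>

// Take the low double from src, keep the high double from dst:
// result = { src[0], dst[1] }. Compiles to a single movsd reg,reg.
__m128d merge_low(__m128d dst, __m128d src) {
    return _mm_move_sd(dst, src);
}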
There unfortunately doesn't seem to be any way to just "cast" a float or double into a vector with garbage in the upper element(s), like there is for __m128 -> __m256: __m256 _mm256_castps128_ps256 (__m128 a). I posted a new question about this limitation with intrinsics: How to merge a scalar into a vector without the compiler wasting an instruction zeroing upper elements? Design limitation in Intel's intrinsics?
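For comparison, this is what that 128-to-256 cast looks like in use (requires AVX; the wrapper name is made up). The upper 128 bits are left undefined, so no instruction is needed:

#include <immintrin.h>

// Reinterpret a __m128 as the low half of a __m256; the upper half is garbage.
__m256 widen_ps(__m128 lo) {
    return _mm256_castps128_ps256(lo);
}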
I tried using _mm_undefined_pd() to achieve this, hoping this would clue in the compiler that it can just leave the incoming high garbage in place:
// don't use this, it doesn't make better code
__m128d double_to_vec_highgarbage(double x) {
__m128d undef = _mm_undefined_pd();
__m128d x_zeroupper = _mm_set_sd(x);
return _mm_move_sd(undef, x_zeroupper);
}
but clang3.8 compiles it to
# clang3.8 -O3 -march=core2
movq xmm0, xmm0 # xmm0 = xmm0[0],zero
ret
So no advantage, still zeroing the upper half instead of compiling it to just a ret. gcc actually makes pretty bad code:
double_to_vec_highgarbage: # gcc5.3 -march=nehalem
movsd QWORD PTR [rsp-16], xmm0 # %sfp, x
movsd xmm1, QWORD PTR [rsp-16] # D.26885, %sfp
pxor xmm0, xmm0 # __Y
movsd xmm0, xmm1 # tmp93, D.26885
ret
_mm_set_sd appears to be the best way to turn a scalar into a vector.
__m128d double_to_vec(double x) {
return _mm_set_sd(x);
}
clang compiles it to a movq xmm0,xmm0; gcc to a store/reload with -march=generic.
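In context, a typical use would be building the vector with _mm_set_sd, doing the packed work, then reading the scalar back out with _mm_cvtsd_f64 (a hypothetical example, not from the question):

#include <immintrin.h>

double sqrt_via_vector(double x) {
    __m128d v = _mm_set_sd(x);   // { x, 0.0 }
    v = _mm_sqrt_pd(v);          // some packed op; the upper element stays 0.0 here
    return _mm_cvtsd_f64(v);     // extract the low element
}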
Other interesting compiler outputs from the float and double versions on the Godbolt compiler explorer:
float_to_vec: # gcc 5.3 -O3 -march=core2
movd eax, xmm0 # x, x
movd xmm0, eax # D.26867, x
ret
float_to_vec: # gcc5.3 -O3 -march=nehalem
insertps xmm0, xmm0, 0xe # D.26867, x
ret
double_to_vec: # gcc5.3 -O3 -march=nehalem. It could have used movq or insertps, instead of this longer-latency store-forwarding round trip
movsd QWORD PTR [rsp-16], xmm0 # %sfp, x
movsd xmm0, QWORD PTR [rsp-16] # D.26881, %sfp
ret
float_to_vec: # clang3.8 -O3 -march=core2 or generic (no -march)
xorps xmm1, xmm1
movss xmm1, xmm0 # xmm1 = xmm0[0],xmm1[1,2,3]
movaps xmm0, xmm1
ret
double_to_vec: # clang3.8 -O3 -march=core2, nehalem, or generic (no -march)
movq xmm0, xmm0 # xmm0 = xmm0[0],zero
ret
float_to_vec: # clang3.8 -O3 -march=nehalem
xorps xmm1, xmm1
blendps xmm0, xmm1, 14 # xmm0 = xmm0[0],xmm1[1,2,3]
ret
So both clang and gcc use different strategies for float vs. double, even when they could use the same strategy.
Using integer operations like movq between floating-point operations causes extra bypass-delay latency. Using insertps to zero the upper elements of the input register should be the best strategy for float or double, so all compilers should use that when SSE4.1 is available. xorps + blend is good, too, and can run on more ports than insertps. The store/reload is probably the worst, unless we're bottlenecked on ALU throughput and latency doesn't matter.
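If you want to spell out those instruction choices with intrinsics, a sketch looks like this (assumes SSE4.1, i.e. compile with -msse4.1 or higher; it takes a __m128 argument because, as discussed above, there's no garbage-preserving float-to-vector cast, and compilers may still rewrite it):

#include <immintrin.h>

// insertps xmm0, xmm0, 0xE: keep element 0, zero elements 1..3
__m128 keep_low_insertps(__m128 x) {
    return _mm_insert_ps(x, x, 0x0E);
}

// xorps + blendps: same result, blending in a zeroed vector for elements 1..3
__m128 keep_low_blend(__m128 x) {
    return _mm_blend_ps(x, _mm_setzero_ps(), 0xE);
}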