Almost every _ss and _ps intrinsic / instruction has a double version with a _sd or _pd suffix (Scalar Double or Packed Double).
For example, search for (double in Intel's intrinsic finder to find intrinsic functions that take a double as the first arg. Or just figure out what optimal asm would be, then look up the intrinsics for those instructions in the insn ref manual. Except that it doesn't list all the intrinsics for movsd, so searching for an instruction name in the intrinsics finder often works better.
re: header files: always just include <immintrin.h>. It includes all Intel SSE/AVX intrinsics.
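As a minimal illustration of the naming scheme and the single umbrella header (a generic sketch, not code from the question; the wrapper names are made up):

#include <immintrin.h>   // one header covers all Intel SSE/AVX intrinsics

// The same operation exists in all four flavours; only the suffix changes.
__m128  add_ps_demo(__m128 a, __m128 b)   { return _mm_add_ps(a, b); }  // packed single
__m128d add_pd_demo(__m128d a, __m128d b) { return _mm_add_pd(a, b); }  // packed double
__m128  add_ss_demo(__m128 a, __m128 b)   { return _mm_add_ss(a, b); }  // scalar single
__m128d add_sd_demo(__m128d a, __m128d b) { return _mm_add_sd(a, b); }  // scalar double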
See also ways to put a float into a vector, and the sse tag wiki for links about how to shuffle vectors (i.e. the tables of shuffle instructions in Agner Fog's optimizing assembly guide).
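As one generic shuffle example (just an illustration, not something from the tag wiki), broadcasting element 0 of a vector is a single shufps:

#include <immintrin.h>

// Replicate element 0 into all four lanes: { v0, v0, v0, v0 }
__m128 broadcast_low(__m128 v) {
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));   // one shufps
}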
(see below for a godbolt link to some interesting compiler output)
re: your sequence
Only use _mm_move_ss (or sd) if you actually want to merge two vectors.

You don't show how m is defined. Your use of a as the variable name for both the float and the vector implies that the only useful information in the vector is the float arg. The variable-name clash of course means it doesn't compile.
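If merging two vectors really is what you want, a correct use of _mm_move_sd looks something like this (a sketch with invented names):

#include <immintrin.h>

// Take the low double from src, keep the high double from dst:
// result = { src[0], dst[1] }. Compiles to a single movsd reg,reg.
__m128d merge_low(__m128d dst, __m128d src) {
    return _mm_move_sd(dst, src);
}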
There unfortunately doesn't seem to be any way to just "cast" a float or double into a vector with garbage in the upper element(s), like there is for __m128 -> __m256: __m256 _mm256_castps128_ps256 (__m128 a). I posted a new question about this limitation with intrinsics: How to merge a scalar into a vector without the compiler wasting an instruction zeroing upper elements? Design limitation in Intel's intrinsics?
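For comparison, this is what that 128-to-256 cast looks like in use (requires AVX; the wrapper name is made up). The upper 128 bits are left undefined, so no instruction is needed:

#include <immintrin.h>

// Reinterpret a __m128 as the low half of a __m256; the upper half is garbage.
__m256 widen_ps(__m128 lo) {
    return _mm256_castps128_ps256(lo);
}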
I tried using _mm_undefined_pd() to achieve this, hoping this would clue in the compiler that it can just leave the incoming high garbage in place:
// don't use this, it doesn't make better code
__m128d double_to_vec_highgarbage(double x) {
__m128d undef = _mm_undefined_pd();
__m128d x_zeroupper = _mm_set_sd(x);
return _mm_move_sd(undef, x_zeroupper);
}
but clang3.8 compiles it to
# clang3.8 -O3 -march=core2
movq xmm0, xmm0 # xmm0 = xmm0[0],zero
ret
So no advantage, still zeroing the upper half instead of compiling it to just a ret. gcc actually makes pretty bad code:
double_to_vec_highgarbage: # gcc5.3 -march=nehalem
movsd QWORD PTR [rsp-16], xmm0 # %sfp, x
movsd xmm1, QWORD PTR [rsp-16] # D.26885, %sfp
pxor xmm0, xmm0 # __Y
movsd xmm0, xmm1 # tmp93, D.26885
ret
_mm_set_sd appears to be the best way to turn a scalar into a vector.
__m128d double_to_vec(double x) {
return _mm_set_sd(x);
}
clang compiles it to a movq xmm0,xmm0; gcc to a store/reload with -march=generic.
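In context, a typical use would be building the vector with _mm_set_sd, doing the packed work, then reading the scalar back out with _mm_cvtsd_f64 (a hypothetical example, not from the question):

#include <immintrin.h>

double sqrt_via_vector(double x) {
    __m128d v = _mm_set_sd(x);   // { x, 0.0 }
    v = _mm_sqrt_pd(v);          // some packed op; the upper element stays 0.0 here
    return _mm_cvtsd_f64(v);     // extract the low element
}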
Other interesting compiler outputs from the float and double versions on the Godbolt compiler explorer:
float_to_vec: # gcc 5.3 -O3 -march=core2
movd eax, xmm0 # x, x
movd xmm0, eax # D.26867, x
ret
float_to_vec: # gcc5.3 -O3 -march=nehalem
insertps xmm0, xmm0, 0xe # D.26867, x
ret
double_to_vec: # gcc5.3 -O3 -march=nehalem. It could have used movq or insertps, instead of this longer-latency store-forwarding round trip
movsd QWORD PTR [rsp-16], xmm0 # %sfp, x
movsd xmm0, QWORD PTR [rsp-16] # D.26881, %sfp
ret
float_to_vec: # clang3.8 -O3 -march=core2 or generic (no -march)
xorps xmm1, xmm1
movss xmm1, xmm0 # xmm1 = xmm0[0],xmm1[1,2,3]
movaps xmm0, xmm1
ret
double_to_vec: # clang3.8 -O3 -march=core2, nehalem, or generic (no -march)
movq xmm0, xmm0 # xmm0 = xmm0[0],zero
ret
float_to_vec: # clang3.8 -O3 -march=nehalem
xorps xmm1, xmm1
blendps xmm0, xmm1, 14 # xmm0 = xmm0[0],xmm1[1,2,3]
ret
So both clang and gcc use different strategies for float vs. double, even when they could use the same strategy.
Using integer operations like movq between floating-point operations causes extra bypass-delay latency. Using insertps to zero the upper elements of the input register should be the best strategy for float or double, so all compilers should use that when SSE4.1 is available. xorps + blend is good, too, and can run on more ports than insertps. The store/reload is probably the worst, unless we're bottlenecked on ALU throughput and latency doesn't matter.
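If you want to spell out those instruction choices with intrinsics, a sketch looks like this (assumes SSE4.1, i.e. compile with -msse4.1 or higher; it takes a __m128 argument because, as discussed above, there's no garbage-preserving float-to-vector cast, and compilers may still rewrite it):

#include <immintrin.h>

// insertps xmm0, xmm0, 0xE: keep element 0, zero elements 1..3
__m128 keep_low_insertps(__m128 x) {
    return _mm_insert_ps(x, x, 0x0E);
}

// xorps + blendps: same result, blending in a zeroed vector for elements 1..3
__m128 keep_low_blend(__m128 x) {
    return _mm_blend_ps(x, _mm_setzero_ps(), 0xE);
}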