
I have trouble understanding the usage of SSE intrinsics to store results of some SIMD calculation back into "normal variables". For example the _mm_store_ps intrinsic is described in the "Intel Intrinsics Guide" as follows:

void _mm_store_ps (float* mem_addr, __m128 a)

Store 128-bits (composed of 4 packed single-precision (32-bit) floating-point elements) from a into memory. mem_addr must be aligned on a 16-byte boundary or a general-protection exception may be generated.

The first argument is a pointer to a float, which has a size of 32 bits. But the description states that the intrinsic will copy 128 bits from a into the target mem_addr.

  • Does mem_addr need to be an array of 4 floats?
  • How can I access only a specific 32-bit element in a and store it in a single float?
  • What am I missing conceptually?
  • Are there better options than the _mm_store_ps intrinsic?

Here is a simple struct where doSomething() adds 1 to x/y of the struct. What's missing is the part on how to store the result back into x/y, given that only the lower 32-bit elements 0 & 1 are used, while 2 & 3 are unused.

struct vec2 {
   union {
      struct {
         float data[2];
      };
      struct {
         float x, y;
      };
   };

   void doSomething() {
      __m128 v1 = _mm_setr_ps(x, y, 0, 0);
      __m128 v2 = _mm_setr_ps(1, 1, 0, 0);
      __m128 result = _mm_add_ps(v1, v2);
      // ?? How to store results in x,y ??
   }
};
n1198943
  • Use `_mm_store_sd` to do a 64-bit store of the low half of a vector. Or [`_mm_storel_pi`](http://felixcloutier.com/x86/MOVLPS.html) (`movlps`). Instead of `_mm_setr`, you could use `_mm_load_sd((float*)&vec.x)` to do a 64-bit load that zero-extends to a 128-bit vector. – Peter Cordes Oct 13 '18 at 13:56
  • `mem_addr` doesn't need to be declared as a `float[]` but it needs to be properly aligned, which can be done with aligned allocation like `_mm_malloc` or `aligned_alloc`, although `malloc` should already allocate to `alignof(std::max_align_t)` bytes. If data is not dynamically allocated then the `alignas` keyword should be used (on your `data` field of `vec2` type for example). – Jack Oct 13 '18 at 13:59

1 Answer


It's a 128-bit load or store, so yes, the arg is effectively a float mem[4]. Remember that in C, passing an array to a function / intrinsic is the same as passing a pointer.

Intel's intrinsics are somewhat special because they don't follow the normal strict-aliasing rules, at least for integer. (e.g. _mm_loadu_si128((const __m128i*)some_pointer) doesn't violate strict-aliasing even if it's a pointer to long.) I think the same applies to the float/double load/store intrinsics, so you can safely use them to load/store from/to whatever you want. Usually you'd use _mm_load_ps to load single-precision FP bit patterns, though, and usually you'd be keeping those in C objects of type float.

How can I access only a specific 32bit element in a and store it in a single float?

Use a vector shuffle to bring the element you want down to lane 0, then _mm_cvtss_f32 to cast the vector to scalar (that cast compiles to zero instructions).


loading / storing 64 bits

Ideally you could operate on 2 vectors at once packed together, or an array of X values and an array of Y values, so with a pair of vectors you'd have the X and Y values for 4 pairs of XY coordinates. (struct-of-arrays instead of array-of-structs).

But you can express what you're trying to do efficiently like this:

struct vec2 {
    float x,y;
};

void foo(const struct vec2 *in, struct vec2 *out) {
    __m128d tmp = _mm_load_sd( (const double*)in );  //64-bit zero-extending load with MOVSD
    __m128  inv = _mm_castpd_ps(tmp);             // keep the compiler happy
    __m128  result = _mm_add_ps(inv,  _mm_setr_ps(1, 1, 0, 0) );

    _mm_storel_pi( (__m64*)out, result );        // 64-bit store of the low half (movlps)
}

GCC 8.2 compiles it like this (on Godbolt), for x86-64 System V, strangely using movq instead of movsd for the load. gcc 6.3 uses movsd.

foo(vec2 const*, vec2*):
        movq    xmm0, QWORD PTR [rdi]           # 64-bit integer load
        addps   xmm0, XMMWORD PTR .LC0[rip]     # packed 128-bit float add
        movlps  QWORD PTR [rsi], xmm0           # 64-bit store
        ret

For a 64-bit store of the low half of a vector (2 floats or 1 double), you can use _mm_store_sd. Or better, _mm_storel_pi (movlps). Unfortunately the intrinsic for it wants a __m64* arg instead of float*, but that's just a design quirk of Intel's intrinsics; they often require casting.

Notice that instead of _mm_setr, I used _mm_load_sd((const double*)in) to do a 64-bit load that zero-extends to a 128-bit vector. You don't want a movlps load because that merges into an existing vector, creating a false dependency on whatever value was there before and costing an extra ALU uop.

Peter Cordes
  • Thank you for the detailed explanation and taking the time to do so. I also didn't think of doing a shuffle before _mm_cvtss_f32. – n1198943 Oct 13 '18 at 20:26
  • Also note that a scalar float in asm is just a value in the bottom of an XMM register, unlike a scalar int which is normally in a GP-integer register. So `_mm_cvtss_f32` is free, while `_mm_cvtsi128_si32` is normally a `movd eax, xmm0` (or directly to memory depending on what you do with the int.) – Peter Cordes May 24 '21 at 20:16
  • Also related: [Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?](https://stackoverflow.com/q/52112605) - even intrinsics like `_mm_load_sd(const double*)` are supposed to be strict-aliasing safe, i.e. you can use it to load two floats. Last time I tested, it actually was safe in most compilers. – Peter Cordes May 24 '21 at 20:18