
I have trouble understanding the usage of SSE intrinsics to store results of some SIMD calculation back into "normal variables". For example the _mm_store_ps intrinsic is described in the "Intel Intrinsics Guide" as follows:

void _mm_store_ps (float* mem_addr, __m128 a)

Store 128-bits (composed of 4 packed single-precision (32-bit) floating-point elements) from a into memory. mem_addr must be aligned on a 16-byte boundary or a general-protection exception may be generated.

The first argument is a pointer to a float, which has a size of 32 bits. But the description states that the intrinsic will copy 128 bits from a into the target mem_addr.

  • Does mem_addr need to be an array of 4 floats?
  • How can I access only a specific 32-bit element in a and store it in a single float?
  • What am I missing conceptually?
  • Are there better options than the _mm_store_ps intrinsic?

Here is a simple struct where doSomething() adds 1 to x/y of the struct. What's missing is the part on how to store the result back into x/y, given that only the lower 32-bit elements 0 & 1 are used, while 2 & 3 are unused.

struct vec2 {
   union {
      struct {
         float data[2];
      };
      struct {
         float x, y;
      };
   };

   void doSomething() {
      __m128 v1 = _mm_setr_ps(x, y, 0, 0);
      __m128 v2 = _mm_setr_ps(1, 1, 0, 0);
      __m128 result = _mm_add_ps(v1, v2);
      // ?? How to store results in x,y ??
   }
};
n1198943
  • Use `_mm_store_sd` to do a 64-bit store of the low half of a vector. Or [`_mm_storel_pi`](http://felixcloutier.com/x86/MOVLPS.html) (`movlps`). Instead of `_mm_setr`, you could use `_mm_load_sd((float*)&vec.x)` to do a 64-bit load that zero-extends to a 128-bit vector. – Peter Cordes Oct 13 '18 at 13:56
  • `mem_addr` doesn't need to be declared as a `float[]` but it needs to be properly aligned, which can be done with aligned allocation like `_mm_malloc` or `aligned_alloc`, although `malloc` should already allocate to `alignof(std::max_align_t)` bytes. If data is not dynamically allocated then the `alignas` keyword should be used (on your `data` field of `vec2` type for example). – Jack Oct 13 '18 at 13:59

1 Answer


It's a 128-bit load or store, so yes, the arg is effectively a float mem[4]. Remember that in C, passing an array to a function / intrinsic is the same as passing a pointer.

Intel's intrinsics are somewhat special because they don't follow the normal strict-aliasing rules, at least for integer. (e.g. _mm_loadu_si128((const __m128i*)some_pointer) doesn't violate strict-aliasing even if it's a pointer to long.) I think the same applies to the float/double load/store intrinsics, so you can safely use them to load/store from/to whatever you want. Usually you'd use _mm_load_ps to load single-precision FP bit patterns, though, and usually you'd be keeping those in C objects of type float.

How can I access only a specific 32bit element in a and store it in a single float?

Use a vector shuffle to bring the element you want down to lane 0, then _mm_cvtss_f32 to cast the vector to scalar (that cast compiles to zero instructions).


loading / storing 64 bits

Ideally you could operate on 2 vectors at once packed together, or an array of X values and an array of Y values, so with a pair of vectors you'd have the X and Y values for 4 pairs of XY coordinates. (struct-of-arrays instead of array-of-structs).

But you can express what you're trying to do efficiently like this:

struct vec2 {
    float x,y;
};

void foo(const struct vec2 *in, struct vec2 *out) {
    __m128d tmp = _mm_load_sd( (const double*)in );  //64-bit zero-extending load with MOVSD
    __m128  inv = _mm_castpd_ps(tmp);             // keep the compiler happy
    __m128  result = _mm_add_ps(inv,  _mm_setr_ps(1, 1, 0, 0) );

    _mm_storel_pi( (__m64*)out, result );        // 64-bit store of the low half (movlps)
}

GCC 8.2 compiles it like this (on Godbolt), for x86-64 System V, strangely using movq instead of movsd for the load. gcc 6.3 uses movsd.

foo(vec2 const*, vec2*):
        movq    xmm0, QWORD PTR [rdi]           # 64-bit integer load
        addps   xmm0, XMMWORD PTR .LC0[rip]     # packed 128-bit float add
        movlps  QWORD PTR [rsi], xmm0           # 64-bit store
        ret

For a 64-bit store of the low half of a vector (2 floats or 1 double), you can use _mm_store_sd. Or better, _mm_storel_pi (movlps). Unfortunately the intrinsic for it wants a __m64* arg instead of float*, but that's just a design quirk of Intel's intrinsics; they often require casting.

Notice that instead of _mm_setr, I used _mm_load_sd((const double*)in) to do a 64-bit load that zero-extends to a 128-bit vector. You don't want a movlps load because that merges into an existing vector, creating a false dependency on whatever value was there before and costing an extra ALU uop.

Peter Cordes
  • Thank you for the detailed explanation and taking the time to do so. I also didn't think of doing a shuffle before _mm_cvtss_f32. – n1198943 Oct 13 '18 at 20:26
  • Also note that a scalar float in asm is just a value in the bottom of an XMM register, unlike a scalar int which is normally in a GP-integer register. So `_mm_cvtss_f32` is free, while `_mm_cvtsi128_si32` is normally a `movd eax, xmm0` (or directly to memory depending on what you do with the int.) – Peter Cordes May 24 '21 at 20:16
  • Also related: [Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?](https://stackoverflow.com/q/52112605) - even intrinsics like `_mm_load_sd(const double*)` are supposed to be strict-aliasing safe, i.e. you can use it to load two floats. Last time I tested, it actually was safe in most compilers. – Peter Cordes May 24 '21 at 20:18