I have trouble understanding the usage of SSE intrinsics to store results of some SIMD calculation back into "normal variables". For example the _mm_store_ps intrinsic is described in the "Intel Intrinsics Guide" as follows:
void _mm_store_ps (float* mem_addr, __m128 a)
Store 128-bits (composed of 4 packed single-precision (32-bit) floating-point elements) from a into memory. mem_addr must be aligned on a 16-byte boundary or a general-protection exception may be generated.
The first argument is a pointer to a float which has a size of 32bits. But the description states, that the intrinsic will copy 128 bits from a into the target mem_addr.
- Does mem_addr need to be an array of 4 floats?
- How can I access only a specific 32bit element in a and store it in a single float?
- What am I missing conceptually?
- Are there better options then the _mm_store_ps intrinsic?
Here is a simple struct where doSomething() adds 1 to x/y of the struct. Whats missing is the part on how to store the result back into x/y while only the higher 32bit wide elements 2 & 3 are used, while 1 & 0 are unused.
struct vec2 {
union {
struct {
float data[2];
};
struct {
float x, y;
};
};
void doSomething() {
__m128 v1 = _mm_setr_ps(x, y, 0, 0);
__m128 v2 = _mm_setr_ps(1, 1, 0, 0);
__m128 result = _mm_add_ps(v1, v2);
// ?? How to store results in x,y ??
}
}