8

I have the following piece of C code:

__m128 pSrc1 = _mm_set1_ps(4.0f);
__m128 pDest, m1, m2, m3;   /* m1..m3 were undeclared in the snippet */
int i;
for (i = 0; i < 100; i++) {
    m1 = _mm_mul_ps(pSrc1, pSrc1);
    m2 = _mm_mul_ps(pSrc1, pSrc1);
    m3 = _mm_add_ps(m1, m2);
    pDest = _mm_add_ps(m3, m3);
}

float *arrq = (float*) pDest;

Everything up to the end of the for loop works. What I am trying to do now is convert the __m128 value back to float. Since it stores 4 floats, I thought I could simply cast it to float*. What am I doing wrong? (This is just test code, so don't be surprised by what it computes.) I have tried every conversion I could think of. Thanks for your help.

Anteru

3 Answers

11

You can use _mm_store_ps to store a __m128 vector into a float array.

alignas(16) float result [4];
_mm_store_ps (result, pDest);

// If result is not 16-byte aligned, use _mm_storeu_ps
// On modern CPUs this is just as fast as _mm_store_ps if
// result is 16-byte aligned, but works in all other cases as well
_mm_storeu_ps (result, pDest);

You can then access any / all elements from that temporary array, and if you're lucky the compiler will turn this into a shuffle instead of store/reload if that's more efficient. (If the destination isn't just a temporary and you actually want all 4 elements stored somewhere, then _mm_storeu_ps or store is exactly what you want.)
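As a rough end-to-end sketch of the store approach, the following runs the question's arithmetic once and copies the four lanes into a caller-supplied array. The function name `compute_and_store` is hypothetical (it is not from the question), and the unaligned store is used so the caller does not have to guarantee 16-byte alignment.

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_*_ps */

/* Hypothetical helper, not part of the original question:
   performs one iteration of the question's loop body and
   writes the four resulting floats to out. */
void compute_and_store(float out[4])
{
    __m128 pSrc1 = _mm_set1_ps(4.0f);        /* 4.0f in every lane */
    __m128 m1 = _mm_mul_ps(pSrc1, pSrc1);    /* 16.0f in every lane */
    __m128 m2 = _mm_mul_ps(pSrc1, pSrc1);    /* 16.0f in every lane */
    __m128 m3 = _mm_add_ps(m1, m2);          /* 32.0f in every lane */
    __m128 pDest = _mm_add_ps(m3, m3);       /* 64.0f in every lane */

    /* Unaligned store: valid for any float pointer. */
    _mm_storeu_ps(out, pDest);
}
```

Calling it with a plain `float r[4];` should leave 64.0f in each of `r[0]` through `r[3]`, which can then be read like any ordinary array.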

If you want just the low element, float _mm_cvtss_f32(__m128) is good.
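For an element other than the low one, one option is to shuffle the wanted lane into position 0 first and then use _mm_cvtss_f32. The sketch below does this for element 2; the helper name `extract_lane2` is made up for illustration, and the lane index has to be a compile-time constant because _mm_shuffle_ps takes an immediate.

```c
#include <xmmintrin.h>  /* _mm_shuffle_ps, _mm_cvtss_f32, _MM_SHUFFLE */

/* Hypothetical helper: returns element 2 of v, where element 0
   is the lowest lane. */
static inline float extract_lane2(__m128 v)
{
    /* Broadcast element 2 into all four lanes. */
    __m128 b = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2));
    /* Read the (now) lowest lane as a scalar float. */
    return _mm_cvtss_f32(b);
}
```

This stays in registers, so it may be preferable to a store when only one or two elements are needed.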

If you want to combine the vector elements down to a single float after a loop that sums an array or does a dot-product, see Fastest way to do horizontal SSE vector sum (or other reduction)
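To give an idea of what such a reduction looks like, here is a minimal SSE1-only horizontal sum using two shuffles and two adds. It is one common pattern, not necessarily the fastest on every CPU, and the name `hsum_ps` is chosen here for illustration.

```c
#include <xmmintrin.h>  /* SSE1 intrinsics only */

/* Sketch: adds the four floats of v and returns the total. */
float hsum_ps(__m128 v)
{
    /* Swap neighbours: [v1, v0, v3, v2] */
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));
    /* Pairwise sums: lane 0 = v0+v1, lane 2 = v2+v3 */
    __m128 sums = _mm_add_ps(v, shuf);
    /* Move the high pair's sum (v2+v3) down into lane 0 */
    shuf = _mm_movehl_ps(shuf, sums);
    /* Lane 0 = (v0+v1) + (v2+v3) */
    sums = _mm_add_ss(sums, shuf);
    return _mm_cvtss_f32(sums);
}
```

In a summing loop you would typically accumulate into a __m128 inside the loop and call something like this once at the end.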

Peter Cordes
Anteru
  • Thanks a lot. That was quite easy. I am new to the field, so sorry for the stupid question –  Jan 16 '13 at 20:53
  • 5
    [Watch out with stack variables though, `result` should be 16-byte aligned.](http://stackoverflow.com/questions/841433/gcc-attribute-alignedx-explanation) – user7116 Jan 16 '13 at 20:57
2

I believe casting works if you cast properly. I don't have the code in front of me, but I'm pretty sure this worked for me:

float *arrq = reinterpret_cast<float*>(&pDest);

Note that this uses a C++ cast, which names what you are doing, and that it casts the address of pDest (not its value) to a pointer.
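Since the question is about C (where reinterpret_cast is not available) and a comment below raises strict-aliasing concerns about reading a vector through a float pointer, a memcpy-based copy is one alternative that avoids the pointer cast entirely. This is only a sketch; `vec_to_floats` is a hypothetical name, and compilers commonly optimise a fixed-size memcpy like this into a plain vector store.

```c
#include <string.h>     /* memcpy */
#include <xmmintrin.h>  /* __m128 */

/* Hypothetical helper: copies the 16 bytes of v into out[0..3]
   without forming a float* that points at the vector object. */
void vec_to_floats(__m128 v, float out[4])
{
    memcpy(out, &v, sizeof v);
}
```

Element 0 of the vector ends up in `out[0]`, matching the order that _mm_storeu_ps would produce.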

Aaron D. Marasco
  • This is indeed the way to go if you want to avoid needless copying. Also many C++ coders should learn to use C++ casting. Though it's cumbersome to write (well, not really with a good editor and completion), it improves readability. – St0fF Aug 25 '16 at 10:17
  • This is strict aliasing undefined behaviour and may break in practice. At least pointing an `int*` onto a `__m256i` can break in practice: [GCC AVX \_\_m256i cast to int array leads to wrong values](https://stackoverflow.com/q/71364764) – Peter Cordes Oct 18 '22 at 17:25
1

You can also use _mm_cvtss_f32 to convert the low element directly without touching memory, which is convenient if you are only dealing with a few values. The _mm_storeu_ps answer is better if you are processing a whole array.

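__m128 reg = _mm_set1_ps(4.0f);   /* any vector; initialised here so the read is defined */
float val = _mm_cvtss_f32(reg);   /* extracts the lowest element */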
manylegged