
I recently noticed that

__m128 m = _mm_set_ps(0,1,2,3);

puts the 4 floats into reverse order when cast to a float array:

float *p = (float*)(&m);
// p[0] == 3
// p[1] == 2
// p[2] == 1
// p[3] == 0

The same happens with a union { __m128 m; float a[4]; }.

Why do SSE operations use this ordering? It's not a big deal but slightly confusing.

And a follow-up question:

When accessing elements in the array by index, should one access in the order 0..3 or the order 3..0 ?

Peter Cordes
Inverse
    Related: [Convention for displaying vector registers](https://stackoverflow.com/q/41351087). The only thing that's "reversed" is the arg order of `_mm_set` intrinsics; everything else is normal little-endian; don't access your elements backwards unless that's easier for some other reason. Also, aliasing a `float*` onto a `__m128` is not well-defined behaviour across compilers (strict-aliasing violation); see [print a \_\_m128i variable](https://stackoverflow.com/a/46752535) – Peter Cordes Mar 29 '22 at 07:47

3 Answers


It's just a convention; they had to pick some order, and it really doesn't matter what the order is as long as everyone follows it. Intel happens to like little-endianness.

As far as accessing by index goes... the best thing is to try to avoid doing it. Nothing kills vector performance like element-wise accesses. If you must, try to set things up so that the indexing matches the hardware vector lanes; that's what most vector programmers (in my experience) will expect.
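
If element access really cannot be avoided, one way to keep it well defined is to store the vector into a plain float array instead of aliasing a pointer onto it. The following is only a minimal sketch, assuming an x86 target with SSE; `lanes_to_array` and `lowest_lane` are hypothetical helper names used for illustration, not part of any library.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_storeu_ps, _mm_cvtss_f32 */

/* Hypothetical helper: copy the four lanes of v to out[0..3] in memory
 * (lane) order. _mm_storeu_ps has no alignment requirement on out. */
static void lanes_to_array(__m128 v, float out[4])
{
    _mm_storeu_ps(out, v);
}

/* Hypothetical helper: read only lane 0 (the lowest lane) without
 * going through memory. */
static float lowest_lane(__m128 v)
{
    return _mm_cvtss_f32(v);
}
```

Called on `_mm_set_ps(0, 1, 2, 3)`, `lanes_to_array` leaves 3 in `out[0]` and 0 in `out[3]`, and `lowest_lane` returns 3: index 0 is the lowest lane, which is the last argument of `_mm_set_ps`.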

Stephen Canon
  • Intel doesn’t follow their own order. You can test with e.g. _mm_extract_ps() intrinsic that accepts integer index. – Soonts May 21 '18 at 23:22

Depending on what you would like to do, you can use either _mm_set_ps or _mm_setr_ps.

__m128 _mm_setr_ps (float z, float y, float x, float w )

Sets the four SP FP values to the four inputs in reverse order.
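
To illustrate how the two intrinsics relate, here is a small sketch, assuming an x86 target with SSE. `same_lanes` is a hypothetical helper written for this example; it compares two vectors lane by lane.

```c
#include <xmmintrin.h>  /* SSE: _mm_cmpeq_ps, _mm_movemask_ps */

/* Hypothetical helper: returns 1 if all four lanes of a and b compare
 * equal, 0 otherwise. _mm_cmpeq_ps sets a lane to all-ones where the
 * inputs match; _mm_movemask_ps gathers the four sign bits into an int. */
static int same_lanes(__m128 a, __m128 b)
{
    return _mm_movemask_ps(_mm_cmpeq_ps(a, b)) == 0xF;
}
```

With this helper, `same_lanes(_mm_set_ps(0, 1, 2, 3), _mm_setr_ps(3, 2, 1, 0))` is 1: the two calls build the same vector, just with the arguments written in opposite order.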

phuclv
echo

Isn't that consistent with the little-endian nature of x86 hardware, i.e. the same way it stores the bytes of a long long?
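
The analogy can be checked with a short sketch. It assumes a little-endian host such as x86, and `byte_at` is a hypothetical helper introduced only for this illustration.

```c
#include <stdint.h>
#include <string.h>  /* memcpy, size_t */

/* Hypothetical helper: return the byte stored at memory offset `index`
 * within a 64-bit integer. memcpy avoids any aliasing issues. */
static unsigned char byte_at(uint64_t value, size_t index)
{
    unsigned char bytes[sizeof value];
    memcpy(bytes, &value, sizeof value);
    return bytes[index];
}
```

On a little-endian machine, `byte_at(0x0807060504030201ULL, 0)` is `0x01`: the least significant byte sits at the lowest address, just as the last argument of `_mm_set_ps` sits in lane 0.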

Bo Persson
  • Yes, it's just normal little endian ordering. For SSE (and SIMD programming in general) the actual order of the elements doesn't matter too much in general, *except* for when you start changing the width of elements (packing/unpacking, etc) or doing anything which accesses specific elements (permutations, insert/extract, etc). – Paul R Mar 08 '11 at 21:18