4

I've noticed that accessing __m128 fields by index is possible in gcc, without using the union trick.

__m128 t;

float r(t[0] + t[1] + t[2] + t[3]);

I can also load a __m128 just like an array:

__m128 t{1.f, 2.f, 3.f, 4.f};

This is all in line with gcc's vector extensions. These, however, may not be available elsewhere. Are the loading and accessing features supported by the Intel compiler and MSVC?

user1095108

3 Answers

3

If you want your code to work on other compilers, then don't use those GCC extensions. Use the set/load/store intrinsics. _mm_setr_ps is fine for setting constant values but should not be used in a loop. To access elements, I normally store the values to an array first and then read the array.
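As a rough sketch of the store-then-read approach, the sum from the question could be written like this. The helper name horizontal_sum is made up for illustration; only the intrinsics come from the SSE headers.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: _mm_storeu_ps

// Sum the four lanes of a vector by spilling it to a plain array first.
// horizontal_sum is a hypothetical helper name, not part of any library.
inline float horizontal_sum(__m128 v)
{
    float lanes[4];
    _mm_storeu_ps(lanes, v);  // unaligned store, valid for any float[4]
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

Calling horizontal_sum(_mm_setr_ps(1.f, 2.f, 3.f, 4.f)) adds the lanes in index order, and it should compile unchanged on GCC, clang, ICC and MSVC since it uses no vector extensions.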

If you have an array a, you should load from it and store to it with

__m128 t = _mm_loadu_ps(a);
_mm_storeu_ps(a, t);

If the array is 16-byte aligned you can use an aligned load/store which is slightly faster on newer systems but much faster on older systems.

__m128 t = _mm_load_ps(a);
_mm_store_ps(a, t);

To get 16-byte aligned memory on the stack use

__declspec(align(16)) const float a[] = ... // MSVC
__attribute__((aligned(16))) const float a[] = ... // GCC, ICC

For 16-byte aligned dynamic arrays use:

float *a = (float*)_mm_malloc(sizeof(float)*n, 16); //MSVC, GCC, ICC, MinGW 
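A small sketch putting the aligned pieces together. It assumes a C++11 compiler, where alignas(16) can stand in for the compiler-specific attributes on a stack array, and it pairs _mm_malloc with _mm_free, which is how memory from _mm_malloc has to be released. The function name fill_aligned_demo is hypothetical.

```cpp
#include <xmmintrin.h>  // _mm_load_ps, _mm_store_ps, _mm_add_ps, _mm_malloc, _mm_free
#include <cstddef>

// Fills a 16-byte aligned heap buffer with doubled copies of a small
// aligned stack array and returns the last element written.
// fill_aligned_demo is a hypothetical name; n must be a positive multiple of 4.
inline float fill_aligned_demo(std::size_t n)
{
    alignas(16) const float init[4] = {1.f, 2.f, 3.f, 4.f};  // aligned stack array (C++11)

    float* a = static_cast<float*>(_mm_malloc(sizeof(float) * n, 16));
    if (!a) return 0.f;  // allocation failed

    const __m128 v = _mm_load_ps(init);          // aligned load is safe here
    for (std::size_t i = 0; i < n; i += 4)
        _mm_store_ps(a + i, _mm_add_ps(v, v));   // a + i stays 16-byte aligned

    const float last = a[n - 1];
    _mm_free(a);  // never pass _mm_malloc memory to free()
    return last;
}
```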
Z boson
  • I like addressing by index, as this allows me not to do those pesky shifts. What about the shifts? – user1095108 Oct 25 '13 at 12:19
  • Shifts? You store to an array and access the array by index like any array. – Z boson Oct 25 '13 at 12:25
  • Ahh, now I see what you mean... From disassembly I've seen that the compiler does some shifting of the xmm registers to implement the index addressing. – user1095108 Oct 25 '13 at 12:29
  • Well you can't have it both ways in this case. Don't expect those GCC extensions to go into MSVC. Those extensions can also lead to non-optimal code if you're not careful. Particularly if you treat the SIMD registers like an array and insert/read one or two elements of the register in a loop. The extensions make this so easy, too easy... – Z boson Oct 25 '13 at 13:40
  • Well, one doesn't need to think of extensions as being particular to `gcc`, these extensions have made it (perhaps partially, I don't have `icc`) into `icc`. They also wouldn't need to get into MSVC in their entirety. – user1095108 Oct 25 '13 at 14:25
2

You can also use macros for this:

//Test to see if we are using MSVC as MSVC AVX types are slightly different to the GCC ones
#ifdef _MSC_VER 
#define GET_F32_AVX_MULTIPLATTFORM(vector,index) (vector).m256_f32[index]
#define GET_F64_AVX_MULTIPLATTFORM(vector,index) (vector).m256d_f64[index]
#else 
#define GET_F32_AVX_MULTIPLATTFORM(vector,index) (vector)[index]
#define GET_F64_AVX_MULTIPLATTFORM(vector,index) (vector)[index]
#endif
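The same idea might be applied to the __m128 type from the question. The sketch below assumes MSVC's m128_f32 union member and the GCC/clang subscript extension; the names GET_F32_SSE and sum_lanes are invented here, and ICC is not covered.

```cpp
#include <xmmintrin.h>  // __m128

// Hypothetical SSE counterpart of the AVX macros above.
#ifdef _MSC_VER
#define GET_F32_SSE(vector, index) (vector).m128_f32[index]  // MSVC: __m128 is a union
#else
#define GET_F32_SSE(vector, index) (vector)[index]           // GCC/clang vector extension
#endif

// Sums the four lanes through the macro; sum_lanes is a made-up helper name.
inline float sum_lanes(__m128 t)
{
    return GET_F32_SSE(t, 0) + GET_F32_SSE(t, 1) + GET_F32_SSE(t, 2) + GET_F32_SSE(t, 3);
}
```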
1

To load a __m128, you can write _mm_setr_ps(1.f, 2.f, 3.f, 4.f), which is supported by GCC, ICC, MSVC and clang.

As far as I know, clang and recent versions of GCC support accessing __m128 elements by index. I don't know how to do this in ICC or MSVC. I guess _mm_extract_ps works for all 4 compilers, but it returns the element's bits as an int rather than a float, which makes it painful to use.

chys
  • You don't want `_mm_extract_ps` for this because a scalar float is just the low element of an XMM register. You want to shuffle the float you want to the bottom of a vector (e.g. with `_mm_shuffle_ps` or `_mm_insert_ps`) for `_mm_cvtss_f32` (which is just a cast, no asm instructions). You don't want the compiler to emit an `extractps` instruction unless the destination is memory; it's basically just `pextrd`. ([Intel SSE: Why does `_mm_extract_ps` return `int` instead of `float`?](https://stackoverflow.com/a/41191977)) – Peter Cordes Sep 16 '21 at 00:18
  • @PeterCordes Contemporary compilers don't interpret `_mm_extract_ps` as a command to generate *exactly* the extractps instruction, but instead "extract an element using whatever instruction the compiler thinks is best". GCC may generate nothing, movshdup, movhlps, shufps, extractps, or pextrd for `_mm_extract_ps`, depending on the index and context. – chys Sep 29 '21 at 11:16
  • MSVC doesn't optimize intrinsics, and *does* use what you asked for. I think ICC is the same. Clang's shuffle optimizer is good, and GCC may or may not figure this out if you type-pun it back to float with `memcpy` or a union or `std::bit_cast`, but it's super inconvenient and leading the compiler in the direction of bad asm. – Peter Cordes Sep 29 '21 at 12:32
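For reference, the shuffle-then-convert approach described in the comments could look roughly like the following. This is only a sketch: get_lane is a hypothetical helper, and the lane index is a template parameter because _mm_shuffle_ps needs a compile-time constant.

```cpp
#include <xmmintrin.h>  // _mm_shuffle_ps, _mm_cvtss_f32, _MM_SHUFFLE

// Returns lane I of v as a float using intrinsics only, so it does not
// depend on the GCC/clang subscript extension.
// get_lane is a hypothetical helper name.
template <int I>
inline float get_lane(__m128 v)
{
    static_assert(I >= 0 && I < 4, "lane index out of range");
    // Broadcast lane I into every position, then read the low element.
    return _mm_cvtss_f32(_mm_shuffle_ps(v, v, _MM_SHUFFLE(I, I, I, I)));
}
```

With this, the sum from the question becomes get_lane<0>(t) + get_lane<1>(t) + get_lane<2>(t) + get_lane<3>(t). Whether the compiler removes the redundant shuffle for lane 0 depends on the compiler and the optimization level.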