
I'm poking around in somebody else's code and currently trying to figure out why _mm_load_si128 exists.

Essentially, I tried replacing

_ra = _mm_load_si128(reinterpret_cast<__m128i*>(&cd->data[idx]));

with

_ra = *reinterpret_cast<__m128i*>(&cd->data[idx]);

and it works and performs exactly the same.

I figured that the load functions exist for smaller types just for the sake of convenience, so people wouldn't have to pack them into contiguous memory manually, but for data that is already laid out correctly, why bother?

Is there something else that _mm_load_si128 does? Or is it essentially just a roundabout way of assigning a value?


1 Answer


There are explicit and implicit loads in SSE.

  • _mm_load_si128(reinterpret_cast<__m128i*>(&cd->data[idx])); is an explicit load
  • *reinterpret_cast<__m128i*>(&cd->data[idx]); is an implicit load

With an explicit load you explicitly instruct the compiler to load the data into an XMM register - this is the "official" Intel way to do it. You can also control whether the load is aligned or unaligned by choosing _mm_load_si128 (aligned) or _mm_loadu_si128 (unaligned).
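
For example (a minimal sketch with a made-up buffer name, not code from the question):

#include <emmintrin.h>
#include <cstdint>

alignas(16) std::int32_t data[8] = {0, 1, 2, 3, 4, 5, 6, 7};  // hypothetical buffer

__m128i load_aligned()
{
    // Explicit aligned load - the address must be 16-byte aligned.
    return _mm_load_si128(reinterpret_cast<const __m128i*>(&data[0]));
}

__m128i load_unaligned()
{
    // Explicit unaligned load - any address will do, e.g. &data[1].
    return _mm_loadu_si128(reinterpret_cast<const __m128i*>(&data[1]));
}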

As an extension, most compilers are also able to generate XMM loads automatically when you do type-punning like this, but then you cannot control whether the load is aligned or unaligned. Since on modern CPUs there is no performance penalty for using unaligned load instructions on data that happens to be aligned, compilers tend to use unaligned loads universally in this case.

Another, more important aspect is that with implicit loads you violate strict aliasing rules, which can result in undefined behavior. It's worth mentioning, though, that - as part of the extension - compilers which support the Intel intrinsics don't tend to enforce strict aliasing rules on the XMM placeholder types __m128, __m128d and __m128i.
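
As a small sketch of that difference (the buffer name is made up; the scalar version is only there to show what ordinary type-punning looks like):

#include <emmintrin.h>
#include <cstdint>

alignas(16) std::uint16_t samples[8];  // hypothetical source buffer

__m128i read_as_vector()
{
    // Implicit load: uint16_t storage read through a __m128i*.
    // Formally this is type-punning; it is only reliable because these
    // compilers define __m128i as a "may alias" type (see below).
    return *reinterpret_cast<const __m128i*>(&samples[0]);
}

std::uint64_t read_as_scalar()
{
    // The same punning through an ordinary scalar type violates strict
    // aliasing and is genuine undefined behavior.
    return *reinterpret_cast<const std::uint64_t*>(&samples[0]);  // UB
}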

Nevertheless I think explicit loads are cleaner and more bulletproof.


Why don't compilers tend to enforce strict aliasing rules on the SSE placeholder types?

The first reason lies in the design of the SSE intrinsics: there are obvious cases where you have to use type-punning, because there is simply no other way to use some of the intrinsics. Mysticial's answer summarizes it perfectly.
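
One classic example (mine, not from Mysticial's answer) is _mm_loadl_epi64, which reads only 8 bytes but still takes a __m128i*, so a cast is unavoidable:

#include <emmintrin.h>
#include <cstdint>

__m128i load_low_64(const std::uint8_t* p)  // p: hypothetical byte buffer
{
    // The parameter type is __m128i const* even though only 64 bits are
    // read (MOVQ), so type-punning is built into the intrinsic's signature.
    return _mm_loadl_epi64(reinterpret_cast<const __m128i*>(p));
}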

As Cody Gray pointed out in the comments, it's worth mentioning that historically the MMX intrinsics (which are now mostly superseded by SSE2) didn't even provide explicit loads or stores - you had to use type-punning.

The second reason (somewhat related to the first) lies in how these placeholder types are defined.

GCC's typedefs for the SSE/SSE2 placeholder types in <xmmintrin.h> and <emmintrin.h>:

/* The Intel API is flexible enough that we must allow aliasing with other
   vector types, and their scalar components.  */

typedef float __m128 __attribute__ ((__vector_size__ (16), __may_alias__));    
typedef long long __m128i __attribute__ ((__vector_size__ (16), __may_alias__));
typedef double __m128d __attribute__ ((__vector_size__ (16), __may_alias__));

The key here is the __may_alias__ attribute, which makes type-punning work on these types even when strict aliasing is enabled with the -fstrict-aliasing flag.
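
As a GCC/clang-specific sketch (the unaligned_i128 name is mine; GCC's headers define a similar unofficial __m128i_u type), the same attribute machinery also lets you control the alignment of an implicit load:

#include <emmintrin.h>

// may_alias + aligned(1): a 16-byte vector type that may be loaded from
// any address, similar to GCC's unofficial __m128i_u typedef.
typedef long long unaligned_i128
    __attribute__ ((__vector_size__ (16), __may_alias__, __aligned__ (1)));

__m128i implicit_aligned(const __m128i* p)
{
    return *p;  // typically compiles to an aligned load (movdqa)
}

__m128i implicit_unaligned(const void* p)
{
    return *static_cast<const unaligned_i128*>(p);  // typically movdqu
}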

Now, since clang and ICC are compatible with GCC, they should follow the same convention. So currently, in these three compilers, implicit loads/stores are somewhat guaranteed to work even with the -fstrict-aliasing flag. Finally, MSVC doesn't perform strict-aliasing-based optimizations at all, so it cannot even be an issue there.

Still, this doesn't mean that you should prefer implicit loads/stores over explicit ones.

  • The key component to this answer is the strict aliasing—explicit loads avoid undefined behavior. Do you have some type of reference for the fact that compilers supporting Intel intrinsics don't enforce strict aliasing rules on XMM types, or is that just based on your own experience? I ask because it fits my experience, too, but just because something works doesn't mean it doesn't risk UB! – Cody Gray - on strike May 27 '17 at 23:48
  • It might also be added that these explicit loads are new for SSE. They weren't provided by the MMX intrinsics, which basically made implicit loads and ugly casts essential for *all* load operations. – Cody Gray - on strike May 27 '17 at 23:50
  • @CodyGray This behavior is not officially associated with Intel intrinsics, however there are obvious cases when the design of the intrinsics forces you to use aliasing - there is no other way. I recommend this answer: https://stackoverflow.com/a/24788226/2430597. – plasmacel May 28 '17 at 00:07
  • @CodyGray In the comments of the linked answer you can find GCC's `typedef` for the `__m128i` type, which is declared with the `__may_alias__` attribute, which makes type-punning work even with the `-fstrict-aliasing` flag. Since clang is compatible with GCC, it should be identical there. Finally MSVC doesn't support strict aliasing at all. So currently in these 3 compilers it is guaranteed to work. Regarding MMX, the question targeted SSE, so I think MMX is not the subject of it. – plasmacel May 28 '17 at 00:08
  • @CodyGray I assume the same is also true for Intel's compiler, especially since it is also compatible with GCC. – plasmacel May 28 '17 at 00:19
  • You can control if the load is aligned or unaligned, see for instance the unofficial `__m128i_u` typedef. – Marc Glisse Dec 30 '17 at 22:20
  • @MarcGlisse `__m128i_u` is not a "standard" Intel placeholder type, it is compiler specific, which is not available in all compilers. – plasmacel Dec 30 '17 at 22:23
  • It seemed that you were already talking about compiler-specific behavior quite a bit, but ok. By the way, I didn't mean for people to use `__m128d_u`, I meant for them to use the aligned attribute (which is used to define `__m128d_u`). – Marc Glisse Dec 30 '17 at 22:28
  • @MarcGlisse Yeah, maybe it's worth mentioning; I will update my answer. – plasmacel Dec 30 '17 at 22:39
  • gcc/clang use aligned loads when dereferencing a `__m128i*`. You only get unaligned if you ask for it. (I'd assume that even ICC and MSVC would still fold such loads into memory operands for SSE instructions, e.g. `paddd xmm0, [rsi]`, which would give them an alignment requirement too. Without AVX, you probably only get unaligned loads if ICC/MSVC decide to use a separate load instruction (like `movdqu`)). – Peter Cordes Dec 31 '17 at 01:48