How to implement "_mm_storeu_epi64" without aliasing problems?

Question

(Note: Although this question is about "store", the "load" case has the same issues and is perfectly symmetric.)

The SSE intrinsics provide an _mm_storeu_pd function with the following signature:

void _mm_storeu_pd (double *p, __m128d a);

So if I have vector of two doubles, and I want to store it to an array of two doubles, I can just use this intrinsic.

However, my vector is not two doubles; it is two 64-bit integers, and I want to store it to an array of two 64-bit integers. That is, I want a function with the following signature:

void _mm_storeu_epi64 (int64_t *p, __m128i a);

But the intrinsics provide no such function. The closest they have is _mm_storeu_si128:

void _mm_storeu_si128 (__m128i *p, __m128i a);

The problem is that this function takes a pointer to __m128i, while my array is an array of int64_t. Writing to an object via the wrong type of pointer is a violation of strict aliasing and is definitely undefined behavior. I am concerned that my compiler, now or in the future, will reorder or otherwise optimize away the store thus breaking my program in strange ways.

To be clear, what I want is a function I can invoke like this:

__m128i v = _mm_set_epi64x(2,1);
int64_t ra[2];
_mm_storeu_epi64(&ra[0], v); // does not exist, so I want to implement it

Here are six attempts to create such a function.

Attempt #1

void _mm_storeu_epi64(int64_t *p, __m128i a) {
    _mm_storeu_si128(reinterpret_cast<__m128i *>(p), a);
}

This appears to have the strict aliasing problem I am worried about.

Attempt #2

void _mm_storeu_epi64(int64_t *p, __m128i a) {
    _mm_storeu_si128(static_cast<__m128i *>(static_cast<void *>(p)), a);
}

Possibly better in general, but I do not think it makes any difference in this case.

Attempt #3

void _mm_storeu_epi64(int64_t *p, __m128i a) {
    union TypePun {
        int64_t a[2];
        __m128i v;
     };
    TypePun *p_u = reinterpret_cast<TypePun *>(p);
    p_u->v = a;
}

This generates incorrect code on my compiler (GCC 4.9.0), which emits an aligned movaps instruction instead of an unaligned movups. (The union is aligned, so the reinterpret_cast tricks GCC into assuming p_u is aligned, too.)

Attempt #4

void _mm_storeu_epi64(int64_t *p, __m128i a) {
    union TypePun {
        int64_t a[2];
        __m128i v;
     };
    TypePun *p_u = reinterpret_cast<TypePun *>(p);
    _mm_storeu_si128(&p_u->v, a);
}

This appears to emit the code I want. The "type-punning via union" trick, although technically undefined in C++, is widely-supported. But is this example -- where I pass a pointer to an element of a union rather than access via the union itself -- really a valid way to use the union for type-punning?

Attempt #5

void _mm_storeu_epi64(int64_t *p, __m128i a) {
    p[0] = _mm_extract_epi64(a, 0);
    p[1] = _mm_extract_epi64(a, 1);
}

This works and is perfectly valid, but it emits two instructions instead of one.

Attempt #6

void _mm_storeu_epi64(int64_t *p, __m128i a) {
    std::memcpy(p, &a, sizeof(a));
}

This works and is perfectly valid... I think. But it emits frankly terrible code on my system. GCC spills a to an aligned stack slot via an aligned store, then manually moves the component words to the destination. (Actually it spills it twice, once for each component. Very strange.)

...

Is there any way to write this function that will (a) generate optimal code on a typical modern compiler and (b) have minimal risk of running afoul of strict aliasing?

It's actually very difficult to avoid violating strict-aliasing when writing SSE intrinsic code. I believe all the compilers treat the vector types as aggregates of the base types which means you can freely cast back back and forth. Personally, I use #1 for function parameters and #4 for stack variables. — Mysticial, Jul 16 '14 at 17:56
@Mysticial: You might be the most-qualified person on SO to answer this question, so thank you. But what is the "base type" of `__m128i`? I thought it was any of 8-, 16-, 32-, or 64-bit integer, depending on which intrinsic you call... Or are you saying it is effectively a union of all of these options, so #1 is actually a safe way to write to an array of char, short, etc. ? — Nemo, Jul 16 '14 at 18:01
Visual Studio treats `__m128i` as a 4-way union for all those types, but VS doesn't do strict-aliasing anyway. I'm not sure how GCC handles it, but I bet it's the same thing. Turn on `-Wall` on GCC and see if it complains about #1. Interestingly, they've fixed this problem in AVX512 intrinsics by changing all the pointer types to `void*`. — Mysticial, Jul 16 '14 at 18:06
I agree with Mysticial; this is one of those places where, in my opinion, writing code for best code generation and writing code to fully comply with C/C++ standards are competing goals. While it is technically undefined behavior, I can't see any practical case where the compiler would think to do anything that would foul up your program. — Jason R, Jul 16 '14 at 18:07
@Mysticial: If you feel motivated enough to turn these comments into some kind of answer, I will accept it. — Nemo, Jul 16 '14 at 18:11
@JasonR: This is not just undefined behavior; it is a violation of "strict aliasing" rules, which compilers absolutely do rely upon to optimize. Follow the links in the question if you have never encountered this problem. The compiler has to treat `__m128i` very specially for this not to be a serious issue. (If you define a similar struct by hand with similar supporting functions, GCC will definitely screw you up.) — Nemo, Jul 16 '14 at 18:16
@Nemo: That makes sense; I hadn't read the details of your application closely enough. I've never run into any problems with aliasing, but then again I can't think of a time where I have tried to do what you're doing (using SIMD intrinsics to populate an array of a scalar type in the same function). — Jason R, Jul 17 '14 at 11:04

Mysticial · Accepted Answer · 2014-08-03T05:30:19.110

20

SSE intrinsics is one of those niche corner cases where you have to push the rules a bit.

Since these intrinsics are compiler extensions (somewhat standardized by Intel), they are already outside the specification of the C and C++ language standards. So it's somewhat self-defeating to try to be "standard compliant" while using a feature that clearly is not.

Despite the fact that the SSE intrinsic libraries try to act like normal 3rd party libraries, underneath, they are all specially handled by the compiler.

The Intent:

The SSE intrinsics were likely designed from the beginning to allow aliasing between the vector and scalar types - since a vector really is just an aggregate of the scalar type.

But whoever designed the SSE intrinsics probably wasn't a language pedant.
^{(That's not too surprising. Hard-core low-level performance programmers and language lawyering enthusiasts tend to be very different groups of people who don't always get along.)}

We can see evidence of this in the load/store intrinsics:

__m128i _mm_stream_load_si128(__m128i* mem_addr) - A load intrinsic that takes a non-const pointer?
void _mm_storeu_pd(double* mem_addr, __m128d a) - What if I want to store to __m128i*?

The strict aliasing problems are a direct result of these poor prototypes.

Starting from AVX512, the intrinsics have all been converted to void* to address this problem:

__m512d _mm512_load_pd(void const* mem_addr)
void _mm512_store_epi64 (void* mem_addr, __m512i a)

Compiler Specifics:

Visual Studio defines each of the SSE/AVX types as a union of the scalar types. This by itself allows strict-aliasing. Furthermore, Visual Studio doesn't do strict-aliasing so the point is moot:
The Intel Compiler has never failed me with all sorts of aliasing. It probably doesn't do strict-aliasing either - though I've never found any reliable source for this.
GCC does do strict-aliasing, but from my experience, not across function boundaries. It has never failed me to cast pointers which are passed in (on any type). GCC also declares SSE types as __may_alias__ thereby explicitly allowing it to alias other types.

My Recommendation:

For function parameters that are of the wrong pointer type, just cast it.
For variables declared and aliased on the stack, use a union. That union will already be aligned so you can read/write to them directly without intrinsics. (But be aware of store-forwarding issues that come with interleaving vector/scalar accesses.)
If you need to access a vector both as a whole and by its scalar components, consider using insert/extract intrinsics instead of aliasing.
When using GCC, turn on -Wall or -Wstrict-aliasing. It will tell you about strict-aliasing violations.

edited Aug 03 '14 at 05:30

answered Jul 16 '14 at 18:37

Mysticial

464,885
45
335
332

1

"GCC does do strict-aliasing, but from my experience, not across function boundaries." Even for inlined functions? – Nemo Jul 16 '14 at 18:38
That's worth investigating. I'm not sure of the answer myself. – Mysticial Jul 16 '14 at 18:40
Note that SSE types were never intended to be stored in memory as-is in the first place. I don't know why the pointer type in the signature is `__m128i *`, but generally SSE and memory-backed variables don't mix well. – user541686 Jul 16 '14 at 19:10
@Mehrdad I dunno, I regularly allocate arrays of `__m128d` and such. Typically as scratch memory. – Mysticial Jul 16 '14 at 19:12
@Mehrdad: Everything comes from or goes to memory eventually... @Mysticial: I always compile with `-Wall` and none of these examples generates a warning. I found GCC's typedef for __m128i in `emmintrin.h`: `typedef long long __m128i __attribute__ ((__vector_size__ (16), __may_alias__));` -- so I guess that explains it. Thanks again. – Nemo Jul 16 '14 at 19:15
@Nemo: I think SSE registers were made to be tied to registers, not to be stored directly into memory. That's why they have specialized functions for dealing with memory. – user541686 Jul 16 '14 at 19:34
@Mysticial: Are you talking about statically-sized arrays or dynamic ones? If you mean static, aren't those implicitly still bound to registers (as spilling as allows anyway) rather than in memory? What I mean is that doing something that forces them to be stored in memory generally hasn't ended well for me (obviously implicit spill is fine though, and not something I can control anyway). – user541686 Jul 16 '14 at 19:35
@Mehrdad Dynamic. Think `(__m128d*)aligned_malloc(10000000 * sizeof(__m128d))`. I do that a lot in apps/algorithms that are completely *designed* around the SIMD vector. – Mysticial Jul 16 '14 at 19:37
@Mysticial: And you actually *load* and *store* them like regular variables too? I (and the compiler) don't really care about the *pointer* types per se, but storing/retrieving variables of SSE types from memory is what I'm concerned about (like `__m128d *p = ...; *p = blah;`) – user541686 Jul 16 '14 at 19:38
@Mehrdad Yeah. In some of my code, SIMD vectors are no different than any other datatype. `void func(__m128d *a,__m128d *b){a[0] = _mm_add_pd(a[0],b[0])}` – Mysticial Jul 16 '14 at 19:44
@Mysticial: Which compiler? I've felt like I've gotten different results with `_mm_loadXYZ` and `_mm_storeXYZ` than with plain load/store within the language (for some reason plain memory operations seemed to be not optimized as well, but I can't recall a specific example to demonstrate). – user541686 Jul 16 '14 at 20:08
@Mehrdad I've never noticed a difference - and definitely not in performance. The compiler needs to have that logic anyway to deal with register spills. – Mysticial Jul 16 '14 at 20:54
@Mysticial: Weird, ok. If I come across it again I'll let you know, because I'm sure I've seen weird stuff happen when I've tried that (in MSVC anyway). – user541686 Jul 16 '14 at 21:10
@Mysticial: On that note, what's the point of having all those `loadu`/`storeu` intrinsics if we can just deal with memory directly? – user541686 Jul 16 '14 at 21:12
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/57447/discussion-between-mysticial-and-mehrdad). – Mysticial Jul 16 '14 at 21:13
@Mysticial: I actually don't have time for chat right now sorry! Maybe later in the day when I'm back. – user541686 Jul 16 '14 at 21:15
4

The mention of the `may_alias` attribute (https://gcc.gnu.org/onlinedocs/gcc/Type-Attributes.html) is a bit hidden in the comments, it would be nice to add it to the answer. – Marc Glisse Aug 02 '14 at 12:28
@MarcGlisse Done. Thanks for suggesting that. :) – Mysticial Aug 03 '14 at 05:30
Do you know why AVX512 has `vmovdqa32` and `vmovdqa64`? GCC seems to only generate `vmovdqa64`. I suspect there is no real difference just like `movaps` are `movapd` make no difference and probably never will. – Z boson Dec 30 '15 at 11:41
1

@Zboson The masking. If you don't use the mask, then they're the same. – Mysticial Dec 30 '15 at 19:36
1

The way I like to think about it is that the `load` vs. `loadu` intrinsics exist mostly to communicate alignment guarantees or lack-thereof to the compiler. For `ps` / `pd`, they also work as a cast, but for integer types it's just ugly. AVX512's `void*` intrinsics are a welcome improvement, esp. for C (where no cast is needed to convert to/from `void*`). – Peter Cordes May 03 '16 at 15:44
Current ICC19 does do strict-aliasing optimizations, and doesn't even respect `typedef int __attribute__((may_alias)) aliasing_int;`. e.g. see dead-store elimination in `test_movd_typedef_aliasing_int` in https://godbolt.org/z/sdKENn. Current GCC doesn't define `_mm_loadu_si32()` so I was trying to roll my own portable and *efficient* version, but seems impossible without typedefs for different GNU-dialect C compilers (ICC, clang, GCC). Scalar FP load/store intrinsics like `_mm_store_ss` are defined with `may_alias` internals for clang at least, and `_mm_storel_epi64` is also aliasing safe. – Peter Cordes Apr 07 '20 at 15:39