19

Suppose I want to add two buffers and store the result. Both buffers are already allocated 16-byte aligned. I found two examples of how to do that.

The first one uses _mm_load to read the data from the buffers into SSE registers, does the add operation, and stores the result back to the destination buffer. Until now I would have done it like that.

#include <emmintrin.h>   // SSE2 intrinsics
#include <stdint.h>
#include <stddef.h>

void _add( uint16_t * dst, uint16_t const * src, size_t n )
{
  for( uint16_t const * end( dst + n ); dst != end; dst+=8, src+=8 )
  {
    __m128i _s = _mm_load_si128( (__m128i*) src );   // aligned load of 8 uint16_t
    __m128i _d = _mm_load_si128( (__m128i*) dst );

    _d = _mm_add_epi16( _d, _s );                    // packed 16-bit add

    _mm_store_si128( (__m128i*) dst, _d );           // aligned store of the result
  }
}

The second example just does the add operation directly on the memory addresses, without explicit load/store operations. Both seem to work fine.

void _add( uint16_t * dst, uint16_t const * src, size_t n )
{
  for( uint16_t const * end( dst + n ); dst != end; dst+=8, src+=8 )
  {
    *(__m128i*) dst = _mm_add_epi16( *(__m128i*) dst, *(__m128i*) src );
  }
}

So the question is whether the second example is correct, whether it may have any side effects, and when using load/store is mandatory.

Thanks.

Peter
  • Does anyone know of any "official" document explaining this in depth? I used the "Intel® C++ Intrinsics Reference", but found that it doesn't clearly answer my question. – Peter Jun 14 '12 at 15:52
  • The main purpose of the `load`/`loadu` intrinsics is to communicate alignment information to the compiler, and (for float/double) to type-cast from `float*` to `__m128` or `double*` to `__m128d`. For integer, you have to cast yourself. (But that's fixed with AVX512, where the integer load/store intrinsics take `void*` args.) – Peter Cordes Aug 06 '17 at 07:10

3 Answers

13

Both versions are fine - if you look at the generated code you will see that the second version still generates at least one load to a vector register, since PADDW (aka _mm_add_epi16) can only get its second argument directly from memory.

In practice most non-trivial SIMD code will do a lot more operations between loading and storing data than just a single add, so in general you probably want to load data initially to vector variables (registers) using _mm_load_XXX, perform all your SIMD operations on registers, then store the results back to memory via _mm_store_XXX.
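For illustration, a minimal sketch of that pattern (a hypothetical example, not from the answer): load once per iteration, do several register-to-register operations, store once.

#include <emmintrin.h>
#include <stdint.h>
#include <stddef.h>

// Hypothetical example: average two buffers and add a constant bias,
// keeping all intermediate work in registers between one load pair and one store.
void add_avg_bias( uint16_t * dst, uint16_t const * src, size_t n )
{
    const __m128i bias = _mm_set1_epi16( 42 );                    // arbitrary constant
    for( size_t i = 0; i < n; i += 8 )
    {
        __m128i s = _mm_load_si128( (__m128i const*)(src + i) );  // aligned load
        __m128i d = _mm_load_si128( (__m128i const*)(dst + i) );  // aligned load
        __m128i r = _mm_avg_epu16( d, s );                        // rounded average, registers only
        r = _mm_adds_epu16( r, bias );                            // saturating add of the bias
        _mm_store_si128( (__m128i*)(dst + i), r );                // aligned store
    }
}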

Paul R
  • So what you're saying is basically: if I had more operations that reuse the _d/_s variables, I could save loads in the first example, and otherwise there is no difference? – Peter Jun 14 '12 at 15:50
  • Yes - that's pretty much it - loads and stores should ideally be a relatively small part of your SIMD loop (otherwise you will most likely be memory bandwidth bound rather than compute bound) so it doesn't matter *too* much exactly how data gets from memory to SIMD registers and back again. – Paul R Jun 14 '12 at 16:24
  • @PaulR Is it correct that if you use load and then change the resulting variable, the source will not change, but if you use the pointer and make a change, the source will change? – Martinsos Nov 18 '13 at 19:18
  • @Martinsos: sorry - I don't fully understand what you're asking - maybe you could post a new question with a code example to illustrate what you're asking about ? – Paul R Nov 18 '13 at 21:40
7

The main difference is that in the second version the compiler will generate unaligned loads ( movdqu etc. ) if it cannot prove the pointers to be 16-byte aligned. Depending on the surrounding code, it may not even be possible to write code where this property can be proven by the compiler.

Otherwise there is no difference; the compiler is smart enough to merge two loads and the add into one load and an add-from-memory if it deems that useful, or to split a load-and-add instruction into two.
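For cases where the pointers really might be unaligned, a sketch (not from the answer) of what an explicitly unaligned variant could look like, spelled out with the loadu/storeu intrinsics:

#include <emmintrin.h>
#include <stdint.h>
#include <stddef.h>

// Hypothetical variant for buffers that are not guaranteed to be 16-byte aligned.
void add_unaligned( uint16_t * dst, uint16_t const * src, size_t n )
{
    for( size_t i = 0; i < n; i += 8 )
    {
        __m128i s = _mm_loadu_si128( (__m128i const*)(src + i) );  // movdqu
        __m128i d = _mm_loadu_si128( (__m128i const*)(dst + i) );
        _mm_storeu_si128( (__m128i*)(dst + i), _mm_add_epi16( d, s ) );
    }
}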

If you are using C++, you can also write

void _add( __v8hi* dst, __v8hi const * src, size_t n )
{
    n /= 8;                       // n counts uint16_t elements, 8 per vector
    for( size_t i = 0; i < n; ++i )
        dst[i] += src[i];
}

__v8hi is an abbreviation for "vector of 8 half integers", i.e. typedef short __v8hi __attribute__ ((__vector_size__ (16)));. There are similar predefined types for each vector type, supported by both gcc and icc.

This will result in almost the same code, which may or may not be even faster. But one could argue that it is more readable and it can easily be extended to AVX, possibly even by the compiler.
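For example, a sketch (not from the answer) of how the same loop might look with 256-bit vectors, assuming an AVX2-capable target; the 32-byte typedef here is defined by hand, though gcc's headers provide a similar __v16hi:

// Hypothetical AVX2 version: 16 uint16_t elements per 32-byte vector.
typedef short v16hi __attribute__ ((__vector_size__ (32)));

void _add_avx2( v16hi* dst, v16hi const * src, size_t n )
{
    n /= 16;                      // n counts uint16_t elements, 16 per vector
    for( size_t i = 0; i < n; ++i )
        dst[i] += src[i];         // compiles to vpaddw with -mavx2
}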

Gunther Piez
  • I've never actually seen the compiler generate misaligned loads for that type of casting. Even when the data-type is (intentionally) misaligned. And of course it crashes when I run it. – Mysticial Jun 15 '12 at 07:40
  • I have had this happen to me more than once. AFAIR some unions and casting were involved. – Gunther Piez Jun 15 '12 at 07:42
  • I looked into the assembly of my code and found no MOVDQU instructions. Everything is compiled to MOVDQA, so it seems to be fine. – Peter Jun 18 '12 at 09:30
  • If you want to do unaligned loads/stores using GNU C native vectors, you need to use `__attribute__ ((__vector_size__ (16), aligned(1)))`. See https://stackoverflow.com/questions/18199605/better-way-to-load-vectors-from-memory-clang. gcc's emmintrin.h definition of `__m128i` doesn't use `aligned(1)`, so dereferencing a pointer to it is assumed to be an aligned access. (It does use `__may_alias__`, though, so it's assumed to alias anything, not just `long long`.) – Peter Cordes Aug 02 '17 at 23:16
2

With gcc/clang at least, foo = *dst; is exactly the same as foo = _mm_load_si128(dst);. The _mm_load_si128 way is usually preferred by convention, but plain C/C++ dereferencing of an aligned __m128i* is also safe.


The main purpose of the load/loadu intrinsics is to communicate alignment information to the compiler.

For float/double, they also type-cast between (const) float* and __m128 or (const) double* <-> __m128d. For integer, you still have to cast yourself :(. But that's fixed with AVX512 intrinsics, where the integer load/store intrinsics take void* args.
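A small sketch (not from the answer) of which loads need a cast and which don't:

#include <immintrin.h>
#include <stdint.h>

// Hypothetical snippet: float loads take float const*, SSE integer loads need a
// cast to __m128i const*, and AVX512 integer loads take void const*.
void load_examples( float const *pf, int32_t const *pi )
{
    __m128  vf = _mm_loadu_ps( pf );                      // no cast needed
    __m128i vi = _mm_loadu_si128( (__m128i const*)pi );   // cast yourself
#ifdef __AVX512F__
    __m512i vz = _mm512_loadu_si512( pi );                // takes void*, no cast
    (void)vz;
#endif
    (void)vf; (void)vi;
}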

Compilers can still optimize away dead stores or reloads, and fold loads into memory operands for ALU instructions. But when they do actually emit stores or loads in their assembly output, they do it in a way that won't fault given the alignment guarantees (or lack thereof) in your source.

Using aligned intrinsics lets compilers fold loads into memory operands for ALU instructions with SSE or AVX. But unaligned load intrinsics can only fold with AVX, because SSE memory operands are like movdqa loads. e.g. _mm_add_epi16(xmm0, _mm_loadu_si128(rax)) could compile to vpaddw xmm0, xmm0, [rax] with AVX, but with SSE would have to compile to movdqu xmm1, [rax] / paddw xmm0, xmm1. A load instead of loadu could let it avoid a separate load instruction with SSE, too.


As is normal for C, dereferencing a __m128i* is assumed to be an aligned access, like load_si128 or store_si128.

In gcc's emmintrin.h, the __m128i type is defined with __attribute__ ((__vector_size__ (16), __may_alias__ )).

If it had used __attribute__ ((__vector_size__ (16), __may_alias__, aligned(1) )), gcc would treat a dereference as an unaligned access.
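As a sketch (my own illustration, not part of the answer), a user-defined type with aligned(1) that makes a plain dereference compile to an unaligned load could look like this:

#include <emmintrin.h>

// Hypothetical type whose dereference gcc/clang treat as an unaligned,
// may-alias access, so it compiles to movdqu / vmovdqu instead of movdqa.
typedef long long m128i_u
    __attribute__ ((__vector_size__ (16), __may_alias__, aligned(1)));

__m128i load_any_alignment( void const *p )
{
    return (__m128i) *(m128i_u const*)p;   // unaligned load, won't fault
}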

Peter Cordes
  • Thanks for your answer, but it's so detailed I'm not sure if I got the point. Do you say both versions will compile OK, but if I don't use load the compiler can't decide about alignment and will always assume unaligned memory? – Peter Aug 07 '17 at 09:46
  • @Peter: `foo = *dst` is exactly the same as `foo = _mm_load_si128(dst)`. The "default" when you dereference a `__m128i` is an access that may fault on unaligned. – Peter Cordes Aug 07 '17 at 10:02
  • One thing I just noticed is that ICC 18.0 is using `movdqu` even when I explicitly use `_mm_load_si128` and `_mm_store_si128`. MSVC, GCC and CLANG still generate the expected aligned load/store instructions. Is this a bug, or is it Intel's way of saying "nowadays unaligned loads/stores have little to no impact, so we will simply use unaligned instructions all the time"? – user1593842 Jul 29 '18 at 18:32
  • @user1593842: `movdqu` on aligned addresses is exactly as fast as `movdqa`, on Nehalem and later. (And on AMD Bulldozer and later). http://agner.org/optimize/. IDK why they do that; code can still fault on something like `paddd xmm0, [mem]` if memory isn't aligned, and ICC does still do that. MSVC does the same thing, though. Maybe they just simplified their asm output function to not care about alignment and always use the unaligned version. Maybe they want to be more forgiving of unaligned? Or maybe it's a "cripple AMD" feature; `movdqu` stores (not loads) are slower on K10. – Peter Cordes Jul 29 '18 at 18:58
  • @Peter Cordes: ohhhh, I liked the cripple AMD theory :) "Last time we explicitly created different code paths we got a lot of bad press. Well, no different code paths anymore." – user1593842 Aug 13 '18 at 17:59