21

There are generally two types of SIMD instructions:

A. Ones that work with aligned memory addresses and raise a general-protection (#GP) exception if the address is not aligned on the operand-size boundary:

movaps  xmm0, xmmword ptr [rax]
vmovaps ymm0, ymmword ptr [rax]
vmovaps zmm0, zmmword ptr [rax]

B. And ones that work with unaligned memory addresses and do not raise such an exception:

movups  xmm0, xmmword ptr [rax]
vmovups ymm0, ymmword ptr [rax]
vmovups zmm0, zmmword ptr [rax]
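
For reference, the same split exists at the C intrinsics level; a minimal sketch of the standard Intel intrinsics (assuming GCC or Clang, with AVX enabled for the 256-bit forms):

#include <immintrin.h>

/* 128-bit loads: compile to movaps / movups respectively */
__m128 load_a(const float *p)    { return _mm_load_ps(p);  }   /* aligned   */
__m128 load_u(const float *p)    { return _mm_loadu_ps(p); }   /* unaligned */

/* 256-bit AVX loads: compile to vmovaps / vmovups respectively */
__m256 load256_a(const float *p) { return _mm256_load_ps(p);  }
__m256 load256_u(const float *p) { return _mm256_loadu_ps(p); }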

But I'm just curious, why would I want to shoot myself in the foot and use aligned memory instructions from the first group at all?

MikeF
  • 1,021
  • 9
  • 29
  • For performance, of course. Accessing aligned memory is faster, it's done in one memory access cycle and it doesn't miss/flush the cache on every access. See https://stackoverflow.com/questions/2006216/why-is-data-structure-alignment-important-for-performance – memo Sep 03 '18 at 10:02
  • 7
    The aligned vs non-aligned load distinction is a historical artefact (see [this](https://software.intel.com/en-us/forums/intel-isa-extensions/topic/752392#comment-1916147)). Today unaligned loads perform the same - though a naturally aligned operand has the benefit of never crossing a cache line or a page. – Margaret Bloom Sep 03 '18 at 10:08
  • 1
    @memo linked answers are full of misinformation and outdated information. Unaligned operations only have some minor penalties now. Anyway, since Nehalem it's the alignment of the address that matters, not which instruction you use. – harold Sep 03 '18 at 10:45
  • @harold Thanks, I guess you learn something new every day. So then, the *movaps instructions are historical, for compatibility reasons? – memo Sep 03 '18 at 10:54
  • 2
    @memo mostly yes, there is still a use as a built-in "assert aligned", [some compilers have stopped using them](https://stackoverflow.com/q/42697118/555045) – harold Sep 03 '18 at 11:03
  • @harold - the penalties are fairly small on Intel, but not exactly close to zero: cache-line-crossing loads and stores have half the throughput (and increased latency, I think, but I forget how much). On AMD the penalties are much more significant and include penalties for misaligned accesses that are entirely within one cache line. However, as you mention - it is only actual alignment that matters: both instructions perform equivalently for aligned values. – BeeOnRope Sep 03 '18 at 22:47
  • 3
    @harold Both Microsoft and Intel have taken this to a new level. As of VS2017 and ICC2018, both compilers will generate unaligned moves even for pre-Nehalem targets. MS has received [strong negative feedback](https://developercommunity.visualstudio.com/content/problem/19160/regression-from-vs-2015-in-ssseavx-instructions-ge.html) on this, but they don't care anymore since pre-Nehalem is too old. – Mysticial Sep 04 '18 at 17:39
  • @MargaretBloom: sorry to resurrect this discussion. Something just crossed my mind. Do you know if those aligned SSE instructions execute atomically vs. unaligned ones? Say, AVX-512 instructions on a 64-byte address boundary. – MikeF Sep 25 '18 at 07:34
  • @MikeF IIRC atomicity is only guaranteed for naturally aligned load/stores but I believe that current implementations are atomic at the cache line level. – Margaret Bloom Sep 25 '18 at 10:32
  • @MargaretBloom Thanks. [This is the only reference](https://i.imgur.com/0TpY1rw.png) I can find: "[Intel® 64 and IA-32 Architectures, Software Developer’s Manual, Volume 3 (3A, 3B, 3C & 3D): System Programming Guide](https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-system-programming-manual-325384.pdf)", section 8.1.1. Although it says little about SIMD instructions specifically. Are they trying to say that even aligned SIMD instructions are not atomic? – MikeF Sep 25 '18 at 19:21
  • 1
    @MikeF Possibly. Each store to the cache is atomic but older CPUs with a narrow bus width will implement a SSE store as two/four *independent* stores. Each store is pushed and then flushed from the store buffer independently and if the third faults due to delayed TLB invalidation (see 4.10.4.4) then the first may have already been flushed to the cache. I believe that Intel is saying that they are free to implement SIMD loads/stores as sequence of repeated load/store uOPs. Will a `lock` prefix fix this? I don't see how. Why don't you ask here on SO officially? It's interesting! – Margaret Bloom Sep 27 '18 at 14:40
  • @MargaretBloom: Thanks for the explanation. I'll try to remember to ask in a separate thread. (Too busy now.) – MikeF Sep 29 '18 at 08:05
  • Just for the record re: atomicity: 16-byte aligned load/store are only guaranteed atomic on CPUs with the AVX feature flag, only documented many years after the fact. `movaps` isn't atomic on K8 or Core 1 or earlier, where it runs as 2 uops, 32-byte aligned load/store aren't *guaranteed* atomic on paper by anything, but are in practice on many CPUs. [SSE instructions: which CPUs can do atomic 16B memory operations?](https://stackoverflow.com/q/7646018) / https://rigtorp.se/isatomic/ – Peter Cordes Apr 02 '23 at 20:21

2 Answers

21
  • Unaligned access: Only movups/vmovups can be used. The same penalties discussed for the aligned-access case (see next) apply here too. In addition, accesses that cross a cache line or virtual page boundary always incur a penalty on all processors.
  • Aligned access:
    • On Intel Nehalem and later (including Silvermont and later) and AMD Bulldozer and later: after predecoding, they are executed in exactly the same way for the same operands. This includes support for move elimination. For the fetch and predecode stages, they consume exactly the same resources for the same operands.
    • On pre-Nehalem, Bonnell, and pre-Bulldozer: they get decoded into different fused-domain and unfused-domain uops. movups/vmovups consume more resources (up to twice as many) in the frontend and the backend of the pipeline. In other words, movups/vmovups can be up to twice as slow as movaps/vmovaps in terms of latency and/or throughput.

Therefore, if you don't care about the older microarchitectures, the two are technically equivalent. Although, if you know or expect your data to be aligned, you should use the aligned instructions: they verify the alignment for you, so a misaligned pointer faults immediately instead of requiring explicit checks in the code.
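
A minimal sketch of that idea (the allocation scheme and names here are illustrative, not from the answer): the buffer is requested with 16-byte alignment, so _mm_load_ps/_mm_store_ps (movaps) are safe, and if a bug ever passes in a misaligned pointer, the load faults instead of silently doing an unaligned access.

#include <immintrin.h>
#include <stdio.h>
#include <stdlib.h>

/* p must be 16-byte aligned; movaps raises #GP (a crash) otherwise */
static void scale4(float *p, float k)
{
    __m128 v = _mm_load_ps(p);              /* aligned load  -> movaps */
    v = _mm_mul_ps(v, _mm_set1_ps(k));
    _mm_store_ps(p, v);                     /* aligned store -> movaps */
}

int main(void)
{
    /* C11 aligned_alloc: size must be a multiple of the alignment */
    float *buf = aligned_alloc(16, 4 * sizeof(float));
    if (!buf) return 1;
    for (int i = 0; i < 4; ++i) buf[i] = (float)(i + 1);
    scale4(buf, 2.0f);
    printf("%g %g %g %g\n", buf[0], buf[1], buf[2], buf[3]);  /* 2 4 6 8 */
    free(buf);
    return 0;
}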

Hadi Brais
  • 22,259
  • 3
  • 54
  • 95
  • Thanks. I'm curious though: if both tend to be roughly the same in performance on modern CPUs, why didn't they eliminate that #GP exception in the (v)movaps instructions? Why not just alias them? – MikeF Sep 03 '18 at 18:11
  • @MikeF The instructions have different encodings and existing applications may require one or both instructions. So both encodings need to be supported to run such applications. Also aligned versions implement the alignment checks in hardware, which may eliminate the need to perform these checks in software for code that requires aligned data. – Hadi Brais Sep 03 '18 at 18:23
  • 1
    @MikeF - because once an instruction is defined one way in the ISA you cannot generally change its behavior through a simple doc update! Exceptions are part of this behavior. – BeeOnRope Sep 03 '18 at 22:49
  • Another factor is memory disambiguation: on Sandybridge (and possibly some newer arches), per the [Intel Arch Manual](https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf#page=58): "The following loads are not disambiguated. The execution of these loads is stalled until addresses of all previous stores are known. • Loads that cross the 16-byte boundary • 32-byte Intel AVX loads that are not 32-byte aligned." Which could be a significant difference if the workload had intermixed loads / stores. – Noah Jun 21 '21 at 15:45
  • I've tested that this isn't the case on Tigerlake but it may also affect skylake / haswell. There don't seem to be any notes on when this was changed. @PeterCordes – Noah Jun 21 '21 at 15:45
  • 1
    @Noah: Not sure this is the ideal place for these comments either; you could post it as an answer on [What's the actual effect of successful unaligned accesses on x86?](https://stackoverflow.com/q/12491578). (Or maybe on [How can I accurately benchmark unaligned access speed on x86\_64](https://stackoverflow.com/a/45129784) to discuss how to actually benchmark the difference). This Q&A is mostly about the fact that `movups` has no penalty when the address is actually aligned at run-time on modern CPUs, but not earlier. – Peter Cordes Jun 21 '21 at 16:37
11

I think there is a subtle difference between using _mm_loadu_ps and _mm_load_ps even on "Intel Nehalem and later (including Silvermont and later) and AMD Bulldozer and later" which can have an impact on performance.

Operations which fold a load and another operation such as multiplication into one instruction can only be done with load, not loadu intrinsics, unless you compile with AVX enabled to allow unaligned memory operands.

Consider the following code

#include <x86intrin.h>
__m128 foo(float *x, float *y) {
    __m128 vx = _mm_loadu_ps(x);
    __m128 vy = _mm_loadu_ps(y);
    return vx*vy;   /* operator* on __m128 is a GCC/Clang extension */
}

This gets converted to

movups  xmm0, XMMWORD PTR [rdi]
movups  xmm1, XMMWORD PTR [rsi]
mulps   xmm0, xmm1

However, if the aligned load intrinsics (_mm_load_ps) are used, it's compiled to

movaps  xmm0, XMMWORD PTR [rdi]
mulps   xmm0, XMMWORD PTR [rsi]

which saves one instruction. But if the compiler can use VEX-encoded loads (i.e. AVX is enabled), it's only two instructions for the unaligned case as well:

vmovups xmm0, XMMWORD PTR [rsi]
vmulps  xmm0, xmm0, XMMWORD PTR [rdi]

Therefore, for aligned access there is no difference in performance between the movaps and movups instructions on Intel Nehalem and later (including Silvermont and later) or AMD Bulldozer and later.

But there can be a difference in performance between the _mm_loadu_ps and _mm_load_ps intrinsics when compiling without AVX enabled, in cases where the compiler's tradeoff is not movaps vs. movups but movups vs. folding a load into an ALU instruction. (That happens when the vector is used as an input to only one thing; otherwise the compiler uses a mov* load to get the result into a register for reuse.)
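
To illustrate that parenthetical, a hypothetical variant (not from the answer) where the second vector is reused: the compiler has to keep vy in a register with a mov* load either way, so the aligned intrinsic saves nothing here and the load/loadu choice only acts as an alignment assertion.

#include <x86intrin.h>

/* vy is used twice, so it gets its own movups/movaps load regardless;
   only vx could potentially be folded into the multiply. */
__m128 bar(float *x, float *y) {
    __m128 vx = _mm_loadu_ps(x);
    __m128 vy = _mm_loadu_ps(y);
    return _mm_add_ps(_mm_mul_ps(vx, vy), vy);
}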

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
Z boson
  • 32,619
  • 11
  • 123
  • 226
  • The OP is asking about asm instructions, not load intrinsics. Still, upvoted for a useful related point. (AVX instructions don't require their memory operands to be aligned, but SSE does, so compiling `loadu` intrinsics without AVX can cost you extra instructions which matters even on modern CPUs.) – Peter Cordes Sep 18 '18 at 07:34
  • @PeterCordes, I realized my error before your comment and already fixed it :-) – Z boson Sep 18 '18 at 07:38
  • @PeterCordes Is your edit "Operations which fold a load and another operation such as multiplication into one instruction can only be done with load, not loadu intrinsics." accurate. The fold can be done for `loadu` if it's vex encoded. – Z boson Sep 18 '18 at 08:03
  • That's true, updated if you think it's better to mention AVX there, too. Your original had the same simplification of only talking about SSE at first, I thought that's what you were going for. – Peter Cordes Sep 18 '18 at 08:09
  • @PeterCordes I mostly wanted to point out that some people might take the wrong conclusion about a 1-1 map between the intrinsics and instructions which could have an impact (e.g. just use `loadups` because `movups` and `movaps` don't make a difference - that's not necessarily correct). – Z boson Sep 18 '18 at 08:15
  • 1
    Yes, that's why I upvoted. I think this answer makes it well now. – Peter Cordes Sep 18 '18 at 08:22
  • So guys, sorry, let me get it straight. The reason it compiled into 3 instructions for unaligned memory is because `mulps` supports only an aligned memory operand, correct? – MikeF Sep 20 '18 at 05:18
  • 1
    @MikeF read-modify (e.g. mul + read) operations require aligned memory with SSE but not with AVX. – Z boson Sep 20 '18 at 07:18
  • @PeterCordes, something has been bothering me about changing my answer to intrinsics. The OP may really want to know about read operations in general and not just specific read instructions. At a more basic level the aligned and unaligned micro-ops have no performance difference for aligned memory. However, there is a difference on how those micro-ops are used as either a single read or a modify-read. The problem is that people normally only think of pure reads and not modify-read. And I think the OP probably wanted to know about both. – Z boson Sep 20 '18 at 07:29
  • If you write `movaps` in asm, nothing can fold it into a memory operand. That can only happen at *compile* time with a `load` intrinsic, not at assemble-time or run-time with `movaps`. `movaps xmm0, [rdi]` / `addps xmm1, xmm0` does *not* get micro-fused at runtime. That's why this answer *needs* to use intrinsics not asm mnemonics to correctly explain how the compiler can optimize. Or with the blocks of asm output, to show how the optimization removes the `movaps` instruction entirely, replacing with an instruction that will decode into micro-fused (Intel) or just a memory-source (AMD). – Peter Cordes Sep 20 '18 at 07:43
  • Or did you mean that you want to explain more about micro-fusion, to show how on Intel SnB-family CPUs, an instruction with a memory operand takes 2 entries in the RS to hold its 2 unfused-domain uops? (Fun fact: a micro-fused load+ALU uop on P6 family fits in one RS entry, even though they still need to dispatch to separate execution ports. [Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths](https://stackoverflow.com/a/51989771)) – Peter Cordes Sep 20 '18 at 07:48
  • @PeterCordes, I meant simply that there are at least two kinds of SIMD read instructions: read-only and modify+read. And for SSE there is an asymmetry between aligned and unaligned such that only aligned modify+reads are possible. The OP only asked about read instructions but may have wanted to know about read+modify instructions also. – Z boson Sep 20 '18 at 07:58
  • Oh, yes I agree, it's an often-parroted answer that Nehalem has efficient unaligned loads, and then incorrectly extrapolating from `movups` being efficient to `_mm_loadu_ps` being efficient. All my edits were to make that point *more* clearly. It's micro-fusion that makes it a win (or AMD's behaviour of only needing 1 m-op / uop even with a mem src). In terms of uops, `addps xmm1, [rdi]` decodes on Intel to the same kind of uop as a `movaps` load, micro-fused with the same kind of uop as `addps xmm0,xmm1`, so you could add that in, but your first version seemed confused about isns vs. intrin – Peter Cordes Sep 20 '18 at 08:47