
I'm starting to learn a little bit about SIMD intrinsics. I noticed that some functions have an aligned and an unaligned version, for example `_mm_store_si128` and `_mm_storeu_si128`. My question is: do these functions perform differently, and if not, why are there two different versions?

Paul R
user1235183

  • [SSE: why, technically, is 16-aligned data faster to move?](http://stackoverflow.com/q/24963646/995714), [Purpose of memory alignment](http://stackoverflow.com/q/381244/995714), [Are word-aligned loads faster than an unaligned loads on x64 processors](http://stackoverflow.com/q/9364159/995714) – phuclv Aug 18 '15 at 07:43

3 Answers


On older CPUs there is a substantial performance difference between aligned and unaligned loads/stores. On more recent CPUs the difference is much less significant, but as a "rule of thumb" you should still prefer the aligned version wherever possible.

Paul R

I'd say "always align (wherever possible)", that way you are covered no matter what. Some platforms do not support unaligned access at all; others take a substantial performance hit. If you go for aligned access you get optimal performance in either case. There may be a small memory cost on some platforms, but it is well worth it, because if you go SIMD, you go for performance. I can think of no reason to implement an unaligned code path, except maybe to deal with some old design that wasn't built with SIMD in mind, and I'd say the odds of that are slim to none.

I'd say the same applies to scalars as well: proper alignment is correct practice in any case, and saves you some trouble on the way to optimal performance...

As for why unaligned access might be slower or even unsupported: it comes down to how the hardware works. Say you have a 64-bit integer and a 64-bit memory controller. If the integer is properly aligned, the memory controller can fetch it in a single operation. If it is offset, the controller has to perform two operations, and the CPU may need to shift the data around to reassemble it. Since that is suboptimal, some platforms don't even support it implicitly, as a means of enforcing efficiency.

dtech
    You're right in general, but there are several situations where an unaligned load/store is the best way to go, even on older CPUs. – Paul R Aug 18 '15 at 07:47
  • @PaulR - care to elaborate? After all, it was in the question, but your answer didn't reflect on that at all. – dtech Aug 18 '15 at 07:48
  • 1
    OK - I was a bit time-constrained earlier, but one example is neighbourhood operations on pre-SSSE3 CPUs - there were no useful double vector horizontal shift instructions prior to MNI, so unaligned loads were often a necessary evil. There are also cases where unaligned access is more convenient and the latency can be hidden if there are enough other operations going on. Other examples: supporting external APIs (e.g. if you want to write a SIMD-optimised memcpy or other standard/legacy function). – Paul R Aug 18 '15 at 08:41
  • @PaulR: Good explanation, but I'd add that for large buffers, you'd usually want to do aligned ops for most of the data. Either do smaller moves until you get to an alignment boundary and then do aligned ops for the rest, or, for something like `memcpy`, do one 16B unaligned copy, then `p &= ~15UL` to align the pointer (rounding down), then a loop of 16B aligned copies. (The first of these may overlap with the unaligned first 16B.) This gives you aligned performance without branches and code bloat. (You still need a cleanup loop for the last up-to-15B either way.) – Peter Cordes Aug 18 '15 at 16:27
  • @PeterCordes - speaking purely from common sense, I'd say for larger buffers you will get a smaller perf penalty than for isolated accesses, because for large buffers, half of what you "miss" will be needed for the next iteration, so it will be in cache, whereas for isolated vectors half of what you fetch will be cache garbage. – dtech Aug 18 '15 at 16:30
  • @ddriver: for a large buffer, every 4th read will cross a cache line if they're unaligned. There's still a penalty for that. If you're not limited by L1 cache bandwidth (read / write port uops) or memory latency, it's not so much of an issue. For a small buffer, if 2 unaligned reads get all your data, it's not worth doing any extra work to make one of the two loads aligned. Usually it will take some extra startup code to get the rest of your loads aligned, which isn't worth the cost if there's not enough benefit to amortize it. – Peter Cordes Aug 18 '15 at 16:32
  • @PeterCordes - this is true for contemporary hardware, but the perf degradation is still less than what you would experience from misaligned single vectors. You will most likely get cache garbage for every access, plus the issue you make note of. – dtech Aug 18 '15 at 16:35
  • Plenty of byte-oriented kernels fundamentally deal with unaligned data. I'm not talking about simple memcpy or other streaming algorithms where you might have an unaligned start, but it's simple to get into alignment, but cases where you have variably sized elements and need to perform stores or loads at arbitrary alignments. That either means unaligned loads, or a mess of variable shuffles or `palignr` or something which on recent hardware is almost always slower. – BeeOnRope Aug 20 '15 at 20:11
  • @BeeOnRope - that just doesn't sound like something that will benefit from SIMD. SIMD is for HPC, and HPC workloads are always aligned properly. That's why CPUs also have ALUs... – dtech Aug 20 '15 at 22:04
  • 2
    On the contrary non-HPC codes can and do benefit immensely from SIMD. Take a look at integer compression codes or DNA processing. I'm sorry but if you think SIMD is (only) for HPC you are dead wrong (there's a reason SIMD instruction sets have been put on consumer and enterprise CPUs and marketed to consumers and business from day one). If you think that HPC workloads only make use of aligned loads, you are also wrong (although most do) - the whole HPC space isn't LINPACK and friends. – BeeOnRope Aug 20 '15 at 22:09

If the data is in fact aligned, an unaligned load / store will have performance identical to an aligned one.

  • unaligned ops: Unaligned data will cause a small performance hit, but your program still works.

  • aligned ops: Unaligned data will cause a fault, letting you detect accidentally-unaligned data instead of silently causing a performance hit.

Modern CPUs have very good support for unaligned loads, but there's still a significant performance hit when a load crosses a cache-line boundary.

When using SSE, aligned loads can be folded into other operations as a memory operand. This improves code size and throughput slightly.

When using AVX, both kinds of loads can be folded into other operations. (AVX's default behaviour is to allow unaligned memory operands.) If aligned loads don't get folded and instead produce a movdqa or movaps, they will still fault on unaligned addresses. This applies even to the VEX encoding of 128-bit ops, which you get with the right compile options and no source changes to code using 128-bit intrinsics.

For getting started with intrinsics, I'd suggest always using the unaligned load/store intrinsics (but try to have your data aligned, at least in the common case). Switch to the aligned versions when performance tuning, if you're worried that accidentally-unaligned data is causing a problem.

Peter Cordes
  • 1
    "If the data is in fact aligned, an unaligned load / store will be identical performance to an aligned store" -- only for Sandy Bridge and later, I think ? (Not sure about the AMD story on this.) – Paul R Aug 18 '15 at 16:44
  • 2
    @PaulR: Actually Nehalem and later, according to the instruction tables on http://agner.org/optimize/. Same story on AMD Bulldozer or later. (Even K10 has no penalty for loads, but unaligned stores to aligned addresses are slower). So at this point, hardware with a perf penalty for using unaligned loads/stores on aligned data is thoroughly obsolete. The main reason to use aligned load intrinsics at this point is to allow folding loads into memory operands, because AVX isn't close to being universally available yet. (Even Silvermont doesn't have it.) – Peter Cordes Aug 18 '15 at 17:01
  • Thanks for the clarification - I wasn't 100% certain. – Paul R Aug 18 '15 at 20:18
  • Not yet discussed is that for **memcpy** given arbitrary pointers, in addition to whatever their respective alignments are, `*psrc` and `*pdst` may also not be ***mutually*** quad-aligned. In this case, there would be four options: a.) quad-align the source/load only, b.) quad-align the destination/store only, c.) quad-align neither, or d.) quad-align the source/load, shift the SIMD register, and quad-align the destination/store. Can you briefly comment on the trade-offs and any recommendations for this general case? Thanks. – Glenn Slayden Mar 11 '23 at 08:04
  • @GlennSlayden: On old CPUs like Core 2, `palignr` was worth using so you could do aligned loads and aligned stores, grabbing the correct 16-byte windows from 16-byte aligned loads. But on CPUs with efficient unaligned loads/stores, IIRC, it's normally best to align the destination if you can do that cheaply; many x86 CPUs can do 2 loads per clock (with split loads counting as 2 for some purposes) but only commit 1 store per clock to cache. Since stores probably don't coalesce in the store buffer, that might limit you to less than 1 vector store per clock cycle. – Peter Cordes Mar 11 '23 at 08:16
  • @GlennSlayden: But it's been a while since I looked at this so you'd have to double-check. It might be best not to spend extra instructions worrying about either alignment, depending on the use-case. See also [How can I accurately benchmark unaligned access speed on x86\_64?](https://stackoverflow.com/a/45129784) for a lot more detail on unaligned load / store. – Peter Cordes Mar 11 '23 at 08:18
  • @GlennSlayden: See also comments on [Unaligned load versus unaligned store](https://stackoverflow.com/q/40919766) for what I and others were thinking in 2016. – Peter Cordes Mar 11 '23 at 08:19