7

Many guides to low latency development discuss aligning memory allocations on particular address boundaries:

https://github.com/real-logic/simple-binary-encoding/wiki/Design-Principles#word-aligned-access

http://www.alexonlinux.com/aligned-vs-unaligned-memory-access

However, the second link is from 2008. Does aligning memory on address boundaries still provide a performance improvement on Intel CPUs in 2019? I thought Intel CPUs no longer incur a latency penalty for accessing unaligned addresses? If that's not the case, under what circumstances should this be done? Should I align every stack variable? Every class member variable?

Does anybody have any examples where they have found a significant performance improvement from aligning memory?

intrigued_66
  • Are you asking if cache lines still exist? About SIMD? Or is it "is there any performance hit ever?" (a: yes) and "what are all of the performance hits?" (a: too broad)? – Yakk - Adam Nevraumont Jan 05 '19 at 06:16
  • Some earlier results [here](https://stackoverflow.com/q/45128763/555045), anyway it is not so much misalignment that is the problem but crossing certain boundaries (e.g. 64 byte, 4K, 16 byte on AMD) – harold Jan 05 '19 at 06:17
  • A similar [question](https://stackoverflow.com/questions/18113995/performance-optimisations-of-x86-64-assembly-alignment-and-branch-prediction). – 1201ProgramAlarm Jan 05 '19 at 06:19
  • `Should I align every stack variable?` No. Most variables are not performance sensitive. – eerorika Jan 05 '19 at 06:31
  • `C++` implementations already align their variables. Even dynamic allocation is type-specific, and structures get padding to make the members aligned. The implementation gets to decide that on platforms that support unaligned memory access, but I think, unless you tell your compiler to optimize for space rather than speed, you should be good. – Galik Jan 05 '19 at 06:34

2 Answers

10

The penalties are usually small, but crossing a 4k page boundary on Intel CPUs before Skylake has a large penalty (~150 cycles). *How can I accurately benchmark unaligned access speed on x86_64* has some details on the actual effects of crossing a cache-line boundary or a 4k boundary. (This applies even if the load / store is inside one 2M or 1G hugepage, because the hardware can't know that until after it's started the process of checking the TLB twice.) e.g. in an array of `double` that was only 4-byte aligned, at a page boundary there'd be one `double` that was split evenly across two 4k pages. Same for every cache-line boundary.
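To make the math concrete, here's a small sketch (hypothetical buffer, assuming 64-byte cache lines and 4 KiB pages) that sets up a deliberately 4-byte-misaligned `double` array and reports which elements straddle a cache-line or page boundary. It only computes addresses; it doesn't benchmark anything:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    // Backing buffer big enough to span several 4 KiB pages.
    std::vector<char> buf(4 * 4096 + 64);
    // Round the start up to a 64-byte boundary, then offset by 4 so the
    // "doubles" are only 4-byte aligned.
    std::uintptr_t base  = reinterpret_cast<std::uintptr_t>(buf.data());
    std::uintptr_t first = ((base + 63) & ~std::uintptr_t(63)) + 4;

    for (int i = 0; i < 1024; ++i) {
        std::uintptr_t a = first + 8 * i;                  // address of element i
        bool line_split = (a / 64)   != ((a + 7) / 64);    // crosses a 64-byte line
        bool page_split = (a / 4096) != ((a + 7) / 4096);  // crosses a 4 KiB page
        if (page_split)
            std::printf("element %4d is split across two 4K pages\n", i);
        else if (line_split)
            std::printf("element %4d is split across two cache lines\n", i);
    }
}
```

With the 4-byte offset, one element in every eight crosses a cache-line boundary, and one in every 512 crosses a page boundary.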

Regular cache-line splits that don't cross a 4k page cost ~6 extra cycles of latency on Intel (a total of 11c on Skylake, vs. 4 or 5c for a normal L1d hit), and also have a throughput cost (which can matter in code that normally sustains close to 2 loads per clock).

Misalignment without crossing a 64-byte cache-line boundary has zero penalty on Intel. On AMD, cache lines are still 64 bytes, but there are relevant boundaries within cache lines at 32 bytes and maybe 16 on some CPUs.

> Should I align every stack variable?

No, the compiler already does that for you. x86-64 calling conventions maintain a 16-byte stack alignment so they can get any alignment up to that for free, including 8-byte int64_t and double arrays.

Also remember that most local variables are kept in registers for most of the time they're getting heavy use. Unless a variable is volatile, or you compile without optimization, the value doesn't have to be stored / reloaded between accesses.

The normal ABIs also require natural alignment (aligned to its size) for all the primitive types, so even inside structs and so on you will get alignment, and a single primitive type will never span a cache-line boundary. (Exception: i386 System V only requires 4-byte alignment for int64_t and double. Outside of structs, the compiler will choose to give them more alignment, but inside structs it can't change the layout rules. So declare your structs in an order that puts the 8-byte members first, or at least lay them out so they get 8-byte alignment. Maybe use alignas(8) on such struct members if you care about 32-bit code, if there aren't already any members that require that much alignment.)
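For example, a hypothetical struct (names made up for illustration) laid out with the 8-byte members first, plus an `alignas(8)` variant for the i386 case described above:

```cpp
#include <cstdint>

// Members ordered large-to-small: the 8-byte fields come first, so they are
// 8-byte aligned within the struct and never straddle a cache-line boundary
// (as long as the struct object itself has its natural alignment).
struct Order {
    double   price;      // offset 0
    int64_t  quantity;   // offset 8
    int32_t  id;         // offset 16
    int16_t  flags;      // offset 20; 2 bytes of tail padding -> sizeof == 24
};

// Holds for x86-64; a plain i386 System V build would only give the struct
// 4-byte alignment, which is what the alignas version below is for.
static_assert(alignof(Order) == 8, "8-byte members are naturally aligned");

// If 32-bit i386 System V matters (int64_t/double only need 4-byte alignment
// inside structs there), alignas(8) forces the stronger alignment.
struct OrderI386Safe {
    alignas(8) double  price;
    alignas(8) int64_t quantity;
    int32_t            id;
};
```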

The x86-64 System V ABI (all non-Windows platforms) requires aligning arrays by 16 if they have automatic or static storage outside of a struct. alignof(max_align_t) is 16 on x86-64 SysV, so malloc / new return 16-byte aligned memory for dynamic allocation. gcc targeting Windows also aligns stack arrays if it auto-vectorizes over them in that function.
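A quick sketch to check what your implementation actually guarantees (the printed values are platform-dependent; 16 is what you'd expect on x86-64 SysV):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

int main() {
    // malloc/new only promise alignment suitable for max_align_t.
    std::printf("alignof(std::max_align_t) = %zu\n", alignof(std::max_align_t));

    void* p = std::malloc(1024);
    std::printf("malloc result 16-byte aligned? %d\n",
                static_cast<int>(reinterpret_cast<std::uintptr_t>(p) % 16 == 0));
    std::free(p);
}
```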


(If you cause undefined behaviour by violating the ABI's alignment requirements, it often doesn't make any performance difference. It usually doesn't cause correctness problems on x86, but it can lead to faults for SIMD types, and with auto-vectorization of scalar types. e.g. Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?. So if you intentionally misalign data, make sure you don't access it with any pointer wider than char*. e.g. use memcpy(&tmp, buf, 8) with uint64_t tmp to do an unaligned load. gcc can autovectorize through that, IIRC.)
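A minimal sketch of that `memcpy` idiom (the function name is made up); on x86-64, compilers turn this into a single unaligned `mov`, not an actual `memcpy` call:

```cpp
#include <cstdint>
#include <cstring>

// Read a uint64_t from a possibly-unaligned position in a byte buffer.
// The memcpy tells the compiler "this may be unaligned" without any UB
// from dereferencing a misaligned uint64_t*.
inline uint64_t load_u64_unaligned(const char* buf) {
    uint64_t tmp;
    std::memcpy(&tmp, buf, sizeof(tmp));
    return tmp;
}
```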


You might sometimes want to alignas(32) or 64 for large arrays, if you compile with AVX or AVX512 enabled. For a SIMD loop over a big array (that doesn't fit in L2 or L1d cache), with AVX/AVX2 (32-byte vectors) there's usually near-zero effect from making sure it's aligned by 32 on Intel Haswell/Skylake. Memory bottlenecks in data coming from L3 or DRAM will give the core's load/store units and L1d cache time to do multiple accesses under the hood, even if every other load/store crosses a cache-line boundary.
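If you do want that over-alignment, `alignas` covers static/automatic arrays and `std::aligned_alloc` (C++17) covers dynamic ones; a sketch, with made-up array names:

```cpp
#include <cstdint>
#include <cstdio>
#include <cstdlib>

// Static array over-aligned for 64-byte AVX-512 vectors (32 would cover AVX/AVX2).
alignas(64) static float lut[1024];

int main() {
    // Dynamic allocation: std::aligned_alloc requires the size to be a
    // multiple of the alignment. (On MSVC, use _aligned_malloc instead.)
    std::size_t n = 1u << 20;
    float* big = static_cast<float*>(std::aligned_alloc(64, n * sizeof(float)));
    if (!big) return 1;

    // ... SIMD loops over `lut` and `big` go here ...
    std::printf("lut starts at a 64-byte boundary: %d\n",
                static_cast<int>(reinterpret_cast<std::uintptr_t>(lut) % 64 == 0));

    std::free(big);
}
```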

But with AVX512 on Skylake-server, there is a significant effect in practice for 64-byte alignment of arrays, even with arrays that are coming from L3 cache or maybe DRAM. I forget the details, it's been a while since I looked at an example, but maybe 10 to 15% even for a memory-bound loop? Every 64-byte vector load and store will cross a 64-byte cache line boundary if they aren't aligned.

Depending on the loop, you can handle under-aligned inputs by doing a first maybe-unaligned vector, then looping over aligned vectors until the last aligned vector. Another possibly-overlapping vector that goes to the end of the array can handle the last few bytes. This works great for a copy-and-process loop where it's ok to re-copy and re-process the same elements in the overlap, but there are other techniques you can use for other cases, e.g. a scalar loop up to an alignment boundary, narrower vectors, or masking. If your compiler is auto-vectorizing, it's up to the compiler to choose. If you're manually vectorizing with intrinsics, you get to / have to choose. If arrays are normally aligned, it's a good idea to just use unaligned loads (which have no penalty if the pointers are aligned at runtime), and let the hardware handle the rare cases of unaligned inputs so you don't have any software overhead on aligned inputs.
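For concreteness, a rough intrinsics sketch of that first-vector / aligned-body / overlapping-final-vector pattern (the function and the `dst[i] = 2*src[i]` operation are made up; assumes AVX, `n >= 8`, non-overlapping `dst`/`src`, and pointers with at least natural `float` alignment):

```cpp
#include <immintrin.h>  // compile with -mavx (or -march=native)
#include <cstddef>
#include <cstdint>

// Hypothetical copy-and-process kernel: dst[i] = 2 * src[i] for i in [0, n).
// One possibly-unaligned vector at the start, an aligned-store main loop,
// and a final vector that ends exactly at the end of the array. Overlapping
// vectors redo some elements, which is fine for this idempotent operation.
void scale2(float* dst, const float* src, std::size_t n) {
    const __m256 two = _mm256_set1_ps(2.0f);

    // 1) First vector: unaligned load + unaligned store.
    _mm256_storeu_ps(dst, _mm256_mul_ps(two, _mm256_loadu_ps(src)));

    // 2) Skip ahead to the next 32-byte boundary of dst, then use aligned
    //    stores (the loads stay unaligned-tolerant; that's usually free).
    std::size_t i =
        (32 - (reinterpret_cast<std::uintptr_t>(dst) & 31)) / sizeof(float);
    for (; i + 8 <= n; i += 8)
        _mm256_store_ps(dst + i, _mm256_mul_ps(two, _mm256_loadu_ps(src + i)));

    // 3) Final possibly-overlapping vector covering the last 8 elements.
    _mm256_storeu_ps(dst + n - 8,
                     _mm256_mul_ps(two, _mm256_loadu_ps(src + n - 8)));
}
```

If the compiler is auto-vectorizing instead, it emits its own version of this prologue/epilogue; the sketch just shows what the manual-intrinsics choice looks like.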

Peter Cordes
  • What does ABI stand for? – Isaak Eriksson Jan 05 '19 at 07:25
  • @IsaakEriksson: Application Binary Interface. A calling convention is part of an ABI, but an ABI also includes the rules for what size and minimum alignment each type has (e.g. that `long` is 64 bits in x86-64 System V, but `long` is 32 bits in Windows x64), and struct-packing rules. Also any other requirements like metadata for stack unwinding on exceptions, or frame-pointer / stack-frame layout rules for the same purpose, and how the GOT / PLT works for dynamic linking, etc. etc. See [Where is the x86-64 System V ABI documented?](https://stackoverflow.com/q/18133812) – Peter Cordes Jan 05 '19 at 07:29
  • It is important to consider not just “the latency” for an unaligned load or store but also the fact that it may use more resources than an aligned load or store, such as spots in internal queues or shifters to move the unaligned data. Even though the hardware might be able to get one such load or store done with the same latency as an aligned load or store, it might not be able to sustain the same throughput as with aligned loads or stores, since more resources are used. – Eric Postpischil Jan 05 '19 at 13:50
  • @EricPostpischil: on modern x86, that's only the case for cache-line splits. (Or on AMD, for crossing a 32 or maybe 16-byte boundary). Within a cache line, a load port can handle everything for an unaligned load. But yes there is a throughput cost as well for cache-line split loads. They each take two L1d read ports, cutting max throughput in half. There's also a perf counter on Intel for `ld_blocks.no_sr`: *[The number of times that split load operations are temporarily blocked because all resources for handling the split accesses are in use]*; there are limited split-load buffers. – Peter Cordes Jan 05 '19 at 18:15
  • That might be true on other ISAs like ARM or MIPS (the modern versions of which do support unaligned loads), but this is an `[x86]` question. – Peter Cordes Jan 05 '19 at 18:18
  • What do you mean by a 4k boundary? Is this related to cache associativity? Is there a hidden "boundary" between 64 bytes and 32kb (the L1 data cache size), which you say is 4kb? – intrigued_66 Jan 07 '19 at 01:56
  • @mezamorphic: 4kiB is the page size. If you load or store 4 bytes from an address that's 1 byte less than a multiple of 4096, the first byte comes from one page, the next 3 bytes from another page. Same for a cache line and one less than a multiple of 64. e.g. `(addr & -64) - 1` would create a misaligned address. – Peter Cordes Jan 07 '19 at 03:21
-1

Of course there is a penalty. You have to align struct members on word boundaries to maximize performance.