10

Are loads of variables that are aligned on word boundaries faster than unaligned load operations on x86/64 (Intel/AMD 64 bit) processors?

A colleague of mine argues that unaligned loads are slow and should be avoided. He cites the padding of items to word boundaries in structs as proof that unaligned loads are slow. Example:

struct A {
  char a;
  uint64_t b;
};

The struct A usually has a size of 16 bytes.

On the other hand, the documentation of the Snappy compressor states that Snappy assumes that "unaligned 32- and 64-bit loads and stores are cheap". According to the source code this is true of Intel 32 and 64-bit processors.

So: What is the truth here? If and by how much are unaligned loads slower? Under which circumstances?

Jeffrey Bosboom
dmeister
  • The default structure packing is 8, so the A::b member is in fact aligned. Misaligned members can straddle the cache line and that's always expensive. – Hans Passant Feb 20 '12 at 16:19
  • Related: some latency and throughput timing results from Skylake and Haswell: [How can I accurately benchmark unaligned access speed on x86\_64](https://stackoverflow.com/q/45128763) – Peter Cordes Dec 20 '18 at 18:51

5 Answers

7

A random guy on the Internet says that on the 486, an aligned 32-bit access takes one cycle. An unaligned 32-bit access that spans quads but is within the same cache line takes four cycles. One that spans multiple cache lines can take an extra six to twelve cycles.

Given that an unaligned access requires accessing multiple quads of memory, pretty much by definition, I'm not at all surprised by this. I'd imagine that better caching performance on modern processors makes the cost a little less bad, but it's still something to be avoided.

(Incidentally, if your code has any pretensions to portability... ia32 and descendants are pretty much the only modern architectures that support unaligned accesses at all. ARM, for example, can vary between throwing an exception, emulating the access in software, or just loading the wrong value, depending on OS!)

Update: Here's someone who actually went and measured it. On his hardware he reckons unaligned access to be half as fast as aligned. Go try it for yourself...

David Given
  • Some ARM variants cause an exception on unaligned accesses, but others will decompose them into smaller parts. On the Cortex M3, a word(32) load/store on a halfword(16) boundary will be decomposed into two halfword parts; a word load/store on a byte boundary will be decomposed into three: two byte accesses and a word access. Note that not all instructions allow unaligned accesses. – supercat Feb 20 '12 at 17:12
  • 2
    On recent Intel x86 (Nehalem and newer), unaligned loads and stores only have a penalty when you cross a cache line (or worse, a page line). See http://agner.org/optimize/ for the microarch guide with the details. It can be worth adding a prologue to loops, to do unaligned until you reach an aligned address, so the main loop runs on aligned data, if you're processing every byte. – Peter Cordes Sep 01 '15 at 20:28
  • 3
    This is old info, unaligned loads and stores have a very small penalty now: http://lemire.me/blog/2012/05/31/data-alignment-for-speed-myth-or-reality/ – Eloff Feb 04 '16 at 05:11
  • [Modern MIPS (MIPS64 / MIPS64 r6)](https://en.wikipedia.org/wiki/MIPS_architecture#MIPS32/MIPS64_Release_6) removes the unaligned split load/store instructions, and requires that implementations support unaligned addresses for the normal `lw` / `sw` instructions. As transistor budgets grow even for embedded CPUs, more and more of them support unaligned accesses efficiently. It's useful for compression algorithms, among other things. – Peter Cordes Jan 22 '18 at 07:37
5

Aligned loads and stores are faster; two excerpts from the Intel Optimization Manual cleanly point this out:

3.6 OPTIMIZING MEMORY ACCESSES

Align data, paying attention to data layout and stack alignment

...

Alignment and forwarding problems are among the most common sources of large delays on processors based on Intel NetBurst microarchitecture.

AND

3.6.4 Alignment

Alignment of data concerns all kinds of variables:

• Dynamically allocated variables

• Members of a data structure

• Global or local variables

• Parameters passed on the stack

Misaligned data access can incur significant performance penalties. This is particularly true for cache line splits.

Following that part in 3.6.4, there is a nice rule for compiler developers:

Assembly/Compiler Coding Rule 45. (H impact, H generality) Align data on natural operand size address boundaries. If the data will be accessed with vector instruction loads and stores, align the data on 16-byte boundaries.

followed by a listing of alignment rules and another gem in 3.6.6

User/Source Coding Rule 6. (H impact, M generality) Pad data structures defined in the source code so that every data element is aligned to a natural operand size address boundary.

Both rules are marked as high impact, meaning they can greatly change performance. Along with the excerpts, the rest of Section 3.6 is filled with other reasons to naturally align your data. It's well worth any developer's time to read these manuals, if only to understand the hardware they are working on.

Necrolis
  • 2
    If you can guarantee that your unaligned load/store doesn't cross a cache line boundary, there's no penalty on modern Intel. (On modern AMD, maybe a 32-byte or 16-byte boundary). Usually by far the easiest way to avoid cache-line splits is natural alignment, though, but if you have a 64-byte aligned struct, then having misaligned fields within it is fine. – Peter Cordes Jan 22 '18 at 07:40
2

To fix up a misaligned read, the processor needs to do two aligned reads and fix up the result. This is slower than having to do one read and no fix-ups.

The Snappy code has special reasons for exploiting unaligned access. It will work on x86_64; it won't work on architectures where unaligned access is not an option, and it will work slowly where fixing up unaligned access is a system call or a similarly expensive operation. (On DEC Alpha, there was a mechanism approximately equivalent to a system call for fixing up unaligned access, and you had to turn it on for your program.)

Using unaligned access is an informed decision that the authors of Snappy made. It does not make it sensible for everyone to emulate it. Compiler writers would be excoriated for the poor performance of their code if they used it by default, for example.

Jonathan Leffler
  • Isn't the question about "x86/64 (Intel/AMD 64 bit) processors?" Do x86 processors really do two aligned reads per unaligned read? Is there documentation supporting that claim? – SO_fix_the_vote_sorting_bug May 24 '21 at 15:29
  • I'm mostly discussing processors other than Intel/AMD — such as SPARC and PowerPC. And the rules there have probably changed too — PowerPC was a big-endian system but now runs as little-endian, or can be configured to run as little-endian. – Jonathan Leffler May 24 '21 at 16:21
2

Unaligned loads/stores should never be used, but the reason is not performance. The reason is that the C language forbids them (both via the alignment rules and the aliasing rules), and they don't work on many systems without extremely slow emulation code - code which may also break the C11 memory model needed for proper behavior of multi-threaded code, unless it's done on a purely byte-by-byte level.

As for x86 and x86_64, for most operations (except some SSE instructions), misaligned load and store are allowed, but that doesn't mean they're as fast as correct accesses. It just means the CPU does the emulation for you, and does it somewhat more efficiently than you could do yourself. As an example, a memcpy-type loop that's doing misaligned word-size reads and writes will be moderately slower than the same memcpy doing aligned access, but it will also be faster than writing your own byte-by-byte copy loop.

R.. GitHub STOP HELPING ICE
  • 1
    Suppose one wishes to copy 64Kbytes of data where the source and destination are aligned differently. What would you consider to be the tradeoffs between (1) copy as bytes; (2) align either the source or destination, and copy as longwords with one aligned and one unaligned pointer; (3) align either the source or destination, and manipulate that as words and the other part as bytes or halfwords; (4) manipulate both source and destination as words, using bit-shifting as needed to combine source and destination. Bear in mind that what's fast on today's CPU's may be slow on tomorrow's. – supercat Feb 20 '12 at 17:16
  • Unless you're the one implementing the system, I would use the system's `memcpy`. It's likely to be using whatever is known to be fastest, and perhaps more importantly you don't have to worry that the compiler will figure out you broke the aliasing rules and thereby break your code. – R.. GitHub STOP HELPING ICE Feb 20 '12 at 17:20
  • @R: Fair point about memcpy in the case where one will be simply copying data. What if one will be doing something just a little more complicated, e.g. an equivalent to--assuming bytes-- `while(n--) *dest++ ^= *src++;` If both have identical alignment, clearly using words for most of the operation should allow a major speedup, but what would be the most reasonable pattern for coding such a thing? – supercat Feb 20 '12 at 17:23
  • 1
    Again, due to aliasing rules, you're going to have to tiptoe around manipulating data as anything other than its actual type or a `char` type, but for the time being it seems safe enough to do that either by ensuring the function is external and not subject to LTO, or perhaps using `volatile`. With that problem solved, for most purposes I would manipulate the buffers as `size_t` if they're aligned, and `unsigned char` otherwise. You could probably get some more performance doing clever stuff when the alignment doesn't match, but I generally favor simplicity unless the performance is critical. – R.. GitHub STOP HELPING ICE Feb 20 '12 at 17:27
  • Also note that using `size_t` is not purely portable, as it could have padding bits. In that case you might prefer to pick the largest `uintXX_t` smaller than or equal in size to `size_t` using some preprocessor/`limits.h` trickery. – R.. GitHub STOP HELPING ICE Feb 20 '12 at 17:29
  • @R: My thought was about graphics routines using 8-bit pixels. If things can be located on single-pixel boundaries, graphical block-transfer operations are often going to have to deal with mixed alignments. Such operations do move around a fair amount of data, and they do it frequently, so it would be desirable for such operations to be pretty fast. Your point about aliasing is interesting; if one aligns source and destination, writing each destination word using bit-sliced combination of two source words, would there be aliasing issues, if everything was written using either bytes or words? – supercat Feb 20 '12 at 17:38
  • The aliasing issue is just that you can't access data of one type as another, and the compiler is free to assume you're not doing that, which can cause it to misorder reads/writes if the assumption is broken. However, `char` types are allowed to alias anything, and if the only 2 ways you're accessing the data are as a `char` type and a `uintXX_t` type with no padding, I think the aliasing issues disappear and it's just the alignment issue that remains. – R.. GitHub STOP HELPING ICE Feb 20 '12 at 19:03
  • @R: My particular thought was how to code something like a graphics BLiT routine for an ARM system (presently Cortex M3), if one assumes source operands are aligned and destination (the screen) may or may not be. On some ARM systems, unaligned accesses are permissible. If the source and mask are read into registers, an unaligned load, cookie-cut and unaligned store would have two penalty cycles if halfword-aligned; four penalty cycles if not. If source and mask have to be shifted and munged, that would add four cycles per word, but would be portable to more ARM devices. – supercat Feb 20 '12 at 20:32
  • The proper way to express an unaligned `int` load/store to a `char[]` is with `memcpy`. Modern compilers targeting x86 will compile it to a single load or store instruction, and can even auto-vectorize. Even on x86, unaligned pointers can cause breakage with auto-vectorization: https://stackoverflow.com/questions/47510783/why-does-unaligned-access-to-mmaped-memory-sometimes-segfault-on-amd64. So yes, your overall point that it's still breaking C11 rules (by having less alignment than `alignof(int)`) applies even to x86, even though it happens to work most of the time there. – Peter Cordes Jan 22 '18 at 07:45
0

Unaligned 32 and 64 bit access is NOT cheap.

I ran tests to verify this. My results on a Core i5 M460 (64-bit) were as follows: the fastest integer type was 32 bits wide. 64-bit alignment was slightly slower, but almost the same. 16-bit and 8-bit alignment were both noticeably slower than 32- and 64-bit alignment, with 16-bit alignment slower than 8-bit. By far the slowest form of access was unaligned 32-bit access, which was 3.5 times slower than aligned 32-bit access (the fastest of them); unaligned 32-bit access was even 40% slower than unaligned 64-bit access.

Results: https://github.com/mkschreder/align-test/blob/master/results-i5-64bit.jpg?raw=true

Source code: https://github.com/mkschreder/align-test

Martin