30

I always hear that unaligned accesses are bad because they will either cause runtime errors and crash the program or slow memory accesses down. However, I can't find any actual data on how much they will slow things down.

Suppose I'm on x86 and have some (yet unknown) share of unaligned accesses - what's the worst slowdown actually possible and how do I estimate it without eliminating all unaligned accesses and comparing run time of two versions of code?

sharptooth
  • 167,383
  • 100
  • 513
  • 979
  • Rule of thumb: unaligned reads on most architectures result in ~ 2x performance hit compared to an aligned read as it takes two read cycles to get the data and fix it up. Writes are a little more complex. – Paul R Sep 19 '12 at 09:12
  • 2
    related: [How can I accurately benchmark unaligned access speed on x86\_64](https://stackoverflow.com/a/45129784) has some specific details of the throughput and latency effects of cache-line splits and page splits on modern Intel. – Peter Cordes Jun 14 '20 at 20:33

3 Answers

23

It depends on the instruction(s). For most x86 SSE load/store instructions (excluding the unaligned variants), an unaligned address will cause a fault, which means it'll probably crash your program or lead to lots of round trips to your exception handler (which means almost all performance is lost). The unaligned load/store variants run at double the number of cycles, IIRC, as they perform partial reads/writes, so two operations are required to complete the access (unless you are lucky and it's in cache, which greatly reduces the penalty).

For general x86 load/store instructions, the penalty is speed: more cycles are required to do the read or write. Misalignment may also affect caching, leading to cache line splits and cache boundary straddling. It also breaks atomicity of reads and writes (which is guaranteed for all aligned reads/writes on x86; barriers and propagation are something else, but using a LOCK'ed instruction on unaligned data may cause an exception or greatly increase the already massive penalty the bus lock incurs), which is a no-no for concurrent programming.

Intel's x86 & x64 optimization manuals go into great detail about each of the aforementioned problems, their side effects and how to remedy them.

Agner Fog's optimization manuals should have the exact numbers you are looking for in terms of raw cycle throughput.

Necrolis
  • 25,836
  • 3
  • 63
  • 101
  • Had a look in the Agner Fog papers but could not find specific numbers. Can you point me at the right page/table? – Nitsan Wakart Jan 17 '13 at 17:48
  • @NitsanWakart: The unaligned SSE instructions are listed here: http://www.agner.org/optimize/instruction_tables.pdf; for the penalties to normal instructions you need to consult the appropriate Intel chapter in the developer manuals (Chapter 8 or 9 IIRC; at minimum, unaligned reads require double the cycles) – Necrolis Jan 17 '13 at 17:56
  • I'm specifically looking for penalties to MOV on unaligned (not cacheline straddling) access using recent (post Core2) cpus. In Agner's instruction tables cost I cannot find a penalty, and apart from general advice to align your data I can't find relevant reference in the Intel manuals. – Nitsan Wakart Jan 18 '13 at 10:04
  • 3
    @NitsanWakart: 4.1.1 from the Intel Architecture and Instruction set manual states that any unaligned access requires 2 loads/stores, which basically yields double the cycles (but this may vary based on other conditions): `A word or doubleword operand that crosses a 4-byte boundary or a quadword operand that crosses an 8-byte boundary is considered unaligned and requires two separate memory bus cycles for access.` – Necrolis Jan 18 '13 at 10:33
7

In general estimating speed on modern processors is extremely complicated. This is true not only for unaligned accesses but in general.

Modern processors have pipelined architectures, out of order and possibly parallel execution of instructions and many other things that may impact execution.

If the unaligned access is not supported, you get an exception. But if it is supported, you may or may not get a slowdown depending on a lot of factors. These factors include what other instructions you were executing both before and after the unaligned one (because the processor may be able to start fetching your data while executing previous instructions, or to go ahead and perform subsequent instructions while it waits).

Another very important difference happens if the unaligned access crosses a cacheline boundary. While in general a 2x access to the cache may happen for an unaligned access, the real slowdown is when the access crosses a cacheline boundary and causes a double cache miss. In the worst possible case, a 2-byte unaligned read may require the processor to flush two cachelines to memory and then read two cachelines from memory. That's a whole lot of data movement.

The general rule for optimization also applies here: first code, then measure, then if and only if there is a problem figure out a solution.

Analog File
  • 5,280
  • 20
  • 23
7

On some Intel micro-architectures, a load that is split by a cacheline boundary takes a dozen cycles longer than usual, and a load that is split by a page boundary takes over 200 cycles longer. It's bad enough that if loads are going to be consistently misaligned in a loop, it's worth doing two aligned loads and merging the results manually, even if palignr is not an option. Even SSE's unaligned loads won't save you, unless they are split exactly down the middle.

On AMD CPUs this was never a problem, and the problem mostly disappeared with Nehalem, but there are still a lot of Core 2s out there too.

harold
  • 61,398
  • 6
  • 86
  • 164