13

An older answer indicates that aarch64 supports unaligned reads/writes and mentions a performance cost, but it's unclear whether that answer covers only ALU operations or SIMD (128-bit register) operations as well.

Relative to aligned 128-bit NEON loads and stores, how much slower (if at all) are unaligned 128-bit NEON loads and stores on aarch64?

Are there separate instructions for unaligned SIMD loads and stores (as is the case with SSE2) or are the known-aligned loads/stores the same instructions as potentially-unaligned loads/stores?

hsivonen
  • I don't have arm64, but it's quite easy to test if you have one. Try reading [this](http://infocenter.arm.com/help/topic/com.arm.doc.ihi0053b/IHI0053B_arm_c_language_extensions_2013.pdf) and you can answer it yourself – Amiri Aug 17 '17 at 12:56
  • Using Rust to test, it seems that, at least on a Raspberry Pi 3, potentially-unaligned reads (LLVM built-in `memcpy`) from aligned addresses are as fast as aligned reads (LLVM pointer deref) from aligned addresses. – hsivonen Aug 20 '17 at 08:39
  • Looking at the assembly from LLVM, it looks like aligned and unaligned loads are the same instructions. – hsivonen Aug 23 '17 at 12:27
  • how about GCC ? – Amiri Aug 23 '17 at 12:29
  • My input is Rust and rustc uses LLVM instead of GCC as the codegen back end. – hsivonen Aug 23 '17 at 12:31
  • OK, I'm not familiar with Rust, I've just heard a music album `Rust Never Sleeps` – Amiri Aug 23 '17 at 12:32
  • Are you sure your `memcpy` test is actually doing unaligned accesses, instead of getting to an alignment boundary and then using aligned? If the dst is misaligned relative to the src, it has to do something, but that "something" could include an ALU shuffle so you're doing aligned loads and aligned stores (like x86 SSSE3 `palignr`) – Peter Cordes Aug 26 '17 at 07:52
  • @Amiri: Are you sure it's a good assumption that it will perform the same on different ARM cores? (Else it's not "easy to test".) That's not the case on x86, where Core2 has a performance penalty for using `movups` even on data that's aligned at runtime. But on Nehalem and later, unaligned-load instructions on data that happens to be aligned are exactly as fast as the alignment-required loads (like `movaps`). (This was only for vector loads; Core2 had efficient unaligned integer loads.) Cache-line-split and page-split penalties vary a lot across different x86 microarchitectures. – Peter Cordes Aug 26 '17 at 07:57
  • Looking at the clang headers, `memcpy` where one operand is the address of a vector-typed variable and the number of bytes to copy matches the byte length of the vector type is the LLVM idiom for requesting unaligned loads/stores. I've verified this from the assembly in the SSE2 case using Rust. Additionally, in the Aarch64 NEON case, the loads and stores that are generated are as simple as the loads/stores generated for aligned pointer derefs, and the resulting code doesn't crash. – hsivonen Aug 28 '17 at 10:08
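
For illustration, the idiom described in that last comment looks roughly like this in C (a sketch, not the asker's actual Rust code); clang and GCC typically lower the `memcpy` to a single unaligned 128-bit load on AArch64:

// Sketch of the memcpy-into-a-vector idiom: copying sizeof(vector) bytes
// into a vector-typed local requests a potentially-unaligned 128-bit load.
#include <arm_neon.h>
#include <string.h>

uint8x16_t load_u8x16_unaligned(const uint8_t *p) {
    uint8x16_t v;
    memcpy(&v, p, sizeof v);   /* no alignment assumption on p */
    return v;
}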

3 Answers

9

According to section 4.6, Load/Store Alignment, of the Cortex-A57 Software Optimization Guide:

The ARMv8-A architecture allows many types of load and store accesses to be arbitrarily aligned. The Cortex-A57 processor handles most unaligned accesses without performance penalties. However, there are cases which reduce bandwidth or incur additional latency, as described below:

  • Load operations that cross a cache-line (64-byte) boundary
  • Store operations that cross a 16-byte boundary

So it may depend on the processor you are using, out-of-order (A57, A72, A75) or in-order (A35, A53, A55). I didn't find an optimization guide for the in-order processors; however, they do have a hardware performance counter that you can use to check whether unaligned accesses affect performance:

    0x0F  UNALIGNED_LDST_RETIRED  Unaligned load-store

This can be used with the perf tool.
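
For example, you could count that event with perf's raw-event syntax (something like perf stat -e r0f ./your_program), or programmatically via perf_event_open, roughly as in the sketch below. The 0x0F number is the architecturally defined UNALIGNED_LDST_RETIRED event; whether it is implemented, and exactly what it counts, can vary by core.

// Sketch: count unaligned load/store retirements around a region of interest
// using raw PMU event 0x0F (UNALIGNED_LDST_RETIRED).
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_RAW;
    attr.size = sizeof attr;
    attr.config = 0x0F;              /* UNALIGNED_LDST_RETIRED */
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... run the code you want to measure here ... */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t count = 0;
    read(fd, &count, sizeof count);
    printf("unaligned loads/stores retired: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}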

There are no special instructions for unaligned accesses in AArch64.
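
By way of contrast with SSE2, here is a small sketch (assuming <emmintrin.h> on x86 and <arm_neon.h> on AArch64): x86 exposes separate aligned and unaligned load intrinsics backed by separate instructions, while NEON on AArch64 has a single load intrinsic that accepts any alignment.

#ifdef __SSE2__
#include <emmintrin.h>
/* x86: two distinct instructions behind two distinct intrinsics. */
__m128i sse2_aligned(const void *p)   { return _mm_load_si128((const __m128i *)p);  } /* MOVDQA: faults if p isn't 16-byte aligned */
__m128i sse2_unaligned(const void *p) { return _mm_loadu_si128((const __m128i *)p); } /* MOVDQU: any alignment */
#endif

#ifdef __aarch64__
#include <arm_neon.h>
/* AArch64: one load intrinsic, one instruction, for any alignment. */
uint8x16_t neon_load(const uint8_t *p) { return vld1q_u8(p); }
#endif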

Capybara
7

If a load/store has to be split or crosses a cache line, at least one extra cycle is required.

There are exhaustive tables that specify the number of cycles required for various alignments and numbers of registers for the Cortex-A8 (in-order) and Cortex-A9 (partially OoO). For example, vld1 with one register has a 1-cycle penalty for unaligned access vs. 64-bit-aligned access.

The Cortex-A55 (in-order) does up to 64-bit loads and 128-bit stores, and accordingly, section 3.3 of its optimization manual states a 1-cycle penalty is incurred for:

• Load operations that cross a 64-bit boundary
• 128-bit store operations that cross a 128-bit boundary

The Cortex-A75 (OoO) has penalties per section 5.4 of its optimization guide for:

• Load operations that cross a 64-bit boundary.
• In AArch64, all stores that cross a 128-bit boundary.
• In AArch32, all stores that cross a 64-bit boundary.

And as quoted in Capybara's answer, the A57 (OoO) has penalties for:

• Load operations that cross a cache-line (64-byte) boundary
• Store operations that cross a [128-bit] boundary

I'm somewhat skeptical that the A57 does not have a penalty for crossing 64-bit boundaries given that the A55 and A75 do. All of these have 64-byte cache lines; they should all have penalties for crossing cache lines too. Finally, note that split accesses that cross pages can have unpredictable behavior.

From some crude testing (without perf counters) with a Cavium ThunderX, there seems to be closer to a 2-cycle penalty, but that might be an additive effect of having back-to-back unaligned loads and stores in a loop.
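
A test of that shape can be as simple as the sketch below (illustrative only, not the exact code used): copy with 128-bit NEON loads and stores through aligned pointers and again through pointers offset by one byte, and compare the timings.

// Illustrative sketch: back-to-back 128-bit NEON loads/stores through aligned
// vs. deliberately misaligned pointers. Compile with -O2; a serious benchmark
// would use PMU counters and guard against the compiler outsmarting the loop.
#include <arm_neon.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define BUF  (1 << 16)
#define REPS 20000

uint8_t src[BUF + 16] __attribute__((__aligned__(16)));
uint8_t dst[BUF + 16] __attribute__((__aligned__(16)));

static double copy_loop(const uint8_t *s, uint8_t *d) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < REPS; r++)
        for (int i = 0; i < BUF; i += 16)
            vst1q_u8(d + i, vld1q_u8(s + i));  /* same instructions regardless of alignment */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

int main(void) {
    printf("aligned:    %.3f s\n", copy_loop(src, dst));
    printf("misaligned: %.3f s\n", copy_loop(src + 1, dst + 1));  /* every access crosses a 16-byte boundary */
    return 0;
}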


AArch64 NEON instructions don't distinguish between aligned and unaligned accesses (see LD1, for example). For AArch32 NEON, alignment is specified statically in the addressing (VLDn):

vld1.32 {d16-d17}, [r0]    ; no alignment
vld1.32 {d16-d17}, [r0@64] ; 64-bit aligned
vld1.32 {d16-d17}, [r0:64] ; 64-bit aligned (GAS syntax, since @ starts a comment)

I don't know whether aligned accesses without alignment qualifiers perform any slower than accesses with alignment qualifiers on recent chips running in AArch32 mode. Somewhat old documentation from ARM encourages using the qualifiers whenever possible. (By comparison, Intel refined its chips so that unaligned-load instructions on data that happens to be aligned perform the same as the alignment-required loads.)

If you're using intrinsics, MSVC has _ex-suffixed variants that accept the alignment. A reliable way to get GCC to emit an alignment qualifier is with __builtin_assume_aligned.

// MSVC: the _ex variants take the alignment as an extra argument
// (64 here, matching the :64 qualifier).
vld1q_u16_ex(addr, 64);

// GCC: __builtin_assume_aligned takes the alignment in bytes (8 bytes = 64 bits).
addr = (uint16_t*)__builtin_assume_aligned(addr, 8);
vld1q_u16(addr);
ZachB
1

Alignment hints are not used on AArch64; handling of unaligned accesses is transparent. If the pointer is aligned to the data type's size, the performance benefit is automatic.

If in doubt, for GCC/Clang use __attribute__((__aligned__(16))) on variable declarations.
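
For example, a minimal sketch: forcing 16-byte alignment on a buffer so that 128-bit accesses at multiples of 16 within it never split across a 16-byte (or cache-line) boundary.

#include <arm_neon.h>
#include <stdint.h>

/* 16-byte-aligned storage: 128-bit loads/stores at 16-byte offsets into
 * this buffer never cross a 16-byte boundary. */
uint8_t buf[256] __attribute__((__aligned__(16)));

uint8x16_t first16(void) {
    return vld1q_u8(buf);  /* same LD1/LDR encoding as a potentially-unaligned load */
}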