Why data with smaller size than CPU word size need to be aligned at multiple of its size?

Question

Let's assume a 32-bit CPU and a 2-byte short. I know a 4-byte int needs to be aligned at an address that is a multiple of 4 to avoid extra reads.

Questions:

1. If a short is stored at 0x1, the CPU can still read it from 0x0 in one operation. So why do shorts need to be aligned at an address that is a multiple of 2?
2. If a short is stored at 0x2, why would it be considered aligned and efficient, since the CPU can only read from 0x0 and discard the first two bytes?

There is a very similar question; however, its answer only says that the alignment requirement for a short is the same in a struct as for a standalone variable. There is also a comment with 2 upvotes saying:

On many machines, accessing an N-byte quantity (at least for N in {1, 2, 4, 8, 16}) works most efficiently when the quantity is N-byte aligned. It's the way life is; get used to it because I doubt that chip manufacturers are going to change it just because you think it isn't the way it should be.

But why?

If the short starts at 0x3, it needs to read 1 byte from the first 4-byte unit and one byte from the next, and the write-back needs to write to both 4-byte units too. — Jonathan Leffler, Mar 13 '23 at 14:56
Many 32 bit CPUs can read misaligned data, but if so it usually leads to slower code. — Lundin, Mar 13 '23 at 15:01
@JonathanLeffler The question asks when the short starts at 0x1. — user762750, Mar 13 '23 at 15:01
@ScottHunter I think the question lies between hardware and software. Besides, I can find many questions about alignment on stackoverflow. — user762750, Mar 13 '23 at 15:04
Yeah (the question asks about 0x1), but that's not the problem case (or not the only problem case). The compiler has to deal with all cases, and by starting the storage for `short` on an even boundary, it can always access the data in one read. If it is on an odd boundary, it has to look at the current pointer and decide whether to do one or two reads and/or one or two writes, and the performance will be abysmal. So the compiler ensures that `short` addresses are on an even boundary so that it doesn't have to do the extra work. (Packed structures have different rules.) — Jonathan Leffler, Mar 13 '23 at 15:06
An aligned short (2-byte) is always inside a single int (4-byte), which is a pretty nice guarantee to have. If you allow for unaligned shorts, you are really only allowing alignment at 1 mod 4, since a short at 3 mod 4 would cross a 4-byte boundary. So it would be confusing. Finally, you can avoid storing the lowest bit entirely in the instruction immediate, thereby widening the range of relative offsets. — Margaret Bloom, Mar 13 '23 at 15:07
It might be just as fast to read a `short` from offset `1` within a 4-byte word on some CPUs. Or it might have extra latency to set up the shifting (after external bus access, or after cache access to that word.) Semi-related: [Are there any modern CPUs where a cached byte store is actually slower than a word store?](https://stackoverflow.com/q/54217528) also has some mention of byte loads and unaligned-word loads. — Peter Cordes, Mar 13 '23 at 15:08
On x86 specifically, unaligned 16-bit accesses fully contained within an aligned 4-byte chunk are guaranteed atomic ([Why is integer assignment on a naturally aligned variable atomic on x86?](https://stackoverflow.com/a/36685056)) even for uncached access. — Peter Cordes, Mar 13 '23 at 15:08
Hardware might not optimize for the special case of a 16-bit within a 32-bit, maybe just detecting the low bit set in the address and using a fallback strategy that takes one or more extra cycles. For C specifically, the alignment rules are simplistic and err on the side of being always aligned, partly so that any object can be part of an array. e.g. C doesn't let you have a single `short` at an alignment where you couldn't have `short[2]`, which would cross a 32-bit boundary. — Peter Cordes, Mar 13 '23 at 15:10
*I know 4 byte int need to be aligned at address multiple of 4 to avoid extra reads.* - Depends on the CPU. On an Intel CPU since Pentium Pro, any 4-byte load or store that's fully contained inside a 64-byte cache line (or 32-byte on older CPUs) has full performance (and is even guaranteed atomic, unlike on AMD). Modulo some cache-bank conflict effects between multiple loads in the same cycle on early Sandybridge-family which can be exacerbated by loads that span 16-byte boundaries. But yes, that statement is true for simple memory / cache access hardware. — Peter Cordes, Mar 13 '23 at 15:16
Anyway, an important part of the idea here is that an aligned half-word only has two possible shift-counts within a 32-bit word, so the hardware for MIPS `lh` (load half) can just mux between two possibilities, not three (or four if it allows crossing wider boundaries.) — Peter Cordes, Mar 13 '23 at 15:22
It is necessary only for speed. The CPU can read the data starting from any address. — i486, Mar 13 '23 at 15:25
@PeterCordes @JonathanLeffler Could you compile your comments into an answer for each question or restructure the current answer from supercat? Thanks. — user762750, Mar 13 '23 at 15:30
@i486: This question isn't tagged x86; not all CPUs have unaligned-load instructions at all, so doing every load and store in a way that supports the possibility of being unaligned could be a huge penalty, taking multiple instructions. And for thread-safety, maybe requiring compare/branch on the address if you want aligned stores to only touch the one word you're storing to. — Peter Cordes, Mar 13 '23 at 15:34
@i486 *It is necessary only for speed. The CPU can read the data starting from any address.* Not true [even on x86 systems](https://stackoverflow.com/a/46790815/4756299) — Andrew Henle, Mar 13 '23 at 16:47
@user762750 "Why data with smaller size than CPU word size need to be aligned at multiple of its size?" --> to be clear, it does not _need_ to be aligned per the C standard. C allows it and many implementations do so. — chux - Reinstate Monica, Mar 13 '23 at 17:08
@chux-ReinstateMonica: You're saying "many" C implementations have `alignof(short) == 1`? That's surprising. Neither GCC nor clang have that; IDK if MSVC defines that behaviour when auto-vectorizing for code with misaligned `short` elements or any of the other cases where GCC and clang don't do what code with misalignment-UB hoped they would. [Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?](https://stackoverflow.com/q/47510783) has an example and links to two blogs about other cases with different breakage mechanisms. — Peter Cordes, Mar 13 '23 at 17:50
@AndrewHenle A lot of philosophy / rocket science but I think the question is at more basic level. For that reason my comment is this. — i486, Mar 13 '23 at 20:40
@i486 At a basic level, "The CPU can read the data starting from any address" is flat-out wrong on most CPUs, and it's even wrong on X86 systems in many cases. — Andrew Henle, Mar 13 '23 at 21:18
@AndrewHenle Give me one example for x86 because I cannot guess. — i486, Mar 13 '23 at 22:26
@i486: Andrew Henle is talking about `movaps` or `movdqa` (https://www.felixcloutier.com/x86/movdqa:vmovdqa32:vmovdqa64), the 16-byte-alignment-required instruction that faulted in the link from his first comment to you, same as [Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?](//stackoverflow.com/q/47510783). Also for 16-byte memory source operands to legacy-SSE instructions other than `movdqu` / `movups` / `movupd`. (AVX changed the default to unaligned, so `vpaddb xmm0, xmm1, [rdi]` is an unaligned load, unlike `paddb xmm0, [rdi]`.) x86 didn't end with 486 :P — Peter Cordes, Mar 14 '23 at 00:27

score 5 · Accepted Answer · edited Mar 14 '23 at 12:20


Most machines are designed with memory that is addressed using a combination of a "word" address that identifies a group of two or more bytes, along with byte-select lines that indicate which bytes within a word are being accessed. When performing operations that are a word or smaller, all bytes within a word can be accessed simultaneously. Operations larger than a word will always need to be split into multiple operations, and most CPUs won't care about alignment of any chunks larger than the word size. Some CPUs may be able to split word-size-or-smaller operations that would require accessing parts of two consecutive words into smaller operations, but that ability is not universal.
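As a rough illustration of the word-address plus byte-select scheme described above, here is a minimal sketch in C. It assumes a 32-bit (4-byte) word, and the names `WORD_BYTES`, `struct bus_access` and `split_access` are hypothetical, made up for this example rather than taken from any real hardware interface. It models only which words and byte lanes an access touches, not how a particular CPU reacts to a misaligned address.

```c
#include <assert.h>
#include <stdint.h>

#define WORD_BYTES 4u /* assumption: the memory word is 4 bytes wide */

/* Hypothetical description of one memory transaction. */
struct bus_access {
    uint32_t word_addr;    /* byte address divided by WORD_BYTES */
    unsigned byte_enables; /* bit i set => byte lane i of the word is selected */
};

/* Split an access of `size` bytes (1..WORD_BYTES) starting at byte address
   `addr` into word transactions. Fills `out` and returns how many
   transactions were needed: 1 if the access fits in one word, 2 if it
   straddles a word boundary. */
int split_access(uint32_t addr, unsigned size, struct bus_access out[2])
{
    int count = 0;

    assert(size >= 1 && size <= WORD_BYTES);

    while (size > 0) {
        unsigned offset = addr % WORD_BYTES;  /* first byte lane used */
        unsigned chunk = WORD_BYTES - offset; /* bytes left in this word */
        if (chunk > size)
            chunk = size;

        out[count].word_addr = addr / WORD_BYTES;
        out[count].byte_enables = ((1u << chunk) - 1u) << offset;
        count++;

        addr += chunk;
        size -= chunk;
    }
    return count;
}
```

Under these assumptions, a 2-byte access at offset 0, 1 or 2 within a word needs one transaction (byte enables `0x3`, `0x6` and `0xC` respectively), while offset 3 needs two. Only offsets 0 and 2 are multiples of the size, which is why requiring 2-byte alignment rules out the straddling case without the hardware having to special-case offset 1.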

The standard guarantees that for any power of two, N, a multiple-of-N offset into an allocation which is suitably aligned for N-byte objects will yield an address which is suitably aligned for N-byte objects. It does not guarantee that platforms with smaller word sizes will tolerate looser alignments because:

1. The Standard deliberately waives jurisdiction over non-portable constructs.
2. Implementations which want to offer stronger guarantees are free to do so, and compiler writers that want to uphold the Spirit of C will do so absent any reason to do otherwise.
3. Even on 8-bit platforms, there may be advantages to requiring word alignment, even though ironically I'm not aware of any implementations ever doing so on the 8-bit platforms where it would have been most useful. For example, on the Z80, the common way to load DE with a 16-bit value whose address is in HL would be:
```
 ld e,(hl)   ; 7 cycles: low byte
 inc hl      ; 6 cycles: 16-bit increment
 ld d,(hl)   ; 7 cycles: high byte
```

but if HL was known to be even, the second instruction could be replaced by `inc l`, which would be two cycles faster, cutting the total time from 20 cycles to 18. Not a huge performance win, but if an application wouldn't ever use odd addresses for word-sized values, it would represent "low-hanging fruit".
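The reason an even address makes the shorter increment safe is that adding 1 to an even low byte can never carry into the high byte. The sketch below models this in C; the function names `inc_hl`, `inc_l` and `inc_l_safe_for_even` are hypothetical helpers written for this illustration, and they model only the register arithmetic, not flags or timing.

```c
#include <stdint.h>

/* Models a full 16-bit increment of the HL register pair. */
uint16_t inc_hl(uint16_t hl)
{
    return (uint16_t)(hl + 1u);
}

/* Models an increment of the low byte only: H is untouched, so a carry
   out of L (when L is 0xFF) is lost. */
uint16_t inc_l(uint16_t hl)
{
    uint8_t l = (uint8_t)((hl & 0xFFu) + 1u);
    return (uint16_t)((hl & 0xFF00u) | l);
}

/* Returns 1 if the low-byte increment matches the 16-bit increment for
   every even 16-bit address, 0 otherwise. */
int inc_l_safe_for_even(void)
{
    for (uint32_t a = 0; a <= 0xFFFFu; a += 2) {
        if (inc_l((uint16_t)a) != inc_hl((uint16_t)a))
            return 0;
    }
    return 1;
}
```

The two increments differ only when the low byte is `0xFF`, which is odd, so the check passes for every even address. For an odd address such as `0x12FF`, the low-byte increment would wrap to `0x1200` instead of `0x1300`, which is why the substitution requires word-sized values to be aligned.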

edited Mar 14 '23 at 12:20

Peter Cordes


answered Mar 13 '23 at 15:10

supercat


using z80 as an example in 2023 – 0___________ Mar 13 '23 at 15:15

@0___________: Would you prefer AVR, an 8-bit RISC microcontroller designed to be a decent compiler target for C compilers, since it was designed many years after 8080 / Z80? But that doesn't work as an example because AVR allows `Z` and `Z+1` addressing modes, not needing a separate increment: https://godbolt.org/z/4aa5WE8cP – Peter Cordes Mar 13 '23 at 15:18
Thanks for the answer. Could you restructure it to address each question separately? – user762750 Mar 13 '23 at 15:19
@PeterCordes AVRs do not have such ancient architectural bottlenecks like ancient z80 where designers were trying to save every possible transistor – 0___________ Mar 13 '23 at 15:20
Efficient code or not, it would be ridiculous to force alignment upon these low-end parts, because about the only good thing about the old lousy 8 (and many 16) bitters is that you don't have to care about alignment. For micro-optimizations, many 8 bitters would rather care to use the "zero page" 8 bit address range from 0-255 which could shave off a few ticks. – Lundin Mar 13 '23 at 15:33
As for Z80 vs AVR is kind of like saying that dinosaurs from the cretaceous period are much more modern than those from the jurassic period :) – Lundin Mar 13 '23 at 15:40
@Lundin: Another good thing about those parts is that 8-bit systems cost less than 32-bit systems. Although the price difference has dropped to the point that it isn't *usually* significant, I think 8-bit parts can be had for under $0.10 but I don't think any 32-bit parts can. – supercat Mar 13 '23 at 15:46
@Lundin: I really doubt there have been very many situations not involving legacy code where a decision to use an 8-bit system rather than a 16-bit or 32-bit one was motivated by the fact that the latter would impose coarser alignment requirements. – supercat Mar 13 '23 at 15:56
"Another good thing about those parts is that 8-bit systems cost less than 32-bit systems" That stopped being an argument like year 2010 or so. You can get a Cortex M0 for $1 and that should be cheap enough for anyone except those doing disgusting mass production of consumer electronics products. Now if we instead put a value on these parts like "execution time/microamperes needed to run algorithm x" per dollar, then of course 8 bitters are ridiculously expensive and current consuming. Why they like to speak about "MIPS" which should be MIIPS... Million Inefficient Instructions per sec. – Lundin Mar 13 '23 at 16:04
As for 8/16 vs 32, I don't think alignment as such is a big consideration, other than the 32 bitter generating slightly larger executables and probably chewing up more data memory as well. C code safety is a real concern why 8/16 bitters are avoided though, because of all the nightmarish implicit integer promotions of the C language, as well as the nasty 16 bit `int` type causing problems. – Lundin Mar 13 '23 at 16:06
@0___________: The first version of the C89 Standard was completed in 1989, and many aspects of the Standard today are inherited from C89. The 8080, with or without the Z80 extensions, was probably the most popular 8-bit platform targeted by compilers (I'd guess most compilers would have been primarily designed around the 8080, but included options to exploit some Z80 instructions if available), and would likely have been the 8-bit platform most familiar to Committee members. – supercat Mar 13 '23 at 16:08
@Lundin: I would expect that for many tasks an 8-bit or 16-bit design could manage lower average current than a 32-bit one, if the core was designed with as much thought toward power reduction as is applied to 32-bit cores. I think the advantages of 16-bit over 32-bit would likely be too slight to justify the effort from a marketing perspective, and the marginal advantages of 8-bit over 16-bit would be even smaller, but I think the only power-saving advantage of 32-bit MPU over 16-bit would be based upon the larger amount of invested effort. – supercat Mar 13 '23 at 16:18
@supercat Depends. First we'd have to assume either part gets clocked with the same clock, which is by far the most current consuming part. If the task is to "stay asleep for x ms, then wake up and do some intricate calculation, then go to sleep again", then the 32 bitter will likely be around 100 times less current consuming. Anyway, this is getting wildly off-topic. – Lundin Mar 13 '23 at 16:28
@supercat cheapest Cortex-M0 from China (Puya) 24Mhz 32kFLASH and 3kRAM is about $0.08. 8bit ones are usually much more expensive :) – 0___________ Mar 13 '23 at 16:39
