Confused about data alignment

Question

I'm trying to get my head around why data alignment/padding is necessary. From Wikipedia:

"When a modern computer reads from or writes to a memory address, it will do this in word sized chunks"

Yet I can clearly use x86's movb instruction to move data to and from memory at byte resolution. What am I missing here?

score 1 · Answer 1 · edited Jan 22 '18 at 23:46

This is a common misconception. Byte accesses do not require a read-modify-write of the containing 32 or 64-bit chunk of that cache line (or memory for uncached access). See "Can modern x86 hardware not store a single byte to memory?".

A single-byte access is automatically naturally aligned. This means aligned to the width of the access, so it doesn't cross any boundaries wider than itself.
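As a rough illustration of that definition, here is a minimal sketch of a natural-alignment check. The helper name `is_naturally_aligned` is hypothetical (it is not from any library), and it assumes the access width is a power of two.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if an access of `width` bytes starting at
 * `addr` is naturally aligned, i.e. `addr` is a multiple of `width`.
 * Assumes `width` is a power of two (1, 2, 4, 8, ...). */
static bool is_naturally_aligned(uintptr_t addr, size_t width)
{
    assert(width != 0 && (width & (width - 1)) == 0);
    return (addr & (width - 1)) == 0;
}
```

With `width == 1` the mask is zero, so every address passes: that is the sense in which a single-byte access is always naturally aligned. A 4-byte access at `0x1002`, by contrast, fails the check.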

A word load or store is still one single transaction, unless it's split across a cache-line boundary (in which case the CPU internally has to access the relevant part of both cache lines). So that quote is only accurate for machine-word sized accesses. (Note that word in Intel terminology is 16 bits, not the register or bus width of modern x86 CPUs. That's why I said "machine word" in the previous sentence.)

Padding is therefore added to structures in C not because byte-access is inefficient for byte-sized fields, but rather so that objects wider than one byte are naturally aligned (e.g. an int following a char in a struct).
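A small sketch of the `char`-then-`int` case, using only standard `offsetof` and `_Alignof` (C11). The struct and function names are made up for illustration, and the exact padding is ABI-dependent.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example type: a 1-byte field followed by a wider one. */
struct CharThenInt {
    char c;   /* offset 0 */
    int  i;   /* the compiler pads before this so it is aligned */
};

/* The int member starts at a multiple of its own alignment. */
static_assert(offsetof(struct CharThenInt, i) % _Alignof(int) == 0,
              "int member is naturally aligned within the struct");

/* Bytes of padding the compiler inserted between c and i. */
static size_t padding_before_i(void)
{
    return offsetof(struct CharThenInt, i) - sizeof(char);
}
```

On typical x86-64 ABIs, where `int` is 4 bytes with 4-byte alignment, I'd expect `padding_before_i()` to be 3 and `sizeof(struct CharThenInt)` to be 8, but check on your own target rather than relying on that.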

Unlike byte access, unaligned access to wider objects is not directly supported on some relatively common platforms (past and present), and on those that do support it, it may be less efficient, especially when crossing a cache line. C compilers give a struct the alignment requirement of its most-aligned member. For example, a struct of int, char, and double would have 64-bit alignment because of the double member, so padding that aligns the double relative to the start of the struct also aligns it in an absolute sense. Struct members therefore always maintain their natural alignment.
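To make the "relative alignment implies absolute alignment" point concrete, here is a hedged sketch. `struct Mixed` and `addr_of_d` are hypothetical names. The first assertion reflects what mainstream ABIs do, not something the C standard spells out in those words.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical struct mirroring the int/char/double example. */
struct Mixed {
    int    i;
    char   c;
    double d;
};

/* On mainstream ABIs the struct inherits the alignment of its
 * most-aligned member (here, the double). */
static_assert(_Alignof(struct Mixed) == _Alignof(double),
              "struct alignment matches its most-aligned member");

/* sizeof is a multiple of the alignment, so array elements stay aligned. */
static_assert(sizeof(struct Mixed) % _Alignof(struct Mixed) == 0,
              "tail padding keeps array elements aligned");

/* Absolute address of member d in element idx of an array at base. */
static uintptr_t addr_of_d(uintptr_t base, size_t idx)
{
    return base + idx * sizeof(struct Mixed) + offsetof(struct Mixed, d);
}
```

If `base` is a multiple of `_Alignof(struct Mixed)`, then `addr_of_d(base, idx)` is a multiple of `_Alignof(double)` for every `idx`. The compiler only has to reason about offsets within the struct.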

Even on a hypothetical platform with no unaligned access penalties, having unaligned objects would greatly complicate the implementation of memory models that rely on atomic reads and writes, since many platforms guarantee atomicity for those operations only if they are aligned.

Modern CPUs transfer data in cache-line sized chunks, not just 32 or 64-bit words, unless you're accessing an uncacheable memory region (e.g. memory-mapped I/O in a device driver). In that case you'll actually get byte, 16-bit, 32-bit, or 64-bit accesses going over the external bus.

As long as you don't cross a 64-bit boundary, there's no penalty for unaligned access on modern x86 CPUs. (And on Intel specifically, no penalty for unaligned load/store unless you cross a cache-line boundary).
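A sketch of the boundary test implied here. `crosses_boundary` is a hypothetical helper, and the 64-byte cache-line size is an assumption that holds on current mainstream x86 CPUs but is not guaranteed by the ISA.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines, as on current mainstream x86 CPUs. */
#define CACHE_LINE_BYTES 64u

/* True if the `size` bytes starting at `addr` span a `boundary`-byte
 * boundary. `boundary` must be a power of two and `size` at least 1. */
static bool crosses_boundary(uintptr_t addr, size_t size, size_t boundary)
{
    assert(size >= 1 && boundary != 0 && (boundary & (boundary - 1)) == 0);
    uintptr_t first = addr / boundary;              /* chunk of first byte */
    uintptr_t last  = (addr + size - 1) / boundary; /* chunk of last byte  */
    return first != last;
}
```

For example, a 4-byte load at `0x1001` is misaligned but stays inside one 8-byte chunk, so `crosses_boundary(0x1001, 4, 8)` is false. An 8-byte load at `0x103C` straddles two cache lines, so `crosses_boundary(0x103C, 8, CACHE_LINE_BYTES)` is true.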

See also "How can I accurately benchmark unaligned access speed on x86_64", and the performance-tuning links in the x86 tag wiki.

Right, so then why is padding necessary? I.e., what is the real text that should appear in Wikipedia? — BeeOnRope, Jan 22 '18 at 08:36
@BeeOnRope: I had one sentence at the end of a paragraph in there, but you're right I should have expanded on it. Done. As for Wikipedia, I guess just padding to avoid misalignment of word elements or objects relative to word boundaries? I didn't look at the rest of the context, and it's a lot of work trying to figure out how to phrase a generic explanation. — Peter Cordes, Jan 22 '18 at 11:14
Yes, basically why is padding needed at all? Because aligned access is (a) sometimes the only directly supported access and otherwise usually (b) more efficient. Note also it's only on somewhat recent Intel that only cache line splits have hurt - before various other misaligned accesses were also slower. Another reason for alignment is for independence of adjacent elements across threads. — BeeOnRope, Jan 22 '18 at 18:09
@BeeOnRope: I can't remember what the costs are like for unaligned integer access on Core2 and earlier. I know `movdqu` was expensive, but it was expensive even with aligned addresses. Please make an edit on this answer if you have a good idea for what to add to answer that part of the question; I mostly answered to correct the misconception about byte loads/stores. — Peter Cordes, Jan 22 '18 at 18:33
For older CPUs I usually [look here](http://blog.stuffedcow.net/2014/01/x86-memory-disambiguation/), which shows it graphically. There you can see that even back to Core 2 it was still only cache-line crossings that mattered on Intel (barring false-positive store-forwarding issues), but on AMD other boundaries matter (and still seem to matter on Ryzen, although it isn't shown there). It's possible I am mistaken and on Intel unaligned access was never slow? Seems unlikely though, since you find so much x86-specific advice to avoid it, but rarely saying "the only real problem is cache lines"... – BeeOnRope, Jan 22 '18 at 23:39
I did my best to answer that part of the question: you had already answered the "why padding results in natural alignment" part, I just tried to add the "why you'd care about natural alignment" part, which I think completes the point. BTW, I edited the [problematic article](https://en.wikipedia.org/wiki/Data_structure_alignment) which led to the confusion. Later parts of the article were OK, but the intro was just off. – BeeOnRope, Jan 22 '18 at 23:58

score -2 · Accepted Answer · answered May 15 '13 at 01:19

Word-aligned memory access is much faster than byte-aligned access, which makes it much faster to transfer large blocks of data. You can address a single byte, but most likely a whole word will be read from memory and internally reduced to a byte, which makes the access slower.

rslite


Ah ok - so it's not a hardware restriction that forces computers to r/w in word chunks - it's an optimization issue. That helps a lot. Thanks! – gone May 15 '13 at 01:35
No, a single-byte load is not slower than a word load. And doesn't have to access the containing word. You can prove this by benchmarking an asm loop that copies `[buf + 0]` to `[buf + 1]`: a byte store next to a byte load. If the load had to wait for the store, the loop would bottleneck on store-forwarding latency (~5 cycles on Haswell), because the copy makes it loop-carried. But the byte load is independent of the byte store within the same word, so the loop only bottlenecks on one store per clock, not one per 5 clocks. – Peter Cordes Jan 22 '18 at 07:57
The thing that's potentially slow is misaligned word access, not byte accesses. – Peter Cordes Jan 22 '18 at 07:58
