Aligning Data in Assembly

Question

I'm reading the Kip Irvine book, "Assembly Language for x86 Processors," and in the section discussing the ALIGN directive, the author mentions that the "CPU processes data stored at even-numbered addresses more quickly than those at odd-numbered addresses." My question, then, is multifaceted:

Why does the CPU process data at even-numbered addresses more quickly?
How much of an effect would even-numbered vs. odd-numbered addresses have?
Will a data segment generally start on an even-numbered address, or does it depend?

In the section detailing the NOP instruction, the author mentions that x86 processors load data from even doubleword addresses more quickly. Then, would an efficiency hierarchy be: addresses that are multiples of 8 (even doublewords, if I understand correctly) > addresses that are multiples of 2 > addresses that are odd?

Note that code alignment is different from data alignment. Code fetch goes through different pathways than data load/store instructions. Code alignment sometimes matters for branch targets, but instructions in the middle of a straight-line block of code aren't fetched separately. But anyway, related: [How can I accurately benchmark unaligned access speed on x86\_64](https://stackoverflow.com/a/45129784). Historically yes, CPUs like 8086 were faster with aligned data. These days only crossing a cache-line boundary matters on Intel CPUs, or sometimes 16 or 32 byte boundary on AMD. — Peter Cordes, Apr 13 '20 at 00:45
Also, 8 bytes is a qword in x86 terminology. An x86 "word" is 2 bytes. — Peter Cordes, Apr 13 '20 at 00:47
That gives me a new direction to research. Thank you. It looks like those cache-lines on Intel CPUs are 64B. Regarding even doublewords, a doubleword being 4 bytes, every other doubleword would be 8 bytes apart or am I misunderstanding? — jfogus, Apr 13 '20 at 13:40
In an array of `uint32_t arr[]`, the 4 byte dword elements are 4 bytes apart. So yes, every *other* element is 8 bytes. Naturally-aligned data is always the best case, but "best case" performance can include misaligned data that doesn't cross cache line boundaries, depending on the CPU. See also [Can modern x86 hardware not store a single byte to memory?](https://stackoverflow.com/q/46721075) re: unaligned and single-byte load/store performance (and atomicity) on x86 vs. many non-x86 CPUs. — Peter Cordes, Apr 13 '20 at 14:15
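
To make the arithmetic in the comments above concrete, here is a minimal C sketch. It assumes a 64-byte cache line (the figure mentioned above for Intel CPUs), and the names `CACHE_LINE_SIZE`, `is_aligned`, `crosses_cache_line`, and `dword_distance` are hypothetical helpers made up for illustration. It only classifies addresses; it does not measure any performance difference.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines. Other CPUs may differ. */
#define CACHE_LINE_SIZE 64u

/* True if addr is a multiple of align. align must be a power of two. */
static bool is_aligned(uintptr_t addr, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr & (align - 1)) == 0;
}

/* True if an access of `size` bytes (size >= 1) starting at addr
 * touches two different cache lines. */
static bool crosses_cache_line(uintptr_t addr, size_t size)
{
    uintptr_t first_line = addr / CACHE_LINE_SIZE;
    uintptr_t last_line  = (addr + size - 1) / CACHE_LINE_SIZE;
    return first_line != last_line;
}

/* Byte distance between elements i and j (i <= j) of a uint32_t array.
 * Each dword element is 4 bytes, so every other element is 8 bytes apart. */
static size_t dword_distance(size_t i, size_t j)
{
    return (j - i) * sizeof(uint32_t);
}
```

For example, with offsets taken relative to a line-aligned buffer, a 4-byte load at offset 62 touches bytes 62 through 65 and so spans two lines, while a misaligned 2-byte load at offset 61 stays inside one line. The second case is the kind of access that, per the comments above, can still hit "best case" performance on some CPUs.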

0 Answers