
I'm trying to understand how CPU memory alignment and CPU memory access granularity work, but I'm a little confused, since I can't find out what the access granularity of my CPU is, or how the two interact to affect performance. So, given this assembly code running on an x86-64 processor:

start0:
  .byte 0   # at address 0x0
start1:
  .byte 1   # at address 0x1
start2:     # at address 0x2
  .quad 2
start3:     # at address 0x10 for example
  .quad 3

movb start0, %al    # (1) aligned
movb start1, %al    # (2) unaligned
movq start2, %rax   # (3) unaligned
movq start3, %rax   # (4) aligned

Does (1) mean the CPU reads only 1 byte from memory, or does the CPU read 64 bits and shift out the unneeded part?

Would (2) cause the CPU to read starting at address 0x0? And how much would it read: 64 bits, or something else?

Would (3) cause the CPU to perform two memory reads, one at address 0x0 and another at address 0x8, and then combine them to get the right value?

And (4) will just read 8 bytes from address 0x10 as normal, right?

Also, how can I find out the access granularity of my CPU? Running CPUID seems to give no results. I already know the alignment of my CPU, which is 8-byte alignment.

KMG
  • No, try it yourself with that code in an executable and use a debugger. Without `.p2align 3` before `start3`, it will be misaligned. 2 + 8 = 0xa, not 0x10. The assembler just emits bytes into the output according to your directive, and `.quad` doesn't imply alignment. – Peter Cordes Jan 18 '21 at 11:44
  • “x86-64 processor” is not specific enough to give a precise answer to your questions. For modern x86 processors (but I believe not all x86-64 processors), all accesses that do not cross a cache-line happen at no performance penalty. So unless your variable crosses a cache-line, alignment isn't super important. It is still a good idea to align your variables though. – fuz Jan 18 '21 at 11:45
  • As for your other question: no, a 1-byte load logically accesses just that 1 byte. e.g. if you were loading from an MMIO address, the device would only see that 1-byte access, not a qword access. For cacheable RAM, what the CPU does internally isn't visible except via performance, but is logically equivalent. (And in modern x86 CPUs, we can more or less prove that even the hardware is doing single-byte accesses because byte stores don't cause store-forwarding stalls for adjacent loads, and other factors.) – Peter Cordes Jan 18 '21 at 11:48

1 Answer


No, try it yourself with that code in an executable and use a debugger. Without .p2align 3 before start3, it will be misaligned. 2 + 8 = 0xa, not 0x10. The assembler just emits bytes into the output according to your directive, and .quad doesn't imply alignment.
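For instance, a version of the question's data layout that actually puts start3 at an aligned address could look like this (a sketch; the addresses in the comments assume the section starts at 0x0):

start0:
  .byte 0      # at address 0x0
start1:
  .byte 1      # at address 0x1
start2:
  .quad 2      # at address 0x2: misaligned, since 0x2 isn't a multiple of 8
  .p2align 3   # pad from 0xa up to the next multiple of 2^3 = 8, i.e. 0x10
start3:
  .quad 3      # at address 0x10: now actually 8-byte aligned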


As for the main part of the question: no, a 1-byte load logically accesses just that 1 byte. e.g. if you were loading from an MMIO address, the device would only see that 1-byte access, not a qword access.

I already know the alignment of my CPU which is 8-bytes alignment.

There is no sense in which this is true, except for performance advantages of 8-byte loads. x86-64 can do 1, 2, 4, 8, or 16-byte loads. (And with AVX or AVX-512, 32 or 64-byte loads as well.) But it allows unaligned loads for any of these sizes. Some forms of 16-byte loads (like SSE memory operands) require 16-byte alignment, but nothing below 16 does. (There is an Alignment Check (AC) flag you can set in EFLAGS, but it's not very usable most of the time because compilers and libc implementations of memcpy freely use unaligned accesses.) Even microarchitecturally, modern x86 hardware truly does efficient unaligned accesses to its caches.
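As an illustration (a sketch; `buf` is a hypothetical label, and the offsets are chosen only to make the accesses unaligned), all of these load widths are architecturally legal on x86-64 at any alignment, with the SSE exception noted in the last line:

    .data
buf:
    .zero 32                    # hypothetical 32-byte buffer

    .text
    movzbl buf+1(%rip), %eax    # 1-byte load: alignment is irrelevant
    movzwl buf+1(%rip), %eax    # 2-byte load from an odd address: allowed
    movl   buf+3(%rip), %eax    # 4-byte unaligned load: allowed
    movq   buf+5(%rip), %rax    # 8-byte unaligned load: allowed
    movups buf+1(%rip), %xmm0   # 16-byte unaligned SSE load: allowed
  # movaps buf+1(%rip), %xmm0   # 16-byte load that *requires* 16-byte alignment:
                                # this one would #GP fault at an unaligned address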

For cacheable RAM, what the CPU does internally isn't visible except via performance, but is logically equivalent.

In modern x86 CPUs, it's actually a whole 64-byte cache line that's loaded from RAM. But we can more or less prove that even the hardware is doing single-byte accesses to cache because byte stores don't cause store-forwarding stalls for adjacent loads, and other factors. See Can modern x86 hardware not store a single byte to memory?
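A rough sketch of the access pattern behind that argument, reusing the hypothetical `buf` from above (actually observing the effect requires performance counters, which this doesn't show):

    movb   $0x42, buf+1(%rip)   # store a single byte
    movzbl buf+2(%rip), %eax    # reload the *adjacent* byte: no store-forwarding
                                # stall on modern x86, which you'd expect if the
                                # store were really a read-modify-write of a wider word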

Note that some non-x86 CPUs do have slower cache access for single-byte stores or even loads. (Are there any modern CPUs where a cached byte store is actually slower than a word store?). x86 CPUs are designed for efficient unaligned and single-byte accesses so software keeps using them. But on ISAs that historically haven't supported unaligned accesses, like MIPS or to some degree ARM, software usually avoids unaligned access, so there's less benefit to hardware spending a lot of transistors to make it fast.

(Also, current x86 designs have targeted a use-case where spending more transistors and power for minor speed gains is desirable, while most ARM designs haven't. Also the factor of x86 CPUs trying to speed up things that existing binaries do, with less hope of getting people to recompile or redesign software to avoid things like unaligned access. All that said, modern ARM / AArch64 has reasonable unaligned / byte access I think, but not zero-penalty the way modern x86 does for any load that doesn't span a cache-line boundary.)


Footnote 1: Note that this applies to asm; if you're writing in C, the language / ABI rules apply until the compiler has actually nailed down the asm. See Why does unaligned access to mmap'ed memory sometimes segfault on AMD64? for an example where misaligned C pointers violate the compiler's assumptions in a way that causes a problem when compiling for x86-64.
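For instance (a hedged sketch, not the exact code from that Q&A), a compiler that has assumed a pointer is 16-byte aligned may emit an alignment-required instruction:

    movdqa (%rdi), %xmm0   # aligned 16-byte load: faults (#GP -> SIGSEGV) if %rdi
                           # isn't really 16-byte aligned, even though the hardware
                           # could have done the same load unaligned with movdqu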

Peter Cordes
  • Thanks a lot, but what's meant by spanning a cache-line boundary? – KMG Jan 18 '21 at 12:01
  • @KhaledGaber: e.g. a 4-byte load from `0x...3f`, so one byte comes from one cache line, the other 3 come from the next 64-byte-aligned cache line. – Peter Cordes Jan 18 '21 at 12:03
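To make that concrete, a sketch (the label and alignment directive are illustrative):

    .data
    .p2align 6                  # put buf at the start of a 64-byte cache line
buf:
    .zero 128

    .text
    movl buf+63(%rip), %eax     # 4-byte load at offset 63: byte 63 is in one cache
                                # line, bytes 64-66 are in the next, so this single
                                # load spans a cache-line boundary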