No, try it yourself: build that code into an executable and look at it in a debugger. Without `.p2align 3` before `start3`, it will be misaligned: 2 + 8 = 0xa, not 0x10. The assembler just emits bytes into the output according to your directives, and `.quad` doesn't imply any alignment.
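For example, something like this (a minimal sketch with made-up labels and values, not the question's exact code):

```asm
.data
two_bytes:  .word  0x1234               # 2 bytes, at offset 0x0 in .data
a_qword:    .quad  0x1122334455667788   # 8 bytes at offset 0x2: .quad didn't align it
            .p2align 3                  # pad with zeros up to the next 8-byte boundary
start3:     .quad  0xdeadbeefcafebabe   # at offset 0x10; without .p2align 3 it'd be at 0xa
```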
As for the main part of the question: no, a 1-byte load logically accesses just that 1 byte. For example, if you were loading from an MMIO address, the device would only see that 1-byte access, not a qword access.
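In asm, that just means using a byte-sized operand, e.g. (AT&T syntax; the choice of `%rdi` / `%eax` here is arbitrary):

```asm
movzbl  (%rdi), %eax     # load exactly 1 byte from [rdi], zero-extended into EAX
movb    (%rdi), %al      # or load 1 byte into AL, leaving the rest of RAX unchanged
```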
> I already know the alignment of my CPU which is 8-bytes alignment.
There is no sense in which this is true, except for the performance advantage of 8-byte loads over doing the same work with narrower ones. x86-64 can do 1, 2, 4, 8, or 16-byte loads (and with AVX or AVX-512, 32- or 64-byte loads as well), and it allows unaligned addresses for any of these sizes. Some forms of 16-byte load (like SSE memory operands) require 16-byte alignment, but nothing below 16 bytes does. (There is an Alignment Check (AC) flag you can set in EFLAGS, but it's not very usable most of the time because compilers and libc implementations of `memcpy` freely use unaligned accesses.) Even microarchitecturally, modern x86 hardware truly does efficient unaligned accesses to its caches.
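For example (a sketch, assuming `%rdi` holds a pointer into ordinary memory; the offsets are arbitrary):

```asm
movzbl  1(%rdi), %eax    # 1-byte load: no alignment requirement at all
movw    3(%rdi), %ax     # 2-byte load from an odd address: allowed
movq    1(%rdi), %rax    # possibly-misaligned 8-byte load: also allowed
movups  1(%rdi), %xmm0   # unaligned 16-byte SSE load: always allowed
movaps  (%rdi), %xmm0    # 16-byte SSE load that faults (#GP) unless the
                         #   address is 16-byte aligned
```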
For cacheable RAM, what the CPU does internally isn't visible except via performance, but it's logically equivalent to accessing just the bytes you asked for.
In modern x86 CPUs, it's actually a whole 64-byte cache line that's loaded from RAM. But we can more or less prove that even the hardware is doing single-byte accesses to cache because byte stores don't cause store-forwarding stalls for adjacent loads, and other factors. See Can modern x86 hardware not store a single byte to memory?
Note that some non-x86 CPUs do have slower cache access for single-byte stores or even loads. (Are there any modern CPUs where a cached byte store is actually slower than a word store?). x86 CPUs are designed for efficient unaligned and single-byte accesses, so software keeps using them. But on ISAs that historically haven't supported unaligned accesses, like MIPS or to some degree ARM, software usually avoids unaligned access, so there's less benefit to hardware spending a lot of transistors to make it fast.
(Also, current x86 designs have targeted use-cases where spending more transistors and power for minor speed gains is desirable, while most ARM designs haven't. There's also the factor of x86 CPUs needing to speed up whatever existing binaries do, with less hope of getting people to recompile or redesign software to avoid things like unaligned access. All that said, modern ARM / AArch64 has reasonable unaligned / byte access I think, but not zero-penalty the way modern x86 is for any load that doesn't span a cache-line boundary.)
Footnote 1: Note that this applies to asm; if you're writing in C, the language / ABI rules apply until the compiler has actually nailed down the asm. See Why does unaligned access to mmap'ed memory sometimes segfault on AMD64? for an example where misaligned C pointers violate the compiler's assumptions in a way that causes a problem when compiling for x86-64.