This is a common misconception. Byte accesses do not require a read-modify-write of the containing 32-bit or 64-bit chunk of the cache line (or of memory, for uncached accesses). See "Can modern x86 hardware not store a single byte to memory?"
A single-byte access is automatically naturally aligned. This means aligned to the width of the access, so it doesn't cross any boundaries wider than itself.
A word load or store is still one single transaction, unless it's split across a cache-line boundary (in which case the CPU internally has to access the relevant part of both cache lines). So that quote is only accurate for machine-word sized accesses. (Note that `word` in Intel terminology is 16 bits, not the register or bus width of modern x86 CPUs. That's why I said "machine word" in the previous sentence.)
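To make the cache-line split case concrete, here is a minimal sketch of a check for whether an access touches two lines. The helper name `crosses_cache_line` and the `CACHE_LINE_SIZE` constant are mine, not from any library, and the 64-byte line size is an assumption that holds for current x86 CPUs but not for every platform.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines, as on current x86 CPUs. */
#define CACHE_LINE_SIZE 64u

/* Hypothetical helper: true if an access of `size` bytes starting at
 * address `addr` touches two different cache lines. */
static bool crosses_cache_line(uintptr_t addr, size_t size)
{
    assert(size > 0 && size <= CACHE_LINE_SIZE);
    uintptr_t first_line = addr / CACHE_LINE_SIZE;
    uintptr_t last_line  = (addr + size - 1) / CACHE_LINE_SIZE;
    return first_line != last_line;
}
```

A 1-byte access can never return true here, which is another way of saying that byte accesses are always naturally aligned. A 4-byte access at offset 62 within a line does cross into the next line.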
Padding is therefore added to structures in C not because byte access is inefficient for byte-sized fields, but so that objects wider than one byte are naturally aligned (e.g. an `int` following a `char` in a struct).
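You can see that padding directly with `offsetof`. This is only a sketch: the struct name `char_then_int` and the helper `padding_before_int` are made up for illustration, and the exact amount of padding depends on the ABI (typically 3 bytes where `int` is 4-byte aligned).

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative struct: an int following a char. */
struct char_then_int {
    char c;   /* always at offset 0 */
    int  i;   /* placed at a multiple of _Alignof(int) */
};

/* The int member is naturally aligned within the struct. */
static_assert(offsetof(struct char_then_int, i) % _Alignof(int) == 0,
              "i must be aligned relative to the start of the struct");

/* Hypothetical helper: bytes of padding inserted between c and i. */
static size_t padding_before_int(void)
{
    return offsetof(struct char_then_int, i) - sizeof(char);
}
```

The `char` itself needs no padding in front of it. The gap exists only so that the wider member after it lands on its own alignment boundary.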
Unlike with byte access, some relatively common platforms do not (or did not) support unaligned access directly for wider types, and on those that do, it may be less efficient, especially when it crosses a cache line. C compilers give a struct the alignment requirement of its most strictly aligned member. For example, a struct of `int`, `char`, and `double` would have 64-bit alignment because of the `double` member, so padding that aligns the `double` relative to the start of the struct also aligns it in an absolute sense. Struct members therefore always keep their natural alignment.
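Here is a small sketch of that rule using C11 `_Alignof`. The struct name `mixed` and the helper `is_aligned` are illustrative names of my own. The assertions only state what the language and mainstream ABIs guarantee, not specific byte counts, since those vary by target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative struct with members of different alignment. */
struct mixed {
    int    i;
    char   c;
    double d;
};

/* The struct is at least as strictly aligned as its double member. */
static_assert(_Alignof(struct mixed) >= _Alignof(double),
              "struct alignment covers its most-aligned member");

/* The double is aligned relative to the start of the struct. */
static_assert(offsetof(struct mixed, d) % _Alignof(double) == 0,
              "d is aligned within the struct");

/* sizeof is a multiple of the alignment, so array elements stay aligned. */
static_assert(sizeof(struct mixed) % _Alignof(struct mixed) == 0,
              "array elements keep the struct's alignment");

/* Hypothetical helper: check the absolute alignment of a pointer. */
static int is_aligned(const void *p, size_t alignment)
{
    return (uintptr_t)p % alignment == 0;
}
```

Because both the struct's base address and the member offset are multiples of `_Alignof(double)`, `is_aligned(&s.d, _Alignof(double))` holds for any properly allocated `struct mixed s`, including every element of an array.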
Even on a hypothetical platform with no unaligned access penalties, having unaligned objects would greatly complicate the implementation of memory models that rely on atomic reads and writes, since many platforms guarantee atomicity for those operations only if they are aligned.
Modern CPUs transfer data in cache-line sized chunks, not just 32 or 64-bit words, unless you're accessing an uncacheable memory region (e.g. memory-mapped I/O in a device driver), in which case you actually get byte, 16-bit, 32-bit, or 64-bit accesses going over the external bus.
As long as you don't cross a 64-bit boundary, there's no penalty for unaligned access on modern x86 CPUs. (And on Intel specifically, no penalty for unaligned load/store unless you cross a cache-line boundary).
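Even where the hardware handles unaligned access well, dereferencing a misaligned pointer is still undefined behavior in C. A common way to express an unaligned load portably is `memcpy` into a local variable, sketched below. The function name `load_u32_unaligned` is my own; mainstream optimizing compilers typically turn this into a single load instruction on x86, but that is a property of the compiler, not something the language promises.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read a 32-bit value from any address,
 * aligned or not, without relying on undefined behavior. */
static uint32_t load_u32_unaligned(const void *p)
{
    assert(p != NULL);
    uint32_t v;
    memcpy(&v, p, sizeof v);   /* byte-wise copy has no alignment requirement */
    return v;
}
```

For example, `load_u32_unaligned(buf + 1)` reads four bytes starting at an odd offset in a byte buffer and returns them in the host's byte order.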
See also "How can I accurately benchmark unaligned access speed on x86_64", and the performance-tuning links in the x86 tag wiki.