
I read somewhere that when programming on a 32-bit x86 processor in assembly, it is more efficient to write and read memory in multiples of 4 bytes, despite the fact that you can work in units as small as 1 byte. Is this true, and why? What is the underlying design that causes it to work like that?

Jason Mills
    If you need to access all 4 bytes, *and if the address is 4-byte-aligned*, then it's probably ~4x faster, since all data paths on recent 32-bit CPUs are at least 32 bits wide. The alignment constraint is there because the underlying memory and cache hardware access memory in larger chunks than 1 byte at a time, so if you ask for a non-aligned DWORD, the hardware has to read 2 adjacent DWORDs and assemble the one you asked for with shifts. (Though I've also heard that the penalty for unaligned memory access is lower on the latest CPUs.) – j_random_hacker Jun 16 '14 at 03:47
  • That's probably an oversimplification, but that's the general gist: if possible, align all memory accesses to the largest feasible power of 2 and use the largest element size available. Different CPUs and motherboards probably vary widely in exactly how they handle non-aligned reads and writes; e.g. it might well be that high-end motherboards can detect a sequence of non-aligned accesses to RAM and intelligently turn it into a non-aligned access at each end plus a run of aligned accesses in between. – j_random_hacker Jun 16 '14 at 03:51
  • Slightly off-topic question: if I wanted to access a single byte, would the data bus return 4 bytes and ignore the unwanted three, or can it actually access a single byte by itself? – Jason Mills Jun 16 '14 at 03:56
  • I don't know the answer to that sorry, but I would guess that accessing a single byte only reads that byte if it's already in cache. If it's not in cache, then it will almost certainly read the entire cache line (usually 32 or 64 bytes) into cache from RAM. – j_random_hacker Jun 16 '14 at 03:59
  • What do you mean by efficient? Time per byte or time per read (however many bytes it may be)? – harold Jun 16 '14 at 06:49
  • @j_random_hacker and Jason: loading/storing a single byte doesn't access or disturb neighbouring bytes. (See https://stackoverflow.com/questions/46721075/can-modern-x86-hardware-not-store-a-single-byte-to-memory). e.g. a byte store to `[rdi]` and a byte load from `[rdi+1]` doesn't create a false dependency or failed store-forwarding, even if they're inside the same aligned 4 or 8-byte chunk. But yes, for cacheable memory, a byte load still loads the whole cache line (or requests it from another cache, if it was modified by another CPU core). – Peter Cordes Nov 23 '17 at 07:53

1 Answer


The only real performance penalty is reading multi-byte values from non-aligned locations. Everything else is gravy, especially given a likely cache hit on the 2nd and subsequent aligned bytes.