
Most processor architectures default to natural alignment, but I think processor-word alignment is a more efficient requirement: it saves memory without any performance overhead compared to natural alignment.

For example, under natural alignment a double has an alignment of 8, but on 32-bit processors it would have no performance overhead if double had an alignment of 4, and it would save memory. This source (§3.6.4) states that double has an alignment of 8 on 32-bit processors:

Align 64-bit data so that its base address is a multiple of eight.

Similar examples can be seen on 64-bit processors: a 16-byte data type (int128) has an alignment of 16, whereas it could have been beneficial to keep the alignment equal to the processor word size (i.e. 8 bytes on 64-bit processors).
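The natural-alignment requirements discussed above can be checked directly with C11's alignof. This is a sketch: the exact values asserted below are what mainstream 64-bit ABIs (x86-64 SysV, AArch64) produce, not a language guarantee.

```c
#include <stdalign.h>  /* alignof (C11) */
#include <stdint.h>

/* Under natural alignment, each scalar's alignment equals its size.
 * These values hold on common 64-bit ABIs; other ABIs may differ
 * (e.g. 32-bit x86 Linux aligns int64_t to only 4 bytes). */
_Static_assert(alignof(int16_t) == sizeof(int16_t), "short naturally aligned");
_Static_assert(alignof(int32_t) == sizeof(int32_t), "int naturally aligned");
_Static_assert(alignof(double) <= sizeof(double),
               "alignment never exceeds size (always true for complete types)");
```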

My guess is that this standard of natural alignment was created so that, when data was read directly from the wire, machines could default to natural alignment and not have to deal with different alignments of the same data type depending on the sender's CPU architecture.

When all the fields of a data structure fit in a single CPU word, they still have padding between them due to natural alignment. I do not think that padding is needed in this case, because any field of the structure would take the same number of byte shifts to access regardless of where in the CPU word it is stored (please correct me on this if I am wrong).

For example, consider this struct:

struct example {
   char i; // 1 byte
   // 1 byte padding
   short j; // 2 bytes
   int k; // 4 bytes
   char l; // 1 byte
   // 3 bytes trailing padding
} foo;

The padding between foo.i and foo.j is not needed in my opinion, because accessing foo.j would require byte shifts either way.
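The layout described in the comments above can be verified with offsetof. The offsets asserted here are what a typical natural-alignment ABI (x86-64 SysV, and 32-bit x86 as well) produces; other ABIs could lay the struct out differently.

```c
#include <stddef.h>  /* offsetof */

/* Same struct as in the question, with the ABI-typical offsets noted. */
struct example {
    char  i;  /* offset 0 */
    short j;  /* offset 2: 1 byte of padding after i */
    int   k;  /* offset 4 */
    char  l;  /* offset 8, then 3 bytes of tail padding */
};

_Static_assert(offsetof(struct example, j) == 2, "1 byte of padding before j");
_Static_assert(offsetof(struct example, k) == 4, "k is 4-byte aligned");
_Static_assert(sizeof(struct example) == 12, "tail padding rounds size up to 12");
```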

To summarize my question: I want to know what the benefits of natural alignment are over processor-word-based alignment.

I also want to know whether inserting padding between fields that all fit in one CPU word is any better than storing those fields without padding. Also, does the position of fields within the same CPU word make any difference?
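For what it's worth, the no-padding layout the question is asking about can actually be requested today with GCC/Clang's packed attribute (non-standard; MSVC spells it #pragma pack). A sketch, assuming a GCC-compatible compiler:

```c
/* Packed version of the question's struct: the compiler drops all
 * padding and generates whatever code is needed to access the now
 * possibly-misaligned members, potentially at a speed cost. */
struct packed_example {
    char  i;
    short j;
    int   k;
    char  l;
} __attribute__((packed));

_Static_assert(sizeof(struct packed_example) == 8, "no padding at all");
```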

Peter Cordes
1 Answer


All 64-bit CPUs have native 4-byte loads, and all except Alpha have 2-byte and byte loads. On some CPUs those narrow loads require or are more efficient with naturally-aligned data.

I think you're assuming that narrow data is loaded by actually loading the containing 8-byte chunk and manually(?) extracting with a bit-shift or byte-shift, rather than logic that only has to select one of two 4-byte chunks of an 8-byte fetch from cache, based on only one address bit.
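The shift-and-extract model described above can be sketched in C. This is purely an illustration of the assumed model (the helper name and little-endian byte numbering are my own), not how load ports actually work:

```c
#include <stdint.h>

/* Pull a 16-bit field out of an already-loaded 8-byte word by
 * shifting and truncating. In this model byte_offset could be any
 * value 0..6, including odd ones -- exactly the flexibility that
 * real load-port hardware avoids paying for. */
static inline uint16_t extract_u16(uint64_t word, unsigned byte_offset) {
    return (uint16_t)(word >> (8 * byte_offset));  /* little-endian view */
}
```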

CPUs could maybe have been designed differently, to allow 2-byte loads at any offset within an 8-byte chunk, including odd addresses, but that would require a few more gates in load ports, and longer gate-delays. And maybe even worse for stores, in terms of masking the write into cache?

Also, the rule for what's allowed or not becomes more complicated; slightly harder for hardware to check on a CPU that enforces alignment requirements. (A CPU that allows unaligned loads already needs about that much logic to detect when a load/store is split across cache lines, or across chunks of the same line if it can only fetch in small chunks.)

But more importantly, maybe harder for human programmers and compilers to efficiently take advantage of. Maybe that's not a big problem; perhaps this is a plausible path that computer architecture could have taken, in which case it's something we'd all be used to by now.

(Intel/AMD CPUs actually do have rules like this for atomicity guarantees; any power-of-2 access to cacheable memory contained within an 8-byte chunk is guaranteed atomic. Or even to uncacheable memory for a 16-bit access within a 32-bit dword. But since there are some ISAs that simply require alignment, and because C alignof / alignas doesn't have a way to describe that, languages like C pick the lowest common denominator and require natural alignment for _Atomic types.)
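The lowest-common-denominator choice for _Atomic types can be observed with alignof. The only portable fact worth asserting is that the atomic version is never less strictly aligned than the plain type; on some 32-bit ABIs (e.g. i386, where plain int64_t is 4-byte aligned) it is strictly more aligned:

```c
#include <stdalign.h>
#include <stdatomic.h>
#include <stdint.h>

/* C11 may give _Atomic types stricter alignment than the plain type
 * so that lock-free hardware atomicity guarantees apply. */
_Static_assert(alignof(_Atomic int64_t) >= alignof(int64_t),
               "atomic alignment is never weaker than the plain type's");
```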


BTW, many SIMD instruction sets have more efficient aligned loads than unaligned, especially 10 or 20 years ago, so the relevant width is the SIMD vector width if you want to copy a whole 16-byte struct with one instruction, not the width of a general-purpose integer reg. e.g. x86 movups vs. movaps xmm, mem (16-byte alignment required).
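If you do want a 16-byte struct to be copyable with one aligned SIMD load/store (movaps-style), you have to over-align it explicitly; C11 alignas does that portably. The struct name here is just for illustration:

```c
#include <stdalign.h>

/* Over-align a 16-byte aggregate to the SIMD vector width so an
 * aligned 16-byte load/store can copy it in one instruction. */
struct vec4 {
    alignas(16) float v[4];
};

_Static_assert(alignof(struct vec4) == 16, "16-byte aligned");
_Static_assert(sizeof(struct vec4) == 16, "and still only 16 bytes");
```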

But normally you don't over-align structs that contain smaller members.


Alignment to more than the CPU bitness often does matter

For example, a double has an alignment of 8 according to natural alignment but on 32-bit processors, it would have no performance overhead if double had an alignment of 4 and it would have saved memory.

Wrong on modern CPUs. The FPU on Intel CPUs since P5 Pentium could do 8-byte accesses to cache, regardless of the fact that the integer register width was only 32-bit. So could the MMX unit.

Modern 32-bit ARM CPUs are similar, with aligned 8-byte FPU loads/stores being preferred.

On Pentium 4, SIMD load/store could access 16 bytes in a single operation. (P6 family split SSE/SSE2 operations into 8-byte halves until Core2, the first P6-family to support x86-64.)

32-bit only describes address and/or integer register width, not max cache-access width by the FPU or load-pair / store-pair instructions.
