Why 16-byte alignment for `long double`?

Question

A 64-bit architecture like x86-64 has a word size of 64 bits. If a memory access crosses a word boundary, it takes twice as long to access the data, so alignment is required. This is my understanding; correct me if I am wrong.

GCC uses 16-byte alignment for `long double` (MSVC, at least, uses 8-byte alignment), even though its non-padding size is 10 bytes. With 8-byte alignment a 10-byte value needs 2 read cycles, and the same is true with 16-byte alignment. So why the stricter 16-byte alignment? What purpose does alignment serve other than the one I mentioned above?

Also, since the non-padding part of `long double` (the 80-bit x87 extended-precision format) is 10 bytes, 4-byte alignment should be sufficient: the data can still be read in 2 read cycles (split as either 4+6 or 8+2 bytes). Please also explain where this assumption goes wrong.

(The actual `sizeof(long double)` is 12 in the i386 System V ABI and 16 in the x86-64 System V ABI, each a multiple of the respective `alignof(long double)`: 4 and 16.)

x86-64 doesn't have a "word size", that's not a meaningful concept for x86, which can load/store any power-of-2 width from 1 byte to 32 bytes (or 64 with AVX-512 capable CPUs), with near-equal performance as long as the load doesn't cross a 64-byte *cache line* boundary. — Peter Cordes, Jul 10 '21 at 18:50
Probably because older CPUs could only do 16-byte load/store efficiently when it was naturally aligned. [Why does the x86-64 / AMD64 System V ABI mandate a 16 byte stack alignment?](https://stackoverflow.com/q/49391001). 80-bit x87 is slow anyway, so it's a somewhat questionable decision to use that much extra space in arrays, although `fld m80` does decode into a 2-byte and an 8-byte load, so 8 or 16 byte alignment are both sufficient to avoid cache-line splits in either of the halves. But only if the size is 16 bytes, so you might as well make the align match the size for SSE copying. — Peter Cordes, Jul 10 '21 at 18:55
Intel Optimization Manual recommends a 16 byte alignment for 80bit `long double`, but it does not explain why or what the impact is. My quick experiments showed no impact of (mis)alignment, only of crossing cache line boundaries, as expected. — harold, Jul 10 '21 at 19:06
Re: the concept of a "word": see [Weird data sizes?](https://stackoverflow.com/q/32548409) and [Does Word length == number of bits transferred between memory and CPU per access?](https://stackoverflow.com/a/36996828), and my longish answer at [How does the CPU reads a double value?](https://stackoverflow.com/a/45900274) re: how CPUs access memory *through cache*. Also [What's the actual effect of successful unaligned accesses on x86?](https://stackoverflow.com/q/12491578) / [How can I accurately benchmark unaligned access speed on x86\_64](https://stackoverflow.com/a/45129784) — Peter Cordes, Jul 10 '21 at 21:06
Re: x87 performance on modern CPUs (and AMD K8, which was the relevant ISA when the x86-64 System V ABI was being designed), see [Did any compiler fully use Intel x87 80-bit floating point?](https://retrocomputing.stackexchange.com/a/9760) on retrocomputing.SE — Peter Cordes, Jul 10 '21 at 21:09
Aligning each object to a multiple of its size is the easiest way to ensure that no object crosses a cache line boundary. — prl, Jul 10 '21 at 22:12
