
I am trying to understand why a double is aligned on an 8-byte boundary and not just a 4-byte boundary. In this article it says:

  1. When memory reading is efficient in reading 4 bytes at a time on 32 bit machine, why should a double type be aligned on 8 byte boundary?

It is important to note that most of the processors will have math co-processor, called Floating Point Unit (FPU). Any floating point operation in the code will be translated into FPU instructions. The main processor is nothing to do with floating point execution. All this will be done behind the scenes.

As per standard, double type will occupy 8 bytes. And, every floating point operation performed in FPU will be of 64 bit length. Even float types will be promoted to 64 bit prior to execution.

The 64 bit length of FPU registers forces double type to be allocated on 8 byte boundary. I am assuming (I don’t have concrete information) in case of FPU operations, data fetch might be different, I mean the data bus, since it goes to FPU. Hence, the address decoding will be different for double types (which is expected to be on 8 byte boundary). It means, the address decoding circuits of floating point unit will not have last 3 pins.

While in this SO question it says:

The reason to align a data value of size 2^N on a boundary of 2^N is to avoid the possibility that the value will be split across a cache line boundary.

The x86-32 processor can fetch a double from any word boundary (8 byte aligned or not) in at most two, 32-bit memory reads. But if the value is split across a cache line boundary, then the time to fetch the 2nd word may be quite long because of the need to fetch a 2nd cache line from memory. This produces poor processor performance unnecessarily. (As a practical matter, the current processors don't fetch 32-bits from the memory at a time; they tend to fetch much bigger values on much wider busses to enable really high data bandwidths; the actual time to fetch both words if they are in the same cache line, and already cached, may be just 1 clock).

A free consequence of this alignment scheme is that such values also do not cross page boundaries. This avoids the possibility of a page fault in the middle of a data fetch.

So, you should align doubles on 8 byte boundaries for performance reasons. And the compilers know this and just do it for you.

So which one is the correct answer? Is it both?

John
  • Any source that mentions a co-processor is horribly outdated. The FPU has been integrated since before I was born. Crossing cache line boundaries was disastrous on Core2 (to the point that the most horrible hacks to avoid them were worth it), and is still not optimal to this day. – harold Jan 10 '15 at 17:56

2 Answers


It is important to note that most of the processors will have math co-processor, called Floating Point Unit (FPU).

So, first of all, the article is somewhat wrong. There's no separate FPU in processors anymore; floating-point arithmetic instructions are handled in the same instruction pipelines as everything else.

The main processor is nothing to do with floating point execution.

This is 2015; we're not talking about an Intel 486, so this is simply wrong.

As per standard, double type will occupy 8 bytes. And, every floating point operation performed in FPU will be of 64 bit length. Even float types will be promoted to 64 bit prior to execution.

This was never true, to my knowledge; there are instructions that work on single precision floats, and instructions that work on double precision.

The 64 bit length of FPU registers forces double type to be allocated on 8 byte boundary.

That's simply not true. There are some instructions that can only work with specially aligned memory, and some that are faster with it, but that is up to their specification or their respective implementation. Details such as the number of cycles a specific operation needs change between processor generations!

So, the SO answer is correct. Trust your compiler. If you want aligned memory yourself (e.g. for arrays of floats that you want your compiler to use SIMD instructions on), there are functions like posix_memalign (under Unix-like systems, of course, but I could imagine the POSIX layer in Windows NT and later implements that, too) that can give you nicely aligned memory.

Marcus Müller
  • Note: it's possible to load a SIMD register from both an aligned and an unaligned address. If you say it's aligned and it isn't, the CPU will generate a hardware exception. Loading from an unaligned address is a little slower (but still fast). Anyway, the compiler knows this; this would be useful if you were handling that on your own (intrinsics, asm, etc.). – edmz Jan 10 '15 at 19:39
  • Note (especially @black): Have a look at [VOLK](https://github.com/gnuradio/gnuradio/tree/master/volk/kernels/volk) (the vector optimized library of kernels, which implements several basic signal processing algorithms both "generic" (i.e. only letting the compiler do its magic) and in SSE/SSE2/NEON/whatever versions); that might be interesting when trying to understand the differences between aligned and unaligned access. – Marcus Müller Jan 11 '15 at 15:20
  • @edmz: Since Nehalem and Bulldozer (over a decade ago), `movups` runs at full speed if the address happens to be aligned at runtime. (And in many cases, as long as it doesn't split across a cache-line boundary even if it is unaligned.) The downsides for SIMD on unaligned data on modern x86 are that legacy SSE instructions can't use a memory source operand like `addps xmm0, [rdi]`; they need a separate movups, unless you're using AVX VEX encodings (`vaddps xmm1, xmm0, [rdi]`). And that there will be cache-line splits and even 4k page splits if you're reading multiple contiguous vectors. – Peter Cordes Nov 30 '21 at 09:51
  • Related: [How can I accurately benchmark unaligned access speed on x86\_64?](https://stackoverflow.com/a/45129784) – Peter Cordes Nov 30 '21 at 10:00

In general, memory alignment issues are mostly hidden by the memory units - the execution units receive the data properly rotated and with the right size (the same question applies to integer types as well).

Alignment therefore mostly relates to the ability to cache this data without fear of having to fetch it in pieces (split fetches), a tricky business which raises all sorts of coherency and atomicity problems.

This of course may change if some architecture wants to save on the rotation logic and forces you to align some of your data accordingly, but in general this is a simpler problem to solve, so restricting the architecture for this hardware consideration is a little pointless (at least these days).

Leeor