Can anybody provide an example of an instruction that reads a double value?
MOVSD xmm0, [rsi] loads 8 bytes and zeros the upper half of xmm0.
For legacy x87, there's fld qword [rsi]. And of course you can use a memory operand to an ALU instruction, like addsd xmm0, [rsi].
Or with AVX, there's stuff like vbroadcastsd ymm0, [rsi], or vaddsd xmm1, xmm0, [rsi].
All of these decode to a single uop on all modern x86 CPUs, and do a single access to cache.
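As a sketch of what that looks like from C (the function names here are just made up for illustration), a trivial read of a double normally compiles to exactly one of those load instructions:

    // Sketch (not from the question): compilers normally turn each of these
    // into one 8-byte load instruction, not multiple smaller loads.
    double deref(const double *p) {
        return *p;            // typically: movsd xmm0, [rdi]  (x86-64 SysV)
    }

    double add_from_mem(double x, const double *p) {
        return x + *p;        // typically: addsd xmm0, [rdi]
    }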
I thought that the max value an x86 CPU can read is 4 bytes!
Huh? 8-byte x87 loads have been supported since 8086 / 8087. And in 64-bit mode, mov rax, [rdi] or pop rax are both 8-byte loads.
With AVX, you can do vmovups ymm0, [rsi + rdx] (even in 32-bit mode) to do a 32-byte load. Or with AVX-512, vmovups zmm0, [rsi] for a 64-byte load or store.
Cache lines are 64 bytes on modern CPUs, so memory is (logically) copied around in 64-byte chunks inside the CPU and between cores. Intel CPUs use a 32-byte bus between cores (and between L2 and L3 on the way to memory).
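If you want to keep some data within one cache line from C, C11's _Alignas can request 64-byte alignment. A minimal sketch, assuming the 64-byte line size mentioned above:

    #include <stdalign.h>

    // Sketch, assuming 64-byte cache lines: 8 doubles fill exactly one line,
    // so aligning the array to 64 keeps any in-bounds load within one line.
    _Alignas(64) static double line_of_doubles[8];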
See the x86 tag wiki for lots of good links to manuals and (accurate) documents / guides / articles. If something on a site like TutorialsPoint or GeeksForGeeks looks confusing or doesn't match what you're reading elsewhere, there's a good chance it's just wrong. Without a voting mechanism like SO has, the inaccurate content doesn't get weeded out.
Let's just set the record straight about http://www.geeksforgeeks.org/structure-member-alignment-padding-and-data-packing/. It is full of wrong information; most of what it says about hardware might have been approximately right in 1995, but isn't now. A lot of the logic / reasoning was wrong even then.
It doesn't even contain the word "cache"! Talking about access to RAM and "banks" is total nonsense for modern x86 CPUs. The same concept sort of applies for a banked cache, so alignment boundaries like 16B or 32B matter for some CPUs even though cache lines are 64B on all x86 CPUs (after Pentium III or so).
Modern x86 has very good unaligned access support in general, especially for 8B and narrower, but there can be a penalty for crossing a 16B or especially 32B boundary on AMD CPUs including Ryzen.
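From C, the usual portable way to express a possibly-unaligned 8-byte load and let the compiler pick an efficient unaligned load instruction is memcpy. A minimal sketch (the helper name is made up):

    #include <string.h>

    // Hypothetical helper: load a double from a possibly-misaligned address.
    // Compilers typically inline the memcpy into a single unaligned 8-byte
    // load (e.g. movsd), relying on x86's good unaligned-access support.
    static inline double load_double_unaligned(const void *p) {
        double d;
        memcpy(&d, p, sizeof d);
        return d;
    }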
on latest processors we are getting size of struct_c as 16 bytes. [...] On older processors (AMD Athlon X2) using same set of tools (GCC 4.7) I got struct_c size as 24 bytes. The size depends on how memory banking organized at the hardware level.
This is obviously nonsense. struct layout has to be the same for all compilers targeting the same ABI, regardless of what -march=pentium3 or -mtune=znver1 setting is used, or what hardware you compile on, so you can link against a library that passes (pointers to) struct types between your code and library functions. An obvious example is the stat(const char *pathname, struct stat *statbuf) system call, where you pass a pointer and the kernel writes fields in the struct. If your code didn't agree with the kernel about which bytes in memory represented which C struct members, your code wouldn't work. Specifying the layout / alignment rules (and the calling convention) is a major part of what an ABI is.
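For instance, a minimal sketch of that agreement in action, using the ordinary stat() wrapper and an arbitrary example path:

    #include <stdio.h>
    #include <sys/stat.h>

    int main(void) {
        struct stat st;
        // The kernel / libc fill in st's fields at the byte offsets the ABI
        // specifies; if your compiler laid the struct out differently,
        // st.st_size would read the wrong bytes.
        if (stat("/etc/hostname", &st) == 0)      // arbitrary example path
            printf("size = %lld bytes\n", (long long)st.st_size);
        return 0;
    }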
It's very likely that the "newer" test was targeting the 32-bit i386 System V psABI, while the "older" test was compiling 64-bit code for the x86-64 System V psABI (or for Windows, 32- or 64-bit, both of which give a 24-byte structc_t with MSVC CL19).
    #include <stddef.h>     /* offsetof */
    #include <stdalign.h>   /* alignof (C11) */

    typedef struct structc_tag {
        char c;
        double d;
        int s;
    } structc_t;

    int sc = sizeof(structc_t);
    int alignof_double = alignof(double);
    int c_offset_d = offsetof(structc_t, d);
Compiler output for clang -m32 (Godbolt compiler explorer):
    alignof_double:
            .long   8
    c_offset_d:
            .long   4
So the 32-bit ABI will misalign double inside a struct, even though it prefers to align double to 8 bytes elsewhere, but the 64-bit ABI won't. The i386 System V ABI dates back a long time, maybe to actual 386 or 486 CPUs that really did take two memory-read cycles to load a double. The packing rule of only respecting alignment boundaries up to 4B makes sense for old CPUs, or for purely integer code in 32-bit mode. A newly-designed 32-bit ABI would probably require that double be aligned, and maybe also int64_t (for use with MMX / SSE2). But breaking ABI compatibility just to align 64-bit types inside structs would not be worth it.
See the x86 tag wiki for ABI docs.
Note that std::atomic<double> does get full 8-byte alignment, even with -m32.
Note that a double variable will be allocated on 8 byte boundary on 32 bit machine and requires two memory read cycles.
A qword load or store (e.g. fld or fstp) to a 64-bit-aligned address is guaranteed to be atomic (since P5 Pentium), so it's definitely a single access to L1D cache (or to RAM for uncached access). See Why is integer assignment on a naturally aligned variable atomic on x86?. This guarantee holds for x86 in general, including AMD and other vendors.
In fact, gcc -m32 implements std::atomic<int64_t> with SSE2 movq or x87 fild loads / stores.
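In C11 terms that looks something like this (a sketch; whether the compiler picks movq or fild/fistp is up to it):

    #include <stdatomic.h>
    #include <stdint.h>

    _Atomic int64_t shared_val;

    int64_t read_val(void) {
        // Aligned 8-byte atomic load: lock-free on x86, and with gcc -m32 it
        // typically compiles to an SSE2 movq or an x87 fild/fistp sequence.
        return atomic_load_explicit(&shared_val, memory_order_relaxed);
    }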
Wider and/or misaligned loads / stores are not guaranteed to be a single access, but they are on some CPUs. For example, for misaligned data that doesn't cross a 64B cache-line boundary, Intel Haswell/Skylake can do two 32B unaligned vector loads per cycle, each as a single read from L1D cache. If a load does cross a cache-line boundary (like vmovups ymm0, [rdi+33] where rdi is 64B-aligned), throughput is limited to one per cycle, because each such load has to read and merge data from two cache lines.
The hardware support for unaligned loads is extremely good, so it just costs some extra load-use latency. 4k-splits are more expensive though, especially before Skylake.
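With intrinsics, the unaligned 32-byte load case looks something like this sketch (the function name is made up; assumes AVX is enabled, e.g. with -mavx):

    #include <immintrin.h>

    // Sketch: sum 4 doubles starting at a possibly-misaligned pointer.
    // _mm256_loadu_pd becomes vmovupd, which is a single 32-byte access
    // when it doesn't split across a cache-line boundary.
    double sum4(const double *p) {
        __m256d v  = _mm256_loadu_pd(p);           // unaligned 32-byte load
        __m128d lo = _mm256_castpd256_pd128(v);    // elements 0-1
        __m128d hi = _mm256_extractf128_pd(v, 1);  // elements 2-3
        __m128d s  = _mm_add_pd(lo, hi);
        s = _mm_add_sd(s, _mm_unpackhi_pd(s, s));  // horizontal add of the pair
        return _mm_cvtsd_f64(s);
    }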
It is important to note that most of the processors will have math co-processor, called Floating Point Unit (FPU). Any floating point operation in the code will be translated into FPU instructions. The main processor is nothing to do with floating point execution.
This (and the hand-waving that follows it) is totally bogus. The FPU has been integrated into the main CPU core since the 486DX. P6 (Pentium Pro) even added an x87 instruction that sets integer EFLAGS directly (fcomi), and fcmovcc, which reads integer EFLAGS. FP and integer loads / stores even use the same execution ports in modern Intel CPUs.
One exception to this is AMD Bulldozer-family, where a pair of integer cores share an FP/vector unit. But they're still pretty tightly coupled, and FP loads still use the same dTLB and L1D cache.
According to David Kanter's Bulldozer writeup: "there is a small floating point load buffer (not shown above) which acts as an analogous conduit for loads between the load-store units and the FP cluster" (i.e. for store-forwarding).
Even Bulldozer still shares a single out-of-order ReOrder Buffer (ROB) between integer and FP/vector uops, and integer / FP instructions have to retire in program order (as always to support precise exceptions). Other AMD designs have separate schedulers, too, but that's a minor thing.
Intel CPUs use a single unified out-of-order scheduler for integer and FP, and execution ports have a mix of integer and FP ALUs. For example, Haswell port 0 can run integer shift and simple ALU uops, and it also has a vector multiply / FMA unit.
Unlike on PowerPC, store-forwarding from an FP store to an integer load works fine. (On PPC, load-hit-store stalls are apparently a real problem. On x86, it just works, with not much more penalty than regular store-forwarding. On Bulldozer it's slow-ish, and so is ALU movd r32, xmm, because of the cost of coordinating two cores talking to one FPU.)
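A common case where an FP store is immediately reloaded by integer code is inspecting the bit pattern of a double; a sketch:

    #include <stdint.h>
    #include <string.h>

    // Sketch: get the bit pattern of a double. If this goes through memory,
    // it's an FP (xmm) store reloaded by an integer load, i.e. exactly the
    // store-forwarding case above. (Compilers often turn it into a
    // register-to-register movq instead.)
    static inline uint64_t double_bits(double d) {
        uint64_t u;
        memcpy(&u, &d, sizeof u);
        return u;
    }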
As per standard, double type will occupy 8 bytes. And, every floating point operation performed in FPU will be of 64 bit length. Even float types will be promoted to 64 bit prior to execution.
Also bogus. With x87, the internal registers are 80-bit (with a 64-bit mantissa!). This description is sort of right for x87, unless you set the x87 precision-control register to a 53-bit or 24-bit mantissa. (See Bruce Dawson's excellent series of floating-point articles. This one about Intermediate Float Precision mentions that on Windows, the D3D9 library sets the x87 FPU to 24-bit precision just so divide and sqrt will be somewhat faster, and that old versions of MSVCRT set it to 53-bit double!)
But since this article is talking about 64-bit machines, it's a bad mistake to ignore the fact that both x86-64 Windows and Linux pass / return FP args in xmm registers, and it's assumed that FP math will be done with SSE/SSE2 scalar or vector instructions, not x87. SSE2 instructions like mulsd generate an IEEE binary64 result in an xmm register, so they round to 53-bit mantissa precision after every step. (And if you want faster division, you can just use divps instead of divpd. SSE doesn't have a precision-control register; you just use different instructions.)
Passing a float to a variadic function like printf will promote it to double according to C's default argument promotions, but float a = f1 * f2; doesn't have to promote to double and then round the result down to float.
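As a sketch (assuming SSE math, which is the default for x86-64):

    #include <stdio.h>

    // Sketch (x86-64, or 32-bit with -mfpmath=sse): the multiply is a single
    // mulss, rounded once at float precision; no round trip through double.
    void example(float f1, float f2) {
        float a = f1 * f2;
        printf("%f\n", a);   // only here is 'a' promoted to double, as a variadic argument
    }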
The 64 bit length of FPU registers forces double type to be allocated on 8 byte boundary. [...]
Hence, the address decoding will be different for double types (which is expected to be on 8 byte boundary). It means, the address decoding circuits of floating point unit will not have last 3 pins.
Total nonsense. Unaligned qword double loads / stores are supported by x87 (fld) and SSE2 (movsd), and have been since 8086 / 8087 in the case of fld.
Where as few processors will not have last two address lines, which means there is no-way to access odd byte boundary.
A CPU designed that way could just do 32-bit loads over the bus and extract the required bytes. This kind of argument is why it's so dumb the article doesn't mention cache.
Fun fact, though: old versions of ARM used the low 2 bits of an address as a byte rotate, so loading from 0xabc001 would get you the 4 bytes at 0xabc000 with a rotate applied. I've heard this was fun to debug, compared to hardware that just faults on unaligned loads :P
Early Alpha CPUs really did have no byte-load support, so you always had to do a 32 or 64-bit load and mask and/or shift to get the byte(s) you wanted.
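In C terms, byte access on such a machine boils down to something like this sketch (illustrative only, not real Alpha code):

    #include <stdint.h>

    // Sketch of the load-and-shift idea: emulate a byte load using only an
    // aligned 32-bit load, the way a CPU with no byte-load instruction
    // forces you to. Assumes little-endian byte numbering.
    static uint8_t load_byte(const uint8_t *p) {
        uintptr_t addr = (uintptr_t)p;
        const uint32_t *word = (const uint32_t *)(addr & ~(uintptr_t)3);  // containing aligned word
        unsigned shift = (unsigned)(addr & 3) * 8;                        // byte position within it
        return (uint8_t)(*word >> shift);
    }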
I could say more about things this article implies that are wrong... I'm sure there are more problems I'd spot if I read it carefully, but that's what I got from just skimming. The author of this article read some stuff about hardware 20 years ago, and has cooked up some wrong ideas based on it.