
This article says:

Note that a double variable will be allocated on 8 byte boundary on 32 bit machine and requires two memory read cycles.

So this means that an x86 CPU has some instruction(s) that read a double value (I thought that the most an x86 CPU can read at once is 4 bytes!). Can anybody provide an example of an instruction that reads a double value?

  • [_FLD_](http://www.felixcloutier.com/x86/FLD.html) instruction with a QWORD/REAL8 memory operand would do it. If you have a processor with SSE [_MOVSD_](http://www.felixcloutier.com/x86/MOVSD.html) would be able to load a double. – Michael Petch Aug 26 '17 at 01:03
  • You should not get your information from bad articles like that. I guess *some* of it is true enough, but how could you tell which part that is? – harold Aug 26 '17 at 03:18
  • We had a reference to the same article not long ago. A lot of what it "explains" is just not true. And details like `printf("%d", sizeof(x));` is in error, as "%d" doesn't match the return type of `sizeof`. Not a good resource to learn from! – Bo Persson Aug 26 '17 at 11:07
  • @BoPersson: Those questions are deleted now. I think it was this user that posted those questions. Just in case anyone ever links it again, I've debunked it fairly thoroughly. As @harold says, this much crap in the parts I talk about is a bad sign for the other parts! – Peter Cordes Aug 26 '17 at 22:55

1 Answer


Can anybody provide an example of an instruction that reads a double value?

MOVSD xmm0, [rsi] loads 8 bytes and zeros the upper half of xmm0.
For legacy x87, there's fld qword [rsi]. And of course you can use a memory operand to an ALU instruction, like addsd xmm0, [rsi].
Or with AVX, there's stuff like vbroadcastsd ymm0, [rsi], or vaddsd xmm1, xmm0, [rsi].

All of these decode to a single uop on all modern x86 CPUs, and do a single access to cache.
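For example, a plain C load of a double compiles to exactly one such instruction. A minimal sketch (the asm in the comment is what gcc/clang -O2 typically emit for x86-64 System V, not something this code guarantees):

double load_double(const double *p) {
    return *p;              /* typically: movsd xmm0, qword ptr [rdi] ; ret */
}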

I thought that the max value an x86 CPU can read is 4 bytes!

Huh? 8-byte x87 loads have been supported since the 8087. And in 64-bit mode, mov rax, [rdi] or pop rax are both 8-byte loads.

With AVX, you can do vmovups ymm0, [rsi + rdx] (even in 32-bit mode) to do a 32-byte load. Or with AVX512, vmovups zmm0 for a 64-byte load or store.
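In C you would normally get those wide loads through intrinsics. A minimal sketch, assuming a compiler with AVX enabled (e.g. -mavx); the comments name the instructions compilers typically pick, but instruction selection is up to the compiler:

#include <immintrin.h>

/* 32-byte unaligned load of 4 doubles: typically vmovups / vmovupd ymm, [mem]. */
__m256d load4(const double *p) {
    return _mm256_loadu_pd(p);
}

/* Broadcast one double to all 4 lanes: typically vbroadcastsd ymm, [mem]. */
__m256d splat(const double *p) {
    return _mm256_broadcast_sd(p);
}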

Cache lines are 64 bytes on modern CPUs, so memory is (logically) copied around in 64-byte chunks inside the CPU and between cores. Intel CPUs use a 32-byte bus between cores (and between L2 and L3 on the way to memory).


See the x86 tag wiki for lots of good links to manuals and (accurate) documents / guides / articles. If something on a site like TutorialsPoint or GeeksForGeeks looks confusing or doesn't match what you're reading elsewhere, there's a good chance it's just wrong. Without a voting mechanism like SO has, inaccurate content doesn't get weeded out.

Let's just set the record straight about http://www.geeksforgeeks.org/structure-member-alignment-padding-and-data-packing/. It is full of wrong information, and most of what it says about hardware might have been approximately right in 1995, but isn't now. A lot of the logic / reasoning was wrong even then.

It doesn't even contain the word "cache"! Talking about access to RAM and "banks" is total nonsense for modern x86 CPUs. The same concept sort of applies for a banked cache, so alignment boundaries like 16B or 32B matter for some CPUs even though cache lines are 64B on all x86 CPUs (after Pentium III or so).

Modern x86 has very good unaligned access support in general, especially for 8B and narrower, but there can be a penalty for crossing a 16B or especially 32B boundary on AMD CPUs including Ryzen.

on latest processors we are getting size of struct_c as 16 bytes. [...]

On older processors (AMD Athlon X2) using same set of tools (GCC 4.7) I got struct_c size as 24 bytes. The size depends on how memory banking organized at the hardware level.

This is obviously nonsense. struct layout has to be the same for all compilers targeting the same ABI, regardless of what -march=pentium3 or -mtune=znver1 setting is used, or what hardware you compile on, so you can link with a library that passes (pointers to) struct types from your code into library functions or vice versa. An obvious example being the stat(const char *pathname, struct stat *statbuf) system call, where you pass a pointer and the kernel writes fields in the struct. If your code didn't agree with the kernel about which bytes in memory represented which C struct members, your code wouldn't work. Specifying the layout / alignment rules (and the calling convention) is a major part of what an ABI is.

It's very likely that the "newer" test was targeting the 32-bit i386 System V psABI, while the "older" test was compiling 64-bit code for the x86-64 System V psABI (or Windows 32 or 64 bit, which both have 24 byte structc with MSVC CL19).

typedef struct structc_tag {
   char        c;
   double      d;
   int         s;
} structc_t;
int sc = sizeof(structc_t);

#include <stddef.h>     /* offsetof */
#include <stdalign.h>   /* alignof (C11) */
int alignof_double = alignof(double);
int c_offset_d = offsetof(structc_t, d);

Compiler output for clang -m32 (Godbolt compiler explorer):

alignof_double:
    .long   8
c_offset_d:
    .long   4

So the 32-bit ABI will misalign double inside a struct, even though it prefers to align double to 8 bytes elsewhere, but the 64-bit ABI won't. The i386 System V ABI dates back a long time, perhaps to actual 386 or 486 CPUs that really might have taken two memory-read cycles to load a double. The packing rule of only respecting alignment boundaries up to 4B makes sense for old CPUs, or for integer types in 32-bit mode. A newly-designed 32-bit ABI would probably require that double be aligned, and maybe also int64_t (for use with MMX / SSE2). But breaking ABI compatibility to align 64-bit types inside structs would not be worth it.
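A quick way to see both layouts is just to print them. A minimal standalone sketch; the expected numbers in the comment are for the i386 and x86-64 System V ABIs, not something the code itself enforces:

#include <stdio.h>
#include <stddef.h>

typedef struct { char c; double d; int s; } structc_t;

int main(void) {
    /* i386 SysV (-m32):  d at 4 (only 4-byte aligned), s at 12, sizeof == 16
       x86-64 SysV:       d at 8 (7 pad bytes after c), s at 16, sizeof == 24 */
    printf("d at %zu, s at %zu, sizeof = %zu\n",
           offsetof(structc_t, d), offsetof(structc_t, s), sizeof(structc_t));
    return 0;
}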

See the x86 tag wiki for ABI docs.

Note that std::atomic<double> does get full 8B alignment even in -m32.

Note that a double variable will be allocated on 8 byte boundary on 32 bit machine and requires two memory read cycles.

A qword load or store (e.g. fld or fstp) to a 64-bit aligned address is guaranteed to be atomic (since P5 Pentium), so it's definitely a single access to L1D cache (or to RAM for uncached access). See Why is integer assignment on a naturally aligned variable atomic on x86?.

This guarantee holds for x86 in general (including AMD and other vendors). In fact, gcc -m32 implements std::atomic<int64_t> with SSE2 movq or x87 fild loads / stores.
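A minimal C11 sketch of the same thing (the C++ std::atomic case is analogous): with gcc -m32 -O2 I'd expect this to become a single SSE2 movq or x87 fild load rather than two 4-byte loads, though that choice belongs to the compiler, not to this code:

#include <stdatomic.h>
#include <stdint.h>

int64_t load64(_Atomic int64_t *p) {
    /* 8-byte atomic load in 32-bit code: must be one access, not two dword loads. */
    return atomic_load_explicit(p, memory_order_relaxed);
}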

Wider and/or misaligned loads/stores are not guaranteed to be a single access, but they are on some CPUs. e.g. for misaligned data that doesn't cross a 64B cache-line boundary, Intel Haswell/Skylake can do two 32B unaligned vector loads per cycle, each as a single read from L1D cache. If a load does cross a cache-line boundary (like vmovups ymm0, [rdi+33] where rdi is 64B-aligned), throughput is limited to one per cycle because each such load has to read and merge data from two cache lines.

The hardware support for unaligned loads is extremely good, so it just costs some extra load-use latency. 4k-splits are more expensive though, especially before Skylake.
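As a concrete sketch of a line-splitting load (assumes AVX and a hypothetical buffer that starts at a 64-byte boundary; casting to a misaligned double* is the usual idiom with the unaligned-load intrinsics):

#include <immintrin.h>

/* A 32-byte load starting 33 bytes into a 64B-aligned buffer, i.e. spanning
   two cache lines, like the vmovups ymm0, [rdi+33] example above. */
__m256d load_line_split(const char *buf_64B_aligned) {
    return _mm256_loadu_pd((const double *)(buf_64B_aligned + 33));
}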

It is important to note that most of the processors will have math co-processor, called Floating Point Unit (FPU). Any floating point operation in the code will be translated into FPU instructions. The main processor is nothing to do with floating point execution.

This (and the hand-waving done after) is totally bogus. The FPU has been integrated into the main CPU core since 486DX. P6 (Pentium Pro) even added an x87 instruction that sets integer EFLAGS directly (fcomi), and fcmovcc that reads integer EFLAGS. FP and integer loads/stores even use the same execution ports in modern Intel CPUs.

One exception to this is AMD Bulldozer-family, where a pair of integer cores share an FP/vector unit. But they're still pretty tightly coupled, and FP loads still use the same dTLB and L1D cache.

According to David Kanter's Bulldozer writeup: "there is a small floating point load buffer (not shown above) which acts as an analogous conduit for loads between the load-store units and the FP cluster." (i.e. for store-forwarding.)

Even Bulldozer still shares a single out-of-order ReOrder Buffer (ROB) between integer and FP/vector uops, and integer / FP instructions have to retire in program order (as always to support precise exceptions). Other AMD designs have separate schedulers, too, but that's a minor thing.

Intel CPUs use a single unified out-of-order scheduler for integer and FP, and execution ports have a mix of integer and FP ALUs. For example, Haswell port 0 can run integer shift and simple ALU uops, and it also has a vector multiply / FMA unit.

Unlike PowerPC, store-forwarding from an FP store to an integer load works fine. (On PPC, Load-Hit-Store stalls are a problem apparently. On x86, it just works without much more problem than regular store-forwarding. On Bulldozer it's slow-ish, and so is ALU movd r32, xmm, because of coordinating 2 cores talking to one FPU.)
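The classic case where this matters is type-punning a double to its bit pattern: an FP store immediately reloaded by an integer load. A minimal sketch; compilers typically turn the memcpy into a movsd store plus an integer reload (or a movq xmm-to-GP register move), but that's the compiler's choice:

#include <stdint.h>
#include <string.h>

uint64_t double_bits(double d) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof(bits));   /* FP store, integer reload: the store-forwarding case */
    return bits;
}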

As per standard, double type will occupy 8 bytes. And, every floating point operation performed in FPU will be of 64 bit length. Even float types will be promoted to 64 bit prior to execution.

Also bogus. With x87, the internal registers are 80-bit (64-bit mantissa!). This description is sort of right for x87, unless you set the x87 precision control register to 53-bit mantissa or 24-bit mantissa. (See Bruce Dawson's excellent series of floating-point articles. this one about Intermediate Float Precision mentions that on Windows, the D3D9 library sets the x87 FPU to 24-bit precision just so divide and sqrt will be somewhat faster, and that old versions of MSVCRT set it to 53-bit double!)
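For reference, flipping the x87 precision control from software looks roughly like this sketch (glibc on x86 only, using the <fpu_control.h> macros; MSVC code would use _controlfp instead):

#include <fpu_control.h>

/* Drop the x87 precision-control field from 64-bit to 53-bit mantissa,
   like old MSVCRT versions did. This only affects x87 math, not SSE/SSE2. */
void set_x87_double_precision(void) {
    fpu_control_t cw;
    _FPU_GETCW(cw);
    cw = (cw & ~_FPU_EXTENDED) | _FPU_DOUBLE;   /* clear the PC bits, select 53-bit mode */
    _FPU_SETCW(cw);
}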

But since this article is talking about 64-bit machines, it's a bad mistake to ignore the fact that both x86-64 Windows and Linux pass / return FP args in xmm registers, and it's assumed that FP math will be done with SSE/SSE2 scalar or vector instructions, not x87. SSE2 instructions like mulsd generate an IEEE binary64 result in an xmm register, so they round to 53-bit mantissa precision after every step. (And if you want faster division, you can just use divps instead of divpd. SSE doesn't have a precision-control register; you just use different instructions.)

Passing a float to a variadic function like printf will promote it to double according to C's default promotion rules, but float a = f1 * f2; doesn't have to promote to double and then round the result down to float.
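In code, that distinction looks like this sketch: the multiply can stay single precision (a mulss on x86-64), and only the variadic call promotes to double:

#include <stdio.h>

float mul_and_print(float f1, float f2) {
    float a = f1 * f2;   /* no promotion required: a single-precision multiply is allowed */
    printf("%f\n", a);   /* default argument promotion: a is converted to double here */
    return a;
}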

The 64 bit length of FPU registers forces double type to be allocated on 8 byte boundary. [...]

Hence, the address decoding will be different for double types (which is expected to be on 8 byte boundary). It means, the address decoding circuits of floating point unit will not have last 3 pins.

Total nonsense. Unaligned qword double loads / stores are supported with x87 (fld) and SSE2 (movsd), and have been since the 8087 for fld.

Where as few processors will not have last two address lines, which means there is no-way to access odd byte boundary.

A CPU designed that way could just do 32-bit loads over the bus and extract the required bytes. This kind of argument is why it's so dumb the article doesn't mention cache.

Fun fact, though: old versions of ARM used the low 2 bits of an address as a byte rotate. So loading from 0xabc001 would get you the 4 bytes at 0xabc000 with a rotate applied. I've heard this was fun to debug compared to hardware that just faulted on unaligned loads :P

Early Alpha CPUs really did have no byte-load support, so you always had to do a 32 or 64-bit load and mask and/or shift to get the byte(s) you wanted.
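For the curious, synthesizing a byte load from an aligned 32-bit load looks roughly like this sketch (little-endian assumed; the uint32_t* dereference isn't strictly-conforming C, it's just illustrating the shift/mask sequence early Alpha code had to use):

#include <stdint.h>

uint8_t load_byte(const uint8_t *p) {
    uintptr_t addr = (uintptr_t)p;
    const uint32_t *word = (const uint32_t *)(addr & ~(uintptr_t)3);  /* aligned dword */
    unsigned shift = (unsigned)(addr & 3) * 8;                        /* byte position in that dword */
    return (uint8_t)(*word >> shift);
}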

I could say more about other things this article gets wrong or wrongly implies...


I'm sure there are more problems that I'd see if I read carefully, but that's what I got from just skimming. The author of this article read some stuff about hardware 20 years ago, and has cooked up some wrong ideas based on it.

  • This deleted question might be the one you were thinking of? https://stackoverflow.com/questions/45804925/confused-about-data-alignment-for-short-and-int-variables – Michael Petch Aug 26 '17 at 23:00
  • @MichaelPetch: yes, thanks. So it was this same user who still didn't get the message that the article is garbage. I thought I made that clear in comments before :/ Hopefully this answer clears things up. – Peter Cordes Aug 26 '17 at 23:04
  • GeeksforGeeks isn't the only one filled with bad information. TutorialsPoint has a lot of bad material as well (and is often cited in SO questions). – Michael Petch Aug 26 '17 at 23:41
  • "The FPU has been integrated into the main CPU core since Pentium." Technically it has been integrated since the 486. The separate 487 was just a full 486 (a CPU including the FPU) that took over from the other 486 (with the FPU disabled). – ecm Oct 20 '20 at 16:12
  • @ecm: Thanks. I think "since 486DX" is accurate. – Peter Cordes Oct 20 '20 at 22:15
  • Good explanation supported by actual IRL examples. But I wonder, in a hypothetical scenario where the processor is unnamed, and we are making thing up, is it so unreasonable to assume the cache only has 32-bit data lines on a 32-bit processor? – Everyone Apr 27 '22 at 04:12
  • @Everyone: It's unreasonable to *assume* that, but it's also a possibility that can't be ruled out so you can't assume 8-byte cache access either. I'd assume there are some real (at least non-x86) 32-bit CPUs where load/store of an aligned 64-bit `double` takes two 4-byte accesses to cache. IDK if that might have been the case on 486DX; I think it had some cache, but 64-bit load/store atomicity guarantees were new in P5 Pentium. Outside x86, sure, could be true in some low-end 32-bit CPUs, like some ARMs with a hardware FPU but without cache access that can work in 8-byte chunks. – Peter Cordes Apr 27 '22 at 04:32
  • @PeterCordes the reason I ask is because I was looking at some academic questions and seen one that made you assume a cache is empty and the only thing put in it is an array, while the loop iterators somehow remain outside, which is a very unrealistic hypothetical in the practical world. It also assumed a 4-entry TLB, also very unrealistic. It never specified the cache dataline width. Under such hypothetical with many extremely unreasonable assumptions given, assuming a 32-bit wide cache data line seems the most reasonable. But I get your point. – Everyone Apr 27 '22 at 05:19
  • @Everyone: Most loops over arrays keep pointers in registers, so that sounds totally normal for real-world code and real-world (optimizing) compilers, or hand-written asm. e.g. https://godbolt.org/z/Eq5nrfT1s. Computer architecture practice problems often simplify to tiny caches, often with tiny cache *lines* (like 1 word each), especially for simulating by hand with a trace of accesses. The range of what you could build in theory / on paper is much wider than the range of things that are *commercially* viable and worth actually doing. – Peter Cordes Apr 27 '22 at 06:53
  • @PeterCordes well the question asked you to assume there is nothing at all in a 4KB sized cache, and the only thing it will be filled with is the array. That is, there are no other programs and nothing else but the array to be put in the cache. – Everyone Apr 27 '22 at 09:55
  • @Everyone: These simplifying assumptions sound totally fine to me. Unless you're getting into details of pseudo-LRU, there's no difference between invalid and valid but caching data you don't care about. For the purposes of the workload you're analyzing, the initial access is a mandatory miss. It's realistic if the CPU just woke up from a deep sleep-state (where caches power off); all lines invalid. Or first access to the array after page fault + DMA loaded it from disk, like a freshly loaded process). Or after coming back to this array after touching lots of other memory in the meantime. – Peter Cordes Apr 27 '22 at 10:01
  • @PeterCordes I'm not criticizing the assumptions, I'm just stating that assuming a 32-bit data line for the cache is reasonable in such hypothetical scenarios where it isn't specified. And I think the question should not ignore that detail when it asks about cache hit-rate; knowing the cache data line width is crucial for an accurate measurement. If the double needs 2 accesses it is a wholly different story than a single access. – Everyone Apr 27 '22 at 10:05
  • @Everyone: A question that requires you to make assumptions like that is not a good question. Line-size is a pretty crucial design choice for a cache. Of course, IDK why you'd ever assume the cache line size is only 4 bytes, instead of a more typical 32B, if it wasn't given. Two back-to-back accesses to the *same* cache line for an aligned 8-byte double is a totally different thing from two separate misses to separate cache lines. Back-to-back accesses within the same line always have perfect spatial and temporal locality (barring out-of-order exec or unlucky timing of another core's invd) – Peter Cordes Apr 27 '22 at 10:15
  • @Everyone: the width of a load/store execution unit is basically independent of cache line size, except as an upper bound. Are you conflating those? (Current x86 CPUs with AVX-512 can load or store a whole line with a single instruction, but that's a very recent development, and not much code uses 64-byte vectors.) In x86, cache line widths increased from 32 bytes to 64 bytes around the Core2 era, when the max SIMD load/store width was 16 bytes. Before now, the widest execution unit has always been narrower than a cache line by a factor of 4 until AVX1 made it a factor of 2. – Peter Cordes Apr 27 '22 at 10:19
  • @PeterCordes hmm, I might be a bit confused, but cache line size is not what I meant. I meant to talk about processor access over data in the cache, that resides somewhere in the cache line. The first cache miss will ensure the entire line is loaded into the cache, making the locality beneficial for the remaining bytes of the same line. Let's say there are 64-bytes in the cache line, that's 8 doubles. How many accesses does each double need? This is the information missing from the question, while the answer forces the assumption of 1. Does this make sense? – Everyone Apr 27 '22 at 10:27
  • @Everyone: Oh, so even when you said "data line width", you were talking about data path width. "Access width" is probably a good term for that. Definitely want to avoid one that includes (signal) "line", since "line width" already has a specific technical meaning, and as you saw, people will misunderstand you. Even after reading somewhat carefully, I thought you'd started out talking about access width and then switched to talking about line width, even after stopping to think about it and double-check your wording. – Peter Cordes Apr 27 '22 at 10:32
  • @Everyone: As I said before, two back-to-back accesses to the same cache line are a special case because they essentially *always* hit. You shouldn't count that as two accesses that both hit. Unless you're simulating in more detail and have limited store-buffer and load-buffer entries. (Even then, in a normal implementation, on a load miss, both halves of the load can track the same off-core request for that cache line, not request it separately, if you allow miss-under-miss memory-level parallelism instead of just stalling all later accesses. Just like real CPUs do for separate load insns) – Peter Cordes Apr 27 '22 at 10:35
  • @PeterCordes after reading the terms again I get why the confusion would arise, yes path is what I meant instead of line when I was referring to the width aspect. My bad. But at the end, the cache hit-rate will be dependent on whether or not it can load a full double at once, or if it needed two different accesses, no? That is, if the cache block can fit 8 doubles simultaneously, and a double takes a single load, that means 12.5% of the loads were misses (1/8). If the double needed two accesses, would that mean the hit miss rate is halved? It is a separate access, no? – Everyone Apr 27 '22 at 10:38
  • @Everyone: No, I already explained the reasons why it's normal to consider the two halves of a wide access as a single access, for the purposes of calculating cache hit rates. It's not an interesting extra access, because it's *always* to a line that was just loaded. Trivial problems like iterating sequentially over an array without HW prefetching just exist to teach about spatial locality; most real CPUs have some memory-level parallelism. Hit rate in terms of the access width is never the relevant metric, even for sequential accesses on a simplistic CPU without HW prefetching. – Peter Cordes Apr 27 '22 at 11:00
  • @Everyone: If you did do the math that way, and halved your miss rate because you're doing a guaranteed hit as the 2nd of two accesses for every `double`, you'd then have to scale by 2 to cancel it out if you ever want to calculate anything about how your code performs, especially if iterating sequentially. Because your code needs two accesses for every `double`, and you'd want to calculate stuff about how fast your code can access doubles, not halves of doubles. If a problem doesn't mention this, you 100% should assume that each access it talks about is (or counts as) a single access. – Peter Cordes Apr 27 '22 at 11:04