14

I have read that some CPUs reorder instructions, but this is not a problem for single threaded programs (the instructions would still be reordered in single threaded programs, but it would appear as if the instructions were executed in order), it is only a problem for multithreaded programs.

To solve the problem of instructions reordering, we can insert memory barriers in the appropriate places in the code.

But does an x86 CPU reorder instructions? If it does not, then there is no need to use memory barriers, right?

BeeOnRope
  • 60,350
  • 16
  • 207
  • 386
Steve
  • 705
  • 5
  • 13
  • 1
    modern x86 does not only reorder instructions, but translates them into micro-instructions. You need memory barriers when MT even when there's is no instruction reordering, if the writes to memory are not guaranteed to land in the original order, i.e. it depends not only on out-of-order execution of instructions, but also on the memory model, the memory model may be weak enough to reshuffle order of memory changes appearing to other cores. (IIRC the x86 has very "strong" memory model, resolving many of those complexities for programmer, but then x86 is reordering, so you still need barriers). – Ped7g May 12 '18 at 15:17
  • 5
    Memory reordering is independent of out-of-order execution. An in-order CPU starts instruction in-order, but they can still complete out of order, and stores are buffered. See http://preshing.com/20120515/memory-reordering-caught-in-the-act for when you need `mfence` on x86: only to prevent StoreLoad reordering; AFAIK you still need `mfence` there on in-order Atom or Pentium CPUs. (But all modern x86 CPUs have fully out-of-order execution.) – Peter Cordes May 12 '18 at 15:39
  • @peter I think saying they are fully independent is a bit of a stretch. Out of order memory access is _mostly_ driven by out of order instruction access in the first place. In a pure in-order CPU you may definitely also have in-order memory access. Sometimes you might find a store buffer on such a CPU which might result in x86-like redordering on in-order CPUs, but as soon as you move to out-of-order you definitely want out of order memory accesses as well or else actual ILP will be quite limited in cases where memory accesses have different latencies. – BeeOnRope May 12 '18 at 20:43
  • That fact (out of order execution driving generally the need for OoO memory access) is part of the drive behind weaker memory models in many of the recent chips. As a practical matter you want to allow out of order access at the hardware level: the "easy" approach is just to make your ISA memory model weak so you don't need to do anything special, while the "hard" approach (eg x86) is to use a strong model and then still do stuff all out of order but add the queues and other tracking mechanisms to determine when this visibly broke the rules and roll back. – BeeOnRope May 12 '18 at 20:46
  • @BeeOnRope: Yeah that's fair in general, but x86 only allows StoreLoad reordering, and that's the kind you get just from the existence of a store buffer, which you'll find even on early in-order classic RISC CPUs: see the last paragraph of https://en.wikipedia.org/wiki/Classic_RISC_pipeline#Exceptions. More advanced in-order CPUs can proceed past loads and only stall when the target register is read before the data is ready (cache miss or not). e.g. in-order ARM code can hide load latency by doing a block of loads before using the result of the first load, or in general software-pipelining. – Peter Cordes May 12 '18 at 20:50
  • Sure but I'm trying to debate the claim "Memory reordering is independent of out-of-order execution." Is that an x86 only claim? I didn't read it that way: if we're talking about x86 ISA guarantees it doesn't have much to do with out-of-order processing or store buffers at all, you're just left reading the guarantees about ordering pretty much. I interpreted it as a generated statement about CPU arch/uarch - that OoO and memory ordering are independent. Even in the x86-only case I kind of disagree but my objection there is more subtle. @peter – BeeOnRope May 12 '18 at 21:28
  • 1
    @BeeOnRope: You're right that it's a bit of an overstatement. I should have said that memory-reordering *can* happen without OoO exec. But really, checking whether a CPU does out-of-order exec is the wrong thing to ask for figuring out where / when you need memory barriers. x86's strong memory model means you don't need barriers in some cases, even with aggressive OoO exec, so again you need to know the memory model, not the exec model. – Peter Cordes May 12 '18 at 21:33
  • 2
    Yes, agree 100%. In fact I just realized the original version of my answer was wrong since it read as "Yes, x86 reorders instructions so yes you need memory barriers. ". That's wrong (the _so_ part) and I think what you are getting at above. I changed it so it's like more independent now :). I actually agree they are independent mostly at the ISA/documentation level, but heavily linked at the CPU design uarch level (but OoO reordering isn't the only reason for memory reordering as you point out). @peter – BeeOnRope May 12 '18 at 21:38
  • 1
    Now I want to use "independent" in my answer. There has to be a better word which means "is not implied by (or vice-versa), but may be correlated with...". – BeeOnRope May 12 '18 at 21:42
  • @BeeOnRope: I can't think of a single word that captures the meaning while making the correlation-doesn't-imply-causation part explicit. Maybe "associated with, but you can at least in theory have either without the other." – Peter Cordes May 12 '18 at 21:46
  • @PeterCordes Yeah, maybe I don't need one work anyways. I updated my answer to really drive the point home that the link between (internal) out-of-order execution and (visible) memory re-ordering is basically a false one. Does it make sense? – BeeOnRope May 12 '18 at 23:04
  • 1
    @PeterCordes Side note: The processor for Intel's Quark is in-order. While it is dead as a product, the processor appears to be used by Intel as a system management processor. –  May 18 '18 at 11:47
  • @PaulA.Clayton: hmm. IDK if Intel still sells any in-order Atom. Knight's Corner was also in-order, based on the P5 uarch, but KNL is the current-gen Xeon Phi. Basically, unless you're targeting your code at a specific in-order machine, these days it will be running on an OoO CPU, or an old netbook... Silvermont has in-order SIMD, with proper OoO only for integer instructions, so that's another oversimplification. It's close enough to true these days that I don't mind simplifying to "modern x86 is OoO", and just say that every x86 that's not OoO isn't modern. :P – Peter Cordes May 18 '18 at 20:39
  • @PeterCordes For x86 in-order probably only makes sense for a secondary processor (e.g., to handle I/O) where software compatibility would still have value (or possibly a highly threaded processor, but then adding modest OoO execution would not be that much added cost). Anyway, it was just a side note that Intel was still using (and in a sense, still selling) an in-order processor though it is not exposed to an application developer. –  May 18 '18 at 23:33
  • @PeterCordes do you mean that even memory is in order, the instructions can still be out of order? – choxsword Feb 12 '21 at 14:58
  • 1
    @scottxiao: Yes (as shown in [an example with two imul dependency chains](https://stackoverflow.com/questions/50494658/are-loads-and-stores-the-only-instructions-that-gets-reordered/50496379)). and vice versa (in-order instruction execution, memory reordering created by the store buffer. And on weakly-ordered ISAs that allow it, also by hit-under-miss load ordering.) – Peter Cordes Feb 12 '21 at 20:14

1 Answers1

32

Reordering

Yes, all modern x86 chips from Intel and AMD aggressively reorder instructions across a window which is around 200 instructions deep on recent CPUs from both manufacturers (i.e. a new instruction may execute while an older instruction more than 200 instructions "in the past" is still waiting). This is generally all invisible to a single thread since the CPU still maintains the illusion of serial execution1 by the current thread by respecting dependencies, so from the point of view of the current thread of execution it is as-if the instructions were executed serially.

Memory Barriers

That should answer the titular question, but then your second question is about memory barriers. It contains, however, an incorrect assumption that instruction reordering necessarily causes (and is the only cause of) visible memory reordering. In fact, instruction reordering is neither sufficient nor necessary for cross-thread memory re-ordering.

Now it is definitely true that out-of-order execution is a primary driver of out-of-order memory access capabilities, or perhaps it is the quest for MLP (Memory Level Parallelism) that drives the increasingly powerful out-of-order abilities for modern CPUs. In fact, both are probably true at once: increasing out-of-order capabilities benefit a lot from strong memory reordering capabilities, and at the same time aggressive memory reordering and overlapping isn't possible without good out-of-order capabilities, so they help each other in kind of a self-reinforcing sum-greater-than-parts kind of loop.

So yes, out-of-order execution and memory reordering certainly have a relationship; however, you can easily get re-ordering without out-of-order execution! For example, a core-local store buffer often causes apparent reordering: at the point of execution the store isn't written directly to the cache (and hence isn't visible at the coherency point), which delays local stores with respect to local loads which need to read their values at the point of execution.

As Peter also points out in the comment thread you can also get a type of load-load reordering when loads are allowed to overlap in an in-order design: load 1 may start but in the absence of an instruction consuming its result a pipelined in-order design may proceed to the following instructions which might include another load 2. If load 2 is a cache hit and load 1 was a cache miss, load 2 might be satisfied earlier in time from load 1 and hence the apparent order may be swapped re-ordered.

So we see that not all cross-thread memory re-ordering is caused by instruction re-ordering, but certain instruction re-ordering also implies out-of-order memory access, right? No so fast! There are two different contexts here: what happens at the hardware level (i.e., whether memory access instructions can, as a practical matter, execute out-of-order), and what is guaranteed by the ISA and platform documentation (often called the memory model applicable to the hardware).

x86 re-ordering

In the case of x86, for example, modern chips will freely re-order more or less any stream of loads and stores with respect to each other: if a load or store is ready to execute, the CPU will usually attempt it, despite the existence of earlier uncompleted load and store operations.

At the same time, x86 defines quite a strict memory model, which bans most possible reorderings, roughly summarized as follows:

  • Stores have a single global order of visibility, observed consistently by all CPUs, subject to one loosening of this rule below.
  • Local load operations are never reordered with respect to other local load operations.
  • Local store operations are never reordered with respect to other local store operations (i.e., a store that appears earlier in the instruction stream always appears earlier in the global order).
  • Local load operations may be reordered with respect to earlier local store operations, such that the load appears to execute earlier wrt the global store order than the local store, but the reverse (earlier load, older store) is not true.

So actually most memory re-orderings are not allowed: loads with respect to each outer, stores with respect to each other, and loads with respect to later stores. Yet I said above that x86 pretty much freely executes out-of-order all memory access instructions - how can you reconcile these two facts?

Well, x86 does a bunch of extra work to track exactly the original order of loads and stores, and makes sure no memory re-orderings that breaks the rules is ever visible. For example, let's say load 2 executes before load 1 (load 1 appears earlier in program order), but that both involved cache lines were in the "exclusively owned" state during the period that load 1 and load 2 executed: there has been reordering, but the local core knows that it cannot be observed because no other was able to peek into this local operation.

In concert with the above optimizations, CPUs also uses speculative execution: execute everything out of order, even if it is possible that at some later point some core can observe the difference, but don't actually commit the instructions until such an observation is impossible. If such an observation does occur, you roll back the CPU to an earlier state and try again. This is cause of the "memory ordering machine clear" on Intel.

So it is possible to define an ISA that doesn't allow any re-ordering at all, but under the covers do re-ordering but carefully check that it isn't observed. PA-RISC is an example of such a sequentially consistent architecture. Intel has a strong memory model that allows one type of reordering, but disallows many others, but each chip internally may do more (or less) re-ordering as long as they can guarantee to play by the rules in an observable sense (in this sense, it somewhat related to the "as-if" rule that compilers play by when it comes to optimizations).

The upshot of all that is that yes, x86 requires memory barriers to prevent specifically the so-called StoreLoad re-ordering (for algorithms that require this guarantee). You don't find many standalone memory barriers in practice in x86, because most concurrent algorithms also need atomic operations, such as atomic add, test-and-set or compare-and-exchange, and on x86 those all come with full barriers for free. So the use of explicit memory barrier instructions like mfence is limited to cases where you aren't also doing an atomic read-modify-write operation.

Jeff Preshing's Memory Reordering Caught in the Act has one example that does show memory reordering on real x86 CPUs, and that mfence prevents it.


1 Of course if you try hard enough, such reordering is visible! An high-impact recent example of that would be the Spectre and Meltdown exploits which exploited speculative out-of-order execution and a cache side channel to violate memory protection security boundaries.

BeeOnRope
  • 60,350
  • 16
  • 207
  • 386
  • *"Not all memory reordering is caused by instruction re-oredring"* What else (other than instruction reordering) causes memory reordering? – Steve May 12 '18 at 22:26
  • 2
    @Steve - take a look at my comment thread below the question with Peter. The primary example is a store buffer, which may be present on chips that don't re-order instructions at all. The example is also given of in-order chips that allow MLP - this could cause load-load reordering if the responses come back in a different order than instruction order (e.g., because an older load misses and a new load hits). I'm updating my answer to make this clearer. – BeeOnRope May 12 '18 at 22:31
  • @Steve - I updated my question to make this all clearer (I hope). Let me know if there is any confusion. – BeeOnRope May 12 '18 at 23:03
  • @Bee: Good update, this captures what we were discussing in comments on the question. – Peter Cordes May 13 '18 at 01:01
  • @BeeOnRope Thank you for the amazing answer to this question! I'm a student at DePaul University and our group was tasked with turning this question/solution into a presentation and video. Our group would like to post a link to our presentation so anyone trying to wrap their head around this problem in the future will have an extra resource. Thanks again for sharing! Here's our work: https://docs.google.com/presentation/d/1mYq9ylK0nRa-TyTw6MuThILN6cDzA_HtbDkkTmK12LE/edit?usp=sharing – jblew Nov 18 '20 at 01:17
  • @BeeOnRope does intel's restriction on memory re-order cause instruction to not be reordered? – choxsword Feb 12 '21 at 14:53
  • No, they still reorder memory instructions but then just check before retirement that the reordering wasn't visible. – BeeOnRope Feb 13 '21 at 06:58
  • i am curious of what the cpu will do if interrupt arrive between two storeLoad instruction?wait the store to complete or do noting with this reorder? – Chinaxing Jun 10 '21 at 14:07