7

Consider the following example, taken from Wikipedia and slightly adapted, where the steps of the program correspond to individual processor instructions:

x = 0;
f = 0;

Thread #1:
   while (f == 0);
   print x;

Thread #2: 
   x = 42;
   f = 1;

I'm aware that the print statement might print different values (42 or 0) when the threads are running on two different physical cores/processors, due to out-of-order execution.

However I don't understand why this is not a problem on a single core machine, with those two threads running on the same core (through preemption). According to Wikipedia:

When a program runs on a single-CPU machine, the hardware performs the necessary bookkeeping to ensure that the program executes as if all memory operations were performed in the order specified by the programmer (program order), so memory barriers are not necessary.

As far as I know single-core CPUs too reorder memory accesses (if their memory model is weak), so what makes sure the program order is preserved?

Peter Cordes
Ignorant

2 Answers

7

The CPU would not be aware that these are two threads. Threads are a software construct (1).

So the CPU sees these instructions, in this order:

store x = 42
store f = 1
test f == 0
jump if true ; not taken
load x

If the CPU were to re-order the store of x to the end, after the load, it would change the results. While the CPU is allowed to execute out of order, it only does so when that doesn't change the result. If it were allowed to change results, virtually every sequence of instructions could fail; it would be impossible to produce a working program.

In this case, a single CPU is not allowed to re-order a store past a later load of the same address. At least, as far as the CPU can see, it is not re-ordered. As far as the L1, L2, and L3 caches and main memory (and other CPUs!) are concerned, maybe the store has not been committed yet.

(1) Something like Hyper-Threading, with two hardware threads per core, common in modern CPUs, wouldn't count as "single-CPU" w.r.t. your question.

TrentP
  • The CPU wouldn't know "threads", but it certainly knows "interrupts" and "supervisor mode". – curiousguy Dec 18 '19 at 13:45
  • @curiousguy: Right, but CPUs do still need to preserve the illusion of sequential execution on a single core even when the sequence of instructions executed includes some instructions that are part of an interrupt handler, which execute in supervisor mode before jumping back to user-space. It knows what they are, but they're not special wrt. the requirement to preserve the illusion of sequential execution for a single core. A context switch is just storing some registers to memory and loading new values (including a stack pointer). – Peter Cordes Sep 20 '21 at 21:43
6

The CPU doesn't know or care about "context switches" or software threads. All it sees is some store and load instructions. (e.g. in the OS's context-switch code where it saves the old register state and loads the new register state)

The cardinal rule of out-of-order execution is that it must not break a single instruction stream. Code must run as if every instruction executed in program order, and all its side effects finished before the next instruction starts. This includes software context-switching between threads on a single core, e.g. on a single-core machine, or green threads within one process.

(Usually we state this rule as not breaking single-threaded code, with the understanding of what exactly that means; weirdness can only happen when an SMP system loads from memory locations stored by other cores).

As far as I know single-core CPUs too reorder memory accesses (if their memory model is weak)

But remember, other threads aren't observing memory directly with a logic analyzer, they're just running load instructions on that same CPU core that's doing and tracking the reordering.

If you're writing a device driver, yes you might have to actually use a memory barrier after a store to make sure it's actually visible to off-chip hardware before doing a load from another MMIO location.

Or when interacting with DMA, making sure data is actually in memory, not in a CPU-private write-back cache, can be a problem. Also, MMIO is usually done in uncacheable memory regions that imply strong memory ordering. (x86 has cache-coherent DMA, so you don't have to actually flush back to DRAM, only make sure it's globally visible with an instruction like x86 mfence that waits for the store buffer to drain. But some non-x86 ISAs that had cache-control instructions designed in from the start do require OSes to be aware of them: i.e. to make sure cache is invalidated before reading in new contents from disk, and to make sure data is at least written back to somewhere DMA can read from before asking a device to read from a page.)

And BTW, even x86's "strong" memory model is only acq/rel, not seq_cst (except for RMW operations, which are full barriers). (Or more specifically, it's a store buffer with store forwarding on top of sequential consistency.) Stores can be delayed until after later loads (StoreLoad reordering). See https://preshing.com/20120930/weak-vs-strong-memory-models/

so what makes sure the program order is preserved?

Hardware dependency tracking: loads snoop the store buffer, checking for a recent store to the location being loaded. This makes sure loads take their data from the last program-order write to any given memory location.¹

Without this, code like

  x = 1;
  int tmp = x;

might load a stale value for x. That would be insane and unusable (and kill performance) if you had to put memory barriers after every store for your own reloads to reliably see the stored values.

We need all instructions running on a single core to give the illusion of running in program order, according to the ISA rules. Only DMA or other CPU cores can observe reordering.


Footnote 1: If the address of an older store isn't available yet, a CPU may even speculate that it will be to a different address and load from cache instead of waiting for the store-address part of the store instruction to execute. If it guessed wrong, it will have to roll back to a known-good state, just as with branch misprediction. This is called "memory disambiguation". See also Store-to-Load Forwarding and Memory Disambiguation in x86 Processors for a technical look at it, including cases of a narrow reload from part of a wider store, possibly unaligned and maybe spanning a cache-line boundary...

Peter Cordes
  • "'strong' memory model is only acq_rel" – more specifically, its loads are acq, its stores are rel but not acq: **stores don't atomically read-acquire the previous value** (they don't read it at all) – curiousguy Dec 18 '19 at 13:57
  • 1
    @curiousguy: There are a few cases in C++11 where `mo_acq_rel` means acq for loads, rel for stores. e.g. as the arg for CAS_weak, `acq_rel` means the load-only part of failure has acq semantics. That's what I was thinking when I wrote it, but fair point, my explanation relied on the implicit knowledge that stores don't read-acquire. Changed it to "acq/rel" and pointed out that RMW ops are strong. [cppreference](https://en.cppreference.com/w/cpp/atomic/memory_order) doesn't mention acq_rel as breaking down into separate orders for pure-load or pure-store so again not helpful. – Peter Cordes Dec 18 '19 at 14:02
  • @PeterCordes I'm wondering whether any real processor can actually "see" insns from multiple threads (except hardware threads) at the same time (e.g. in the ROB) (in which case memory disambiguation would resolve memory ordering that would otherwise be visible), or does precise interrupts (pipeline/ROB flush on context switch) ensure that it always sees code from one thread only? – Daniel Nitzan Jun 05 '21 at 15:25
  • 1
    @DanielNitzan: Real fine-grained [SMT](https://en.wikipedia.org/wiki/Simultaneous_multithreading) hardware, like Intel Hyperthreading / AMD / POWER / Alpha EV8, can have uops from multiple hardware threads in flight at once in the ROB (partitioned) / RS (competitively shared on Intel, IIRC), but they never see each other's speculative state; otherwise a branch mispredict or other mis-speculation would require rolling back the whole core, defeating the purpose of SMT: keeping the HW busy while one thread recovers from a stall such as a branch miss. – Peter Cordes Jun 05 '21 at 21:26
  • @DanielNitzan: POWER does let other SMT threads store-forward from "graduated" (aka retired) stores in the store buffer, because by definition those are non-speculative and thus ready to commit to L1d cache. [This is the source of its IRIW reordering](https://stackoverflow.com/questions/27807118/will-two-atomic-writes-to-different-locations-in-different-threads-always-be-see/50679223#50679223), and the reason ISAs like x86 that guarantee the existence of a total order for all stores can't do that. In practice Intel just partitions the store buffer. – Peter Cordes Jun 05 '21 at 21:29
  • @DanielNitzan: See also [What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?](https://stackoverflow.com/a/45610386) – Peter Cordes Jun 05 '21 at 21:29
  • @DanielNitzan: Oh, but I see you're talking about OoO exec overlap across a context switch. Precise interrupts are done (by normal CPU architectures) by in-order retirement. That does *not* require draining the store buffer on interrupts, but yes the fact that the privilege level isn't renamed means the ROB has to be drained. – Peter Cordes Jun 05 '21 at 21:34
  • 1
    @DanielNitzan: Of course, within the kernel the actual context switch is just a function that stores some register values and loads some new ones, including a new stack pointer, which means the `ret` will pop a return address from the new thread. The new user-space context is only restored as part of returning to user-space, and yes a memory barrier (implicit or explicit) is necessary to make sure suspending a thread on one core and resuming the same thread on another core lets a thread see all its own recent stores! At least release/acquire sync. – Peter Cordes Jun 05 '21 at 21:35
  • @PeterCordes Do debuggers use barriers as well to prevent the user from observing a delayed store? For instance, if execution is paused on an instruction just after a store, what prevents us from seeing an old value in memory, even if only for a brief moment? – Daniel Nitzan Sep 20 '21 at 16:56
  • 1
    @DanielNitzan: Under a modern OS, a debugger can only operate on another process via system calls. Single-stepping or reaching a breakpoint must be communicated to another thread somehow, and on a multi-core machine yes that would require a release store. (And acquire load in kernel side of the implementation of the `ptrace` system call waiting for that to happen.) – Peter Cordes Sep 20 '21 at 17:25
  • 1
    @DanielNitzan: But if everything is happening on the same (logical) core like this Q&A is about, then even if the single-step / breakpoint interrupt isn't fully serializing, the debugger is running on the same core as the code being debugged. So the debugger's loads see *all* previous stores, because anything else would violate the cardinal rule of CPU architecture, which is to preserve the "as-if" appearance for a single core of each instruction fully executing before the next one starts, in execution order into interrupt handlers etc. – Peter Cordes Sep 20 '21 at 17:27
  • @PeterCordes I understand the requirement for at least a release/acquire sync to properly propagate the store and anything that happens-before it, but don’t we actually need a full fence to also flush any pending stores when we stop on a breakpoint? Otherwise they may not be visible (briefly) – Daniel Nitzan Sep 20 '21 at 20:54
  • 1
    @DanielNitzan: No. If the breakpoint / single-step interrupt handler (on the core that was running the code being debugged) does a release store, a load on another core which sees that store is also guaranteed to see *all* preceding stores from the core / thread being debugged. That's the whole point here; there aren't separate store buffers per software thread, just one per logical core, and that's the same core that was running the code which generated the debug trap (because traps like these are local, not inter-processor interrupts, just like page faults or whatever). – Peter Cordes Sep 20 '21 at 21:14
  • @DanielNitzan: Note that seeing a release store on another core means that all earlier stores must have already been drained from the store buffer. We just don't need to make the core doing the stores stall while it *waits* for that to happen. (Or on PowerPC where private forwarding between SMT threads is allowed by the memory model and happens on real hardware, earlier stores are visible via that mechanism.) – Peter Cordes Sep 20 '21 at 21:18
  • @DanielNitzan: In any case, none of that is relevant on a single-core machine like this Q&A is about. Still not clear if you're asking about that or SMP. With a single core, the CPU hardware maintains the illusion of sequential execution, usually via store-forwarding. That's why single-threaded code doesn't need memory barriers to see its own stores. This happens in terms of physical address in most designs so it isn't subject to aliasing problems even if the store is to one virtual page and the load is from another virtual page, both backed by the same physical page. – Peter Cordes Sep 20 '21 at 21:22
  • 1
    @PeterCordes I was implicitly referring to the SMP case. I haven't realized that the other core essentially waits on the interrupt handler's store in its `ptrace` syscall. With that understanding it all makes sense to me now, thanks a ton. And thanks for elaborating on it and mentioning the fun fact about physical address involvement in dependency hazards! – Daniel Nitzan Sep 20 '21 at 22:06
  • 1
    @DanielNitzan: yeah, if you're single-stepping another process, it's not even already running. You don't blindly fire off a "single step please" message to another core and hope it happens before you do something else like `ptrace(PTRACE_PEEKDATA)` (memory) or `ptrace(PTRACE_PEEKUSER)` (register). I can totally see how someone might think of that mental model first, but note that no amount of barriers could make it safe. Barrier instructions only order a core's own access to shared memory (coherent cache); they don't make one core wait for another. – Peter Cordes Sep 20 '21 at 22:54