
I'm wondering why memory barriers are needed, and I have read some articles on this topic.
Some say it's because of CPU out-of-order execution, while others say it's because of the cache-consistency problems that the store buffer and invalidate queue cause.
So, what's the real reason memory barriers are needed? CPU out-of-order execution, cache-consistency problems, or both? Does CPU out-of-order execution have something to do with cache consistency? And what's the difference between x86 and ARM?

cong
  • It has to do with neither specifically. They basically stop new transactions and allow transactions in flight to complete, to avoid race conditions that can cause something undesirable/unpredictable to happen within a specific system design. They let you perform specific transactions on a system in a known state. – old_timer Sep 20 '20 at 00:06
  • With all the parallel things going on, it's normally essentially controlled chaos; a barrier pauses the chaos. Like stopping traffic to help a slow/elderly person across the road, and then the chaos can continue. – old_timer Sep 20 '20 at 00:10
  • 1
    Some systems will have separate instruction barriers and data barriers to handle or isolate the different areas. The places where you need them are very specific to a system that doesnt mean x86 this and arm that or cache this and pipeline that, but this specific x86 processor, this specific arm core implemented in this way needs a barrier before performing this operation. And not all x86 processors or arm cores need it in that place for that operation. They are used to prevent potential race conditions causing undesirable or unpredictable results. – old_timer Sep 20 '20 at 00:36

1 Answer


You need barriers to order this core / thread's accesses to globally-visible coherent cache when the ISA's memory ordering rules are weaker than the semantics you need for your algorithm.

Cache is always coherent, but that's a separate thing from consistency (ordering between multiple operations).

You can have memory reordering on an in-order CPU. In more detail, How is load->store reordering possible with in-order commit? shows how you can get memory reordering on a pipeline that starts executing instructions in program order, but with a cache that allows hit-under-miss and/or a store buffer allowing OoO commit.

Related:


See also https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/ and https://preshing.com/20120930/weak-vs-strong-memory-models for some more basics. x86 has a "strong" memory-ordering model: program order plus a store buffer with store-forwarding. So C++ acquire and release are "free"; only atomic RMWs and seq_cst stores need barriers.

ARM has a "weak" memory ordering model: only C++ memory_order_consume (data dependency ordering) is "free", acquire and release require special instructions (like ldar / stlr) or barriers.

Peter Cordes
  • So, are memory barriers needed on an x86 CPU that has no cache and supports only instruction reordering? – cong Sep 20 '20 at 12:23
  • 2
    @cong: If it still has a store buffer, then yes, you'd need an `mfence` barrier (or locked instruction) for sequential consistency. NT stores would be mostly pointless so presumably there'd be no use for `sfence`. Most code doesn't need that, acq_rel is fine. Only stuff like double-checked locking matters. Cache is irrelevant for needing memory barriers, as I said it's coherent. – Peter Cordes Sep 20 '20 at 13:04
  • So, if a CPU has no store buffer, there is no worry about memory reordering, and so memory barriers are not needed, no matter whether the CPU supports instruction reordering or not? Is the store buffer the only thing that matters? – cong Sep 20 '20 at 16:06
  • 2
    @cong: Hmm, I guess it would still be possible for an x86 CPU with no store buffer to allow out-of-order *execution* of stores after later loads. A CPU like that would be totally impractical; a store buffer is a cheap way to gain tons of performance. OoO exec without decoupling exec from cache/memory would be a huge waste of resources in a memory model that allows any memory reordering. Being x86 would mean OoO exec was constrained by the strong memory ordering rules, without a store buffer to decouple exec from visibility. In real life, even simple in-order CPUs have store buffers. – Peter Cordes Sep 20 '20 at 16:35
  • 3
    @cong: Also, such a CPU would have to prove that no intervening instructions could possibly fault before executing the store. Once a store becomes globally visible, you can't "take it back" if you detect mis-speculation. The normal way of doing OoO exec wouldn't work; it relies on the store buffer to keep non-retired (speculative) stores private until it's known for sure that no earlier instruction faulted. (i.e. when the store instruction retires from the ROB, the store-buffer entry "graduates" and is eligible to commit to L1d cache. Before retirement, everything is treated as speculative) – Peter Cordes Sep 20 '20 at 16:37
  • 1
    AMD even apparently has a speculative post-retirement store buffer, allowing rollbacks even after retirement (but before visibility) to support fast atomics. – BeeOnRope Sep 20 '20 at 21:07
  • @PeterCordes, where do you find the x86 architecture information concerning the CPU, cache, memory, store buffer, invalidate queue, etc.? I only found the instruction reference manual, which doesn't give the information I want. – cong Sep 21 '20 at 12:39
  • 1
    @cong: https://agner.org/optimize/, articles like http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ and https://www.realworldtech.com/sandy-bridge/ (and RWT forums), sometimes Intel's optimization manual although it's kind of a mess. Also the existence of some performance counter events (and their names/descriptions) give hints: `perf list` on Linux, or https://oprofile.sourceforge.io/docs/intel-sandybridge-events.php. Given those sources of cpu-architecture implementation details, we can reason about how those pieces must fit together to implement a CPU that follows x86 ISA rules. – Peter Cordes Sep 21 '20 at 15:19
  • @cong: For memory ordering specifically, https://stackoverflow.com/tags/x86/info also has some links about that. I forget exactly where I read about the standard design of CPUs doing "local" reordering in their accesses to a coherent shared state maintained by [MESI](https://en.wikipedia.org/wiki/MESI_protocol). Keeping speculation local to a single core (so it can roll itself back without having to disturb other cores on mis-speculation) is a fairly universal design strategy that makes obvious sense. – Peter Cordes Sep 21 '20 at 15:25