
I'm wondering why memory barriers are needed, and I have read some articles on this topic.
Some say it's because of CPU out-of-order execution, while others say it's because of the cache-consistency problems that the store buffer and invalidate queue cause.
So, what's the real reason memory barriers are needed? CPU out-of-order execution, cache-consistency problems, or both? Does CPU out-of-order execution have something to do with cache consistency? And what's the difference between x86 and ARM?

cong
  • It has to do with neither specifically. They basically stop new transactions and allow transactions in flight to complete, to avoid race conditions that can cause something undesirable/unpredictable to happen within a specific system design. They let you perform specific transactions on a system in a known state. – old_timer Sep 20 '20 at 00:06
  • With all the parallel things going on, it's normally essentially controlled chaos; a barrier pauses the chaos. Like stopping traffic to help a slow/elderly person across the road, and then the chaos can continue. – old_timer Sep 20 '20 at 00:10
  • 1
    Some systems will have separate instruction barriers and data barriers to handle or isolate the different areas. The places where you need them are very specific to a system that doesnt mean x86 this and arm that or cache this and pipeline that, but this specific x86 processor, this specific arm core implemented in this way needs a barrier before performing this operation. And not all x86 processors or arm cores need it in that place for that operation. They are used to prevent potential race conditions causing undesirable or unpredictable results. – old_timer Sep 20 '20 at 00:36

1 Answer


You need barriers to order this core / thread's accesses to globally-visible coherent cache when the ISA's memory ordering rules are weaker than the semantics you need for your algorithm.

Cache is always coherent, but that's a separate thing from consistency (ordering between multiple operations).

You can have memory reordering on an in-order CPU. In more detail, How is load->store reordering possible with in-order commit? shows how you can get memory reordering on a pipeline that starts executing instructions in program order, but with a cache that allows hit-under-miss and/or a store buffer allowing OoO commit.

Related:


See also https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/ and https://preshing.com/20120930/weak-vs-strong-memory-models for some more basics. x86 has a "strong" memory-ordering model: program order plus a store buffer with store-forwarding. So C++ acquire and release are "free"; only atomic RMWs and seq_cst stores need barriers.

ARM has a "weak" memory ordering model: only C++ memory_order_consume (data dependency ordering) is "free", acquire and release require special instructions (like ldar / stlr) or barriers.

Peter Cordes
  • So, are memory barriers needed on an x86 CPU that has no cache and supports only instruction reordering? – cong Sep 20 '20 at 12:23
  • 2
    @cong: If it still has a store buffer, then yes, you'd need an `mfence` barrier (or locked instruction) for sequential consistency. NT stores would be mostly pointless so presumably there'd be no use for `sfence`. Most code doesn't need that, acq_rel is fine. Only stuff like double-checked locking matters. Cache is irrelevant for needing memory barriers, as I said it's coherent. – Peter Cordes Sep 20 '20 at 13:04
  • So, if a CPU has no store buffer, there is no worry about memory reordering, and so memory barriers are not needed, no matter whether the CPU supports instruction reordering or not? Is the store buffer the only thing that matters? – cong Sep 20 '20 at 16:06
  • 2
    @cong: Hmm, I guess it would still be possible for an x86 CPU with no store buffer to allow out-of-order *execution* of stores after later loads. A CPU like that would be totally impractical; a store buffer is a cheap way to gain tons of performance. OoO exec without decoupling exec from cache/memory would be a huge waste of resources in a memory model that allows any memory reordering. Being x86 would mean OoO exec was constrained by the strong memory ordering rules, without a store buffer to decouple exec from visibility. In real life, even simple in-order CPUs have store buffers. – Peter Cordes Sep 20 '20 at 16:35
  • 3
    @cong: Also, such a CPU would have to prove that no intervening instructions could possibly fault before executing the store. Once a store becomes globally visible, you can't "take it back" if you detect mis-speculation. The normal way of doing OoO exec wouldn't work; it relies on the store buffer to keep non-retired (speculative) stores private until it's known for sure that no earlier instruction faulted. (i.e. when the store instruction retires from the ROB, the store-buffer entry "graduates" and is eligible to commit to L1d cache. Before retirement, everything is treated as speculative) – Peter Cordes Sep 20 '20 at 16:37
  • 1
    AMD even apparently has a speculative post-retirement store buffer, allowing rollbacks even after retirement (but before visibility) to support fast atomics. – BeeOnRope Sep 20 '20 at 21:07
  • @PeterCordes, where do you find the x86 architecture information concerning the CPU, cache, memory, store buffer, invalidate queue, etc.? I only found the instruction reference manual, which doesn't give the information I want. – cong Sep 21 '20 at 12:39
  • 1
    @cong: https://agner.org/optimize/, articles like http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ and https://www.realworldtech.com/sandy-bridge/ (and RWT forums), sometimes Intel's optimization manual although it's kind of a mess. Also the existence of some performance counter events (and their names/descriptions) give hints: `perf list` on Linux, or https://oprofile.sourceforge.io/docs/intel-sandybridge-events.php. Given those sources of cpu-architecture implementation details, we can reason about how those pieces must fit together to implement a CPU that follows x86 ISA rules. – Peter Cordes Sep 21 '20 at 15:19
  • @cong: For memory ordering specifically, https://stackoverflow.com/tags/x86/info also has some links about that. I forget exactly where I read about the standard design of CPUs doing "local" reordering in their accesses to a coherent shared state maintained by [MESI](https://en.wikipedia.org/wiki/MESI_protocol). Keeping speculation local to a single core (so it can roll itself back without having to disturb other cores on mis-speculation) is a fairly universal design strategy that makes obvious sense. – Peter Cordes Sep 21 '20 at 15:25