
I'm trying to wrap my head around the issue of memory barriers right now. I've been reading and watching videos about the subject, and I want to make sure I understand it correctly, as well as ask a question or two.

Let me start by making sure I understand the problem accurately. Take the following classic example as the basis for the discussion: suppose we have 2 threads running on 2 different cores.

This is pseudo-code!

We start with int f = 0; int x = 0; and then run those threads:

# Thread 1

while (f == 0);
print(x);

# Thread 2

x = 42;
f = 1;

Of course, the desired result of this program is that thread 1 will print 42.

NOTE: I'm leaving "compile-time reordering" out of this discussion; I only want to focus on what happens at runtime, so ignore all the optimizations the compiler might do.

Ok, so from what I understand, the problem here is what is called "memory reordering": the CPU is free to reorder memory operations as long as the end result, from the executing thread's own point of view, is what the program expects. In this case, within thread 2, f = 1 may be executed before x = 42. If that happens, thread 1 will print 0, which is not what the programmer wants.
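To make this concrete, here is a minimal runnable sketch of the writer/reader pair (my own illustration, not from any of the sources I've read; it assumes C++11 std::atomic with memory_order_relaxed, which makes the race well-defined while still allowing the hardware to reorder the two stores):

#include <atomic>
#include <cstdio>
#include <thread>

// Shared variables. Relaxed atomics: no data race, but no ordering either.
std::atomic<int> x{0}, f{0};

void writer() {                                   // "thread 2"
    x.store(42, std::memory_order_relaxed);       // may become visible AFTER f
    f.store(1,  std::memory_order_relaxed);
}

void reader() {                                   // "thread 1"
    while (f.load(std::memory_order_relaxed) == 0) { }      // spin on the flag
    std::printf("%d\n", x.load(std::memory_order_relaxed)); // may print 0!
}

int main() {
    std::thread t2(writer), t1(reader);
    t1.join();
    t2.join();
}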

At this point, Wikipedia points out another possible scenario:

Similarly, thread #1's load operations may be executed out-of-order and it is possible for x to be read before f is checked

Since we're talking right now about "out-of-order execution", let's set the cores' caches aside for a moment. So let's analyze what happens here. Start with thread 2 - the compiled instructions will look (in pseudo-assembly) something like:

1 put 42 into register1
2 write register1 to memory location of x
3 put 1 into register2
4 write register2 to memory location of f

Ok so I understand that 3-4 may be executed before 1-2. But I don't understand the equivalent in thread 1:

Let's say the instructions of thread 1 will be something like:

1 load f to register1
2 if register1 is 0 - jump to 1
3 load x to register2
4 print register2

What exactly may be out of order here? Can 3 come before 1-2?

Let's go on. Up until now we've talked about out-of-order execution, which brings me to my primary confusion:

In this great post the author describes the problem as follows: each core has its own cache, and the core performs its memory operations against that cache, not against main memory. The movement of data from the core-specific caches to main memory (or to a shared cache) happens at unpredictable times and in unpredictable order. So in our example - even if thread 2 executes its instructions in order, so that x = 42 is written before f = 1 - those writes go only to core 2's cache, and the propagation of the two values to shared memory may happen in the opposite order; hence the problem.

So I don't understand - when we talk about "memory reordering", are we talking about out-of-order execution, or about the movement of data across caches?

YoavKlein
  • I don't understand the question, because you already put it in simple words: "the f = 1 may be executed before x = 42". In thread 2 it doesn't matter whether `1` or `42` is stored first, hence it can be reordered – 463035818_is_not_an_ai Mar 28 '22 at 08:50
  • About thread 1 executing out of order: assuming no other thread interferes with `x`, would the program behaviour change if we loaded `x` into register2 before step 1 or 2? – Aconcagua Mar 28 '22 at 08:51
  • Is this `f = 1;` pseudo-code for `std::atomic` with `memory_order_relaxed`? like `x.store(1, std::memory_order_relaxed)`? You didn't specify types for `x` and `f`, and your code is totally broken beyond just ordering with plain `int`, but doesn't need any extra C++ barriers with plain `=` on a std::atomic type, because the default memory order is `seq_cst`. Or are you talking about the asm barriers a C++ compiler might have to use to implement the ordering semantics of the C++ code? – Peter Cordes Mar 28 '22 at 09:32
  • Or are you reading examples that assume you're hand-rolling your own atomics with `volatile`? [Don't do that](https://stackoverflow.com/questions/4557979/when-to-use-volatile-with-multi-threading/58535118#58535118), although it is possibly useful to understand what happens under the hood in that case. (But not for reasoning about how std::atomic works.) – Peter Cordes Mar 28 '22 at 09:35
  • @PeterCordes - in the case of Visual Studio C | C++, volatile will generate memory fence operations. Most (or all?) other C | C++ compilers do not do this. – rcgldr Mar 28 '22 at 23:26
  • @rcgldr: Not by default on current versions of MSVC; the default now seems to be `/volatile:iso` instead of `ms`, not even giving acq/rel ordering on AArch64. (And warns about that with `-Wall`: https://godbolt.org/z/Yeq848nEe). But yes, historically MSVC did avoid compile-time reordering of other operations wrt. `volatile`, not just wrt. other volatile accesses, giving acq/rel semantics without any barrier instructions thanks to x86's strongly ordered memory model. With `/volatile:ms` on AArch64, we get stlr and ldar release-store / acquire-load: https://godbolt.org/z/cWhchcPsr – Peter Cordes Mar 29 '22 at 02:27
  • @PeterCordes - I wonder why they changed it. I'm running VS2015 on a desktop with Win 7, and VS2019 on a laptop with Win 10. The desktop is multi-boot (XP, XP-X64, Win 7 Pro 64, Win 10 Pro 64), but I mostly use Win 7 since I don't want to migrate the programs I use to Win 10. – rcgldr Mar 29 '22 at 03:50

3 Answers


when we talk about "memory reordering", are we talking about out-of-order execution, or about the movement of data across caches?

When a thread observes changes of values in a particular order, then from the programmer's perspective it is indistinguishable whether that was due to out-of-order execution of loads, a store buffer delaying stores relative to loads and possibly letting them commit out of order (regardless of execution order), or (hypothetically in a CPU without coherent cache) cache synchronization.

Or even by forwarding store data between logical cores without going through cache, before it commits to cache and becomes visible to all cores. Some POWER CPUs can do this in real life but few if any others.

Real CPUs have coherent caches; once a value commits to cache, it's visible to all cores; it can't happen until other copies are already invalidated, so this is not the mechanism for reading "stale" data. Memory reordering on real-world CPUs is something that happens within a core, with reads and writes of coherent cache possibly happening in a different order than program order. Cache doesn't re-sync after getting out of sync; it maintains coherency in the first place.

The important effect, regardless of mechanism, is that another thread observing the same variables you're reading/writing can see the effects happen in a different order than assembly program order.
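For a concrete illustration (my sketch, not part of the original answer; it assumes C++11 std::atomic), here is the classic store-buffer litmus test referenced in the comments below. Under sequential consistency, r1 == 0 && r2 == 0 is impossible, yet real CPUs (including x86) allow it, because each core's store sits in its store buffer while its load already reads the old value from cache:

#include <atomic>
#include <iostream>
#include <thread>

std::atomic<int> X{0}, Y{0};
int r1, r2;

// Each thread stores to one variable, then loads the other.
void t1() { X.store(1, std::memory_order_relaxed); r1 = Y.load(std::memory_order_relaxed); }
void t2() { Y.store(1, std::memory_order_relaxed); r2 = X.load(std::memory_order_relaxed); }

int main() {
    for (int i = 0; i < 100000; ++i) {
        X = 0; Y = 0;
        std::thread a(t1), b(t2);
        a.join(); b.join();
        if (r1 == 0 && r2 == 0)   // StoreLoad reordering was observed
            std::cout << "both loads saw 0 on iteration " << i << '\n';
    }
}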

j6t
  • Real-world CPUs all have coherent caches. A store can't commit to local L1d cache until this core has received acknowledgements from any/all other cores and caches that they've invalidated their copies of the line. See [this answer](https://stackoverflow.com/questions/4557979/when-to-use-volatile-with-multi-threading/58535118#58535118) for more details about the common misconception that CPU caches can be stale. It's not caches, it's the *store buffer*, and on some CPUs other memory-reordering effects in local ordering of operations on L1d cache that create memory reordering. – Peter Cordes Mar 28 '22 at 09:36
  • See also [Reason for the name of the "store buffer" litmus test on x86 TSO memory model](https://stackoverflow.com/q/69112020) and [Can a speculatively executed CPU branch contain opcodes that access RAM?](https://stackoverflow.com/q/64141366) / [What's the relationship between CPU Out-of-order execution and memory order?](https://stackoverflow.com/q/70749012) / and the duplicate list on [Is memory reordering equivalent to instruction reordering?](https://stackoverflow.com/q/65684882) – Peter Cordes Mar 28 '22 at 09:40
  • @PeterCordes The point is, for the programmer it is quite irrelevant where re-orderings happen. Memory barriers have an influence on all of the potential sources. – j6t Mar 28 '22 at 09:46
  • The question asks what memory reordering is in terms of lower-level CPU-architecture things. This answer is technically true, depending what you mean by "cache synchronization" (maybe something that could happen on a hypothetical CPU without coherent caches?), because yeah the C++-level effect could be about the same as for example StoreLoad reordering caused by a store buffer. But that's not helpful for understanding how CPUs work and building a mental model that reflects reality. – Peter Cordes Mar 28 '22 at 09:52
  • In practice, memory reordering is almost entirely the difference between program order and access to local L1d cache. (Or [forwarding between logical cores *without* going through cache](https://stackoverflow.com/questions/27807118/will-two-atomic-writes-to-different-locations-in-different-threads-always-be-see/50679223#50679223)). On weakly-ordered CPUs, an in-order CPU with hit-under-miss capability but not true out-of-order execution can still do LoadLoad and LoadStore reordering. And the store buffer creates StoreLoad reordering. – Peter Cordes Mar 28 '22 at 09:55
  • So it's like in https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/ that the question linked: this core's operations read from or update its own L1d cache (and thus the coherent view of memory shared by all caches) out of order for various possible reasons. If you say "cache synchronization" is why you had a read miss, or why a store can't commit right away, then this answer is true but I don't find it very helpful. – Peter Cordes Mar 28 '22 at 09:56
  • @PeterCordes The question asks about *movement of data across caches*. Please consider the words *cache synchronization* as a paraphrase of that. If OP (or I) were deep in hardware and CPU internals, they certainly would have used more precise terms. – j6t Mar 28 '22 at 10:09
  • If that's what you mean by cache synchronization, then please stop spreading misconceptions about how real CPUs work. With coherent caches, once data is in cache, it's visible to all cores, no further reordering happens. (And on most hardware memory models, but not PowerPC, going through cache is the only way for data to get between threads.) If it did happen it would be indistinguishable, but the question is asking what the actual mechanism is that produces this reordering. I think looking for a high-level summary of what actually *does* happen in real CPUs, not just hypotheticals. – Peter Cordes Mar 28 '22 at 10:22
  • If you want to mention cache to cache data movement as a possible source of reordering, it's IMO important to say that it's only for a hypothetical CPU without coherent caches. Otherwise you get people with wrong idea that are unhelpful as they learn more. – Peter Cordes Mar 28 '22 at 10:24
  • @PeterCordes, I'm kind of lost in this thread of comments, but bottom line - if I got you correctly, you're saying that synchronization between caches is NOT the cause of memory reordering, as suggested in the "preshing" article? And are you saying that memory reordering can occur within a single core? Because Wikipedia (and other sources) clearly state that's not true – YoavKlein Mar 29 '22 at 05:30
  • @YoavKlein: That's correct. Various mechanisms inside one CPU core will make the order of loads reading L1d cache (sampling from the shared cache-coherent view) and stores writing L1d cache (updating the shared cache-coherent view) differ from the program order of those operations. This results in memory reordering being visible when threads interact with each other via shared memory. (A CPU always has to preserve the illusion of a single thread running in program order, seeing its own stores.) But the mechanisms for recovering sequential consistency only involve blocking local reorderings. – Peter Cordes Mar 29 '22 at 06:02
  • @YoavKlein: For example, an x86 `mfence` instruction or ARM `dmb ish` just has to make sure all loads/stores before that instruction are completed (reading from cache, or writing it and still having it in MESI "modified" state) before any later loads/stores do the same. It doesn't have to push recently-stored cache lines out to other CPUs; at that point it's impossible for other cores to have stale copies of them. – Peter Cordes Mar 29 '22 at 06:05

From my point of view, you missed the most important thing!

Since the compiler cannot see any side effect from the changes to x or f, it is allowed to optimize all of that away. The loop with the condition f == 0 likewise collapses to "nothing": the compiler only sees the constant f = 0 propagated beforehand, so it can assume f == 0 is always true and optimize the loop away.
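To illustrate that (a sketch of the legal transformation, not literal compiler output; it assumes f is a plain global int):

int f = 0;

void wait_as_written() {
    while (f == 0) { }  // what the programmer wrote
}

void wait_as_optimized() {
    // What the optimizer may produce instead: a data race on a plain int is
    // undefined behavior, so it may assume no other thread changes f, load
    // it once, and never re-read it inside the loop.
    if (f == 0) {
        for (;;) { }    // the exit condition can never change
    }
}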

For all of that, you have to tell the compiler that something can happen which is not visible in the given flow of the code. That can be a call to some semaphore/mutex/... or other IPC functionality, or the use of atomic variables.

If you compile your code, I assume you get more or less "nothing", as neither code part has any visible effect; the compiler does not see that the variables are used from two thread contexts, and optimizes everything away.

If we implement the code as in the following example, we see that it fails and prints 0 on my system.

#include <iostream>
#include <thread>

int main()
{
    int f = 0;
    int x = 0;

    // Writer thread: stores x first, then sets the flag.
    std::thread s( [&f,&x](){ x=42; f=1; } );

    // Busy-wait on the flag - a data race, since f is a plain int!
    while( f==0 );
    std::cout << x << std::endl;

    s.join();
}

and if we change int f = 0; to std::atomic<int> f = 0; we get the expected result.
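For completeness, a minimal sketch of that changed version (assuming C++11; the atomic is brace-initialized, since copy-initialization of an atomic only became valid in C++17). The default memory order for plain = and the implicit load is seq_cst, which is what restores the ordering guarantee:

#include <atomic>
#include <iostream>
#include <thread>

int main()
{
    std::atomic<int> f{0};  // atomic flag: no data race, seq_cst by default
    int x = 0;

    std::thread s( [&f,&x](){ x=42; f=1; } );

    while( f==0 );               // each iteration performs an atomic load
    std::cout << x << std::endl; // guaranteed to print 42: x=42 happens-before
                                 // the store f=1 that the loop observed
    s.join();
}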

Klaus
  • *did not see that the variables are used from two thread context* - Another way to put this is: the compiler is allowed to assume there's no data-race UB, so it doesn't have to try to analyze across possible threads to see how non-atomic vars are used. It can always keep them in (thread private) registers instead of store/reload to shared memory. Thus [MCU programming - C++ O2 optimization breaks while loop](https://electronics.stackexchange.com/a/387478) – Peter Cordes Mar 28 '22 at 09:46
  • @Klaus, please see the edit – YoavKlein Mar 28 '22 at 10:11

The two main questions you have both have the same answer (yes!), but for different reasons.

First let's look at this particular piece of pseudo-machine-code

Let's say the instructions of thread 1 will be something like:

1 load f to register1
2 if register1 is 0 - jump to 1
3 load x to register2
4 print register2

What exactly may be out of order here? Can 3 come before 1-2?

To answer your question: this is a resounding "YES!". Since the contents of register1 are not tied in any way to the contents of register2, the CPU may happily (and correctly, for that matter) preload register2, so that when the 1-2 loop finally breaks, it can immediately go to 4.

For a practical example, the location read into register1 might be an I/O peripheral register tied to a polled serial clock, with the CPU just waiting for the clock to transition to low so that it can bit-bang the next value onto the data output lines. Doing it that way for one thing saves precious time on the data fetch, and more importantly may avoid contention on the peripheral data bus.

So, yes, this kind of reordering is perfectly fine and allowed, even with optimizations turned off, and it can happen on a single-threaded, single-core CPU. The only way to make sure that register2 is definitely read after the loop breaks is to insert a barrier.
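As a sketch of what such a barrier pair can look like at the C++ level (my example using std::atomic_thread_fence; the machine-level barriers this answer talks about are what these fences compile down to):

#include <atomic>
#include <iostream>
#include <thread>

std::atomic<int> f{0};
int x = 0;

void writer() {
    x = 42;
    // Release barrier: the store to x may not sink below the store to f.
    std::atomic_thread_fence(std::memory_order_release);
    f.store(1, std::memory_order_relaxed);
}

void reader() {
    while (f.load(std::memory_order_relaxed) == 0) { }
    // Acquire barrier: the load of x may not be hoisted above the loop.
    std::atomic_thread_fence(std::memory_order_acquire);
    std::cout << x << std::endl;  // guaranteed to print 42
}

int main() {
    std::thread t1(reader), t2(writer);
    t1.join(); t2.join();
}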

The second question is about cache coherence, and again the answer to whether memory barriers are needed is "yes, you need them". Cache coherence is an issue because modern CPUs don't talk to the system memory directly, but through their caches. As long as you're dealing with only a single CPU core and a single cache, coherence is not an issue, since all the threads running on the same core work against the same cache. However, the moment you have multiple cores with independent caches, their individual views of the system memory contents may differ, and some form of memory consistency model is required, either through explicit insertion of memory barriers or at the hardware level.

datenwolf
  • Is the flow of data between caches and main memory controlled by the CPU - i.e. are there instructions executed by the CPU that make this happen - or is it something that happens in the background without the CPU being aware of it? – YoavKlein Mar 28 '22 at 13:14
  • @YoavKlein that depends on the particular CPU architecture in use. Some architectures depend completely on explicit memory barriers being inserted into the instruction stream to update peer caches. Other CPUs will eventually reach cache consistency without explicit synchronization, but data may exist in an incoherent state for some time. And some architectures will not exhibit cache decoherence at all, and will always keep caches consistent across cores. – datenwolf Mar 28 '22 at 13:25
  • `modern cpus` - all Intel X86 server processors and some core i7 and later processors ensure cache coherency across processors. All Intel multi-core processors ensure cache coherency between cores. Cache coherence is only an issue if processors that don't externally expose their cache protocols are used on a multi-processor motherboard or if external bus mastering devices don't use the cache coherence protocol and device drivers for those devices don't use the equivalent of memory fences. – rcgldr Mar 28 '22 at 18:06
  • @rcgldr: *all* systems that we run Linux and/or `std::thread`s across have coherent caches, including all x86 ever, and all other mainstream ISAs like POWER and AArch64. Including multi-socket systems. (On clusters with non-coherent shared memory, we use it for message-passing, not running a single OS or multi-threaded C++ program across coherency domains.) Real CPUs all use some variant of [MESI](https://en.wikipedia.org/wiki/MESI_protocol) to maintain coherency, [not needing asm barriers for other threads to see relaxed stores](https://stackoverflow.com/a/58535118/224132). – Peter Cordes Mar 28 '22 at 21:39
  • @rcgldr: Also, x86 has always had cache-coherent DMA, right? Since cache control instructions like `clflush` didn't exist until much later than x86 CPUs with caches. Anyway, your comment implied some i7 processors *don't* ensure cache coherency across cores, but that's not true. – Peter Cordes Mar 28 '22 at 21:42
  • @PeterCordes - read my comment again. Some Intel Core i7 and most older non-server Intel processors don't externally expose their variant of MESI on their pins, which is an issue only if those processors are used on a multi-processor motherboard. – rcgldr Mar 28 '22 at 23:22
  • @rcgldr: You're right, I kind of skimmed the later part of your comment. "processor" is ambiguous between package and core, but I guess there was barely enough info to guess you meant a whole package. Anyway, you're saying there are some core i7 CPUs that could fit in a socket on a multi-socket motherboard, and actually boot, but not be cache-coherent? Like HEDT versions that use server dies / sockets? (If you can't actually boot the cores on a 2nd socket in such a system, IDK if it's really relevant.) I think that comment kind of muddies the waters, since that wouldn't be a valid x86. – Peter Cordes Mar 29 '22 at 02:05
  • @PeterCordes - I don't recall which generation of Intel processors this was. but apparently a generation where Xeon and Core i7 had the same pinouts. They may still be compatible. Doing a web search for "dual i7 motherboards" shows some hits, but there was some article about which i7's this would work for. – rcgldr Mar 29 '22 at 03:52