3

In a multi-producer, multi-consumer situation, if producers are writing into `int a` and consumers are reading from `int a`, do I need memory barriers around `int a`?

We all learned that shared resources should always be protected, and that the standard does not guarantee proper behavior otherwise.

However, on cache-coherent architectures visibility is ensured automatically, and the atomicity of MOV operations on 8-, 16-, 32- and 64-bit variables is guaranteed.

Therefore, why protect int a at all?

Peter Cordes
Kam
  • Visibility is not ensured automatically; for example, the compiler might keep the value for (int a) solely in a register, in which case changes to that value would never make it into any cache at all, let alone a cache on a different core or CPU. – Jeremy Friesner Nov 19 '14 at 03:45
  • @Jeremy This post tends to disagree http://www.1024cores.net/home/lock-free-algorithms/so-what-is-a-memory-model-and-how-to-cook-it/visibility – Kam Nov 19 '14 at 03:47
  • Read @JeremyFriesner's comment more carefully. That link is correct as far as it goes, but simply doesn't apply in the case he's pointing out. *If* the data is being written to/read from a *memory location*, then the cache controller will assure coherence. If, however, it's assigned to a register, the cache controller has *no* relevance to it at all--the cache controller only deals with memory, *not* registers. – Jerry Coffin Nov 19 '14 at 03:53
  • @Jerry So you're saying that if `thread1` is writing to `int a` and `thread2` tries to read `int a`, there's a possibility that `thread2` won't be able to see what `thread1` wrote? – Kam Nov 19 '14 at 03:57
  • OMG how did I not know that.... – Kam Nov 19 '14 at 04:05
  • Also, reading and writing the value may require more than one CPU instruction, which makes it non-atomic. – seand Nov 19 '14 at 04:19

2 Answers

7

At least in C++11 (or later), you don't need to (explicitly) protect your variable with a mutex or memory barriers.

You can use std::atomic to create an atomic variable. Changes to that variable are guaranteed to propagate across threads.

#include <atomic>
#include <iostream>
#include <thread>

std::atomic<int> a;

int main() {
    std::thread t1([] { a = 1; });            // thread 1
    t1.join();                                // make sure thread 2 really does run later

    std::thread t2([] { std::cout << a; });   // thread 2 (later): shows `a` has the value 1
    t2.join();
}

Of course, there's a little more to it than that--for example, there's no guarantee that std::cout works atomically, so you probably will have to protect that (if you try to write from more than one thread, anyway).
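
For example, here is a minimal sketch of guarding the stream with a std::mutex when two threads do print (the mutex and function names here are just illustrative, not something the standard library provides):

#include <atomic>
#include <iostream>
#include <mutex>
#include <thread>

std::atomic<int> a{0};
std::mutex cout_mutex;            // illustrative name: serializes access to std::cout

void print_a() {
    std::lock_guard<std::mutex> lock(cout_mutex);
    std::cout << a << '\n';       // the load of `a` is atomic; writing to the stream is not
}

int main() {
    a = 1;
    std::thread t1(print_a), t2(print_a);
    t1.join();
    t2.join();
}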

It's then up to the compiler/standard library to figure out the best way to handle the atomicity requirements. On a typical architecture that ensures cache coherence, it may mean nothing more than "don't allocate this variable in a register". It could impose memory barriers, but is only likely to do so on a system that really requires them.

On real-world C++ implementations (i.e. all of them, the same ones where volatile worked as a pre-C++11 way to roll your own atomics), no barriers are needed for inter-thread visibility, only for ordering with respect to operations on other variables. Most ISAs do, however, need special instructions or barriers for the default memory_order_seq_cst.

On the other hand, explicitly specifying memory ordering (especially acquire and release) for an atomic variable may allow you to optimize the code a bit. By default, an atomic operation uses sequentially consistent ordering (memory_order_seq_cst), which basically acts like there are barriers before and after the access--but in a lot of cases you only really need one or the other, not both. In those cases, explicitly specifying the memory ordering can let you relax the ordering to the minimum you actually need, allowing the compiler to improve optimization.
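
As one hedged illustration (the payload/ready names are mine, just for the sketch), the common pattern of publishing data through a flag only needs a release store paired with an acquire load:

#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                    // ordinary, non-atomic data
std::atomic<bool> ready{false};     // atomic flag used to publish it

void producer() {
    payload = 42;                                    // plain store
    ready.store(true, std::memory_order_release);    // release: the payload store can't be reordered after this
}

void consumer() {
    while (!ready.load(std::memory_order_acquire))   // acquire: the payload read can't be reordered before this
        ;                                            // spin until the flag is observed
    assert(payload == 42);                           // guaranteed by the release/acquire pairing
}

int main() {
    std::thread p(producer), c(consumer);
    p.join();
    c.join();
}

On x86, for instance, both the release store and the acquire load compile to plain MOV instructions, whereas a seq_cst store would need an extra full barrier (or an xchg).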

(Not all ISAs actually need separate barrier instructions even for seq_cst; notably AArch64 just has a special interaction between stlr and ldar to stop seq_cst stores from reordering with later seq_cst loads, on top of acquire and release ordering. So it's as weak as the C++ memory model allows, while still complying with it. But weaker orders, like memory_order_acquire or relaxed, can avoid even that blocking of reordering when it's not needed.)

Peter Cordes
Jerry Coffin
  • *An atomic basically acts like you've inserted barriers before and after access* - you mean because `memory_order_seq_cst` is the default ordering for `std::atomic`? You need to use `a.store(1, std::memory_order_release)` if you want it to be cheaper. – Peter Cordes Jul 20 '22 at 16:53
  • If you're talking about non-`atomic` variables, compiler-barriers don't create atomicity. Sometimes you get lucky, but [Which types on a 64-bit computer are naturally atomic in gnu C and gnu C++? -- meaning they have atomic reads, and atomic writes](https://stackoverflow.com/a/71867102) shows a counterexample of `*u64_ptr = 0xdeadbeefdeadbeef;` on AArch64 being done in two halves. See also [Who's afraid of a big bad optimizing compiler?](https://lwn.net/Articles/793253/) re: rolling your own atomics with barriers and `volatile` (in the Linux kernel, which only has to compile with GCC or clang). – Peter Cordes Jul 20 '22 at 16:58
  • Compiler barriers like `asm("" ::: "memory");` can create inter-thread *visibility*, but with no guarantee of atomicity. C++11 fences like `atomic_thread_fence(memory_order_release);` would typically include a compiler barrier at least as an implementation detail, and probably works in practice on most implementations to make a plain `int` visible. (But in terms of standards compliance, it doesn't avoid data-race UB if you don't have synchronization.) – Peter Cordes Jul 20 '22 at 17:58
  • "you mean because memory_order_seq_cst is the default ordering for std::atomic?" Yes, exactly. To me, calling an `asm` block a "compiler barrier" seems like kind of poor terminology, but if everybody agrees that's the right term, I guess it doesn't matter. – Jerry Coffin Jul 20 '22 at 18:11
  • In GNU C, a `"memory"` clobber forces the compiler to have memory in sync with the C abstract machine, at least for any objects that are potentially globally reachable (escape analysis can let it still keep local vars in registers). One use-case is to force compile-time ordering (aka "compiler barrier"), like `atomic_signal_fence` as opposed to `atomic_thread_fence`. Preventing compile-time reordering gives you the ordering semantics of whatever target ISA you're compiling for. https://preshing.com/20120625/memory-ordering-at-compile-time/ uses the term "compiler barrier" this way. – Peter Cordes Jul 20 '22 at 18:16
  • Anyway, I find that part of your answer strange. If you're talking about weakening the ordering for `std::atomic x` stores and loads, you don't do that with explicit barriers or lack thereof. Instead you use `x.store(1, std::memory_order_relaxed);` You don't want to be using explicit barriers in the C++ source. – Peter Cordes Jul 20 '22 at 18:20
  • You can *think of* it in terms of a model where your C++ compiles to barrier/store/barrier, but that's not how AArch64 works, for example. seq_cst uses `stlr` / `ldar`, which can't reorder *with each other*, but otherwise `stlr` is a release-store that can reorder with later loads. An SC store does *not* have to stall this thread until it drains the store buffer. On x86 that's the best we can do, but AArch64 is a better match for the C++11 memory model, being as strong as necessary without being much stronger. – Peter Cordes Jul 20 '22 at 18:21
  • @PeterCordes: Maybe I should do a bit of rewriting to make it more clear, but that's exactly what I was talking about--explicitly specifying the ordering you want (relaxed, acquire or release). – Jerry Coffin Jul 20 '22 at 18:21
  • Ok, using weaker-than-SC *operations* is very different from using explicit *barriers*. I think most people would read that as I did, as talking about `atomic_thread_fence(release)`. (Specifically talking about the phrasing: *In those cases, explicit memory barriers can let you remove the barriers you don't need.* - It seems you meant explicit memory **orders**; just a one word difference (in 3 places), but a very important word!) – Peter Cordes Jul 20 '22 at 18:23
  • @PeterCordes: I've done a bit of editing. I'm not sure I'm happy with the wording as it is, but I don't have any more time to spend on it right now (have a demo for work in about 3 hours). – Jerry Coffin Jul 20 '22 at 19:12
  • Yeah, it's pretty good now. The 2nd-last paragraph suggests that barriers might be needed even for inter-thread visibility, which is hypothetically possible but all real C++ implementations only start threads across cores with coherent cache. (Unless there are some GPU implementations?) That's [why `volatile` worked as a way to roll your own before C++11](https://stackoverflow.com/questions/4557979/when-to-use-volatile-with-multi-threading/58535118#58535118). – Peter Cordes Jul 20 '22 at 19:37
  • The 2nd-last para suggested that normally no barriers are needed for atomic vars, but then the last paragraph says they are (because of seq_cst). So yeah, they kind of clash with each other in the simplifications they're making. I made an edit to take a stab at it; I hope I haven't over-complicated it over the heads of the intended audience. – Peter Cordes Jul 20 '22 at 19:44
5

However, on cache-coherent architectures visibility is ensured automatically, and the atomicity of MOV operations on 8-, 16-, 32- and 64-bit variables is guaranteed.

Unless you strictly adhere to the requirements of the C++ spec to avoid data races, the compiler is not obligated to make your code function the way it appears to. For example:

int a = 0, b = 0; // shared variables, initialized to zero

a = 1;
b = 1;

Say you do this on your fully cache-coherent architecture. On such hardware it would seem that, since a is written before b, no thread will ever be able to see b with a value of 1 without a also having that value.

But this is not the case. If you have failed to strictly adhere to the requirements of the C++ memory model for avoiding data races, e.g. you read these variables without the correct synchronization primitives being inserted anywhere, then your program may in fact observe b being written before a. The reason is that you have introduced "undefined behavior", and the C++ implementation has no obligation to do anything that makes sense to you.

What may be going on in practice is that the compiler may reorder the writes, even though the hardware works very hard to make it seem as if all writes occur in the order of the machine instructions performing them. You need the entire toolchain to cooperate, and cooperation from just the hardware, such as strong cache coherency, is not sufficient.
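
For contrast, here is a minimal sketch (mine, not part of the original example) of the same two writes done through std::atomic with the default seq_cst ordering, which does oblige the whole toolchain to preserve the order an observer can see:

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> a{0}, b{0};   // the same two shared variables, now atomic

void writer() {
    a = 1;                     // seq_cst store (the default ordering)
    b = 1;
}

void reader() {
    if (b == 1)                // seq_cst load
        assert(a == 1);        // with atomics, b == 1 can no longer be seen while a == 0
}

int main() {
    std::thread w(writer), r(reader);
    w.join();
    r.join();
}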


The book C++ Concurrency in Action is a good source if you care to learn about the details of the C++ memory model and writing portable, concurrent code in C++.

Jerry Coffin
bames53
  • Related: [Why is integer assignment on a naturally aligned variable atomic on x86?](https://stackoverflow.com/a/36685056) - as you say, you need `std::atomic` to take advantage of the hardware capabilities. You can use `std::memory_order_relaxed` to get cheap loads/stores without barriers (or acquire and release on x86 are also "free") but without optimizing them into registers. – Peter Cordes Jul 20 '22 at 16:51