No, barriers alone aren't sufficient. You also need `volatile` for access to shared data if you want to roll your own atomics the way the Linux kernel does, relying on the behaviour of a few known compilers (instead of portable ISO C facilities like `<stdatomic.h>`).
In GNU C, a compiler memory barrier like `asm("" ::: "memory")` can force the compiler to emit asm that re-reads a non-`volatile` `int` instead of just keeping its value in a register. So for example you might think you could write `while (b != 1) asm("" ::: "memory");` to force re-reads of `b` in the spin loop, unlike your broken code, which will optimize like `if (b != 1) { while(true){} }`.
But that's not the only thing you have to worry about: without `volatile`, the compiler is allowed to invent reads. e.g. if you do `int tmp = shared_var;` and then use `tmp` multiple times, the compiler might decide it's cheaper to just reload `shared_var` again for one of the later uses. So your program might act like `tmp` changed value, leading to inconsistent behaviour.
See the LWN article *Who's afraid of a big bad optimizing compiler?* for that and many more possible problems; it explains why Linux code needs `WRITE_ONCE` / `READ_ONCE` or `ACCESS_ONCE` macros to do a `volatile` access to a shared variable. If you're rolling your own version of that, you don't need the variable itself to be declared `volatile`; it's sufficient to do `*(volatile int*)&b = 1;`. (That lets you access it efficiently in phases of your program where it's not shared. Or for a struct, when an instance of that struct isn't shared.)
GCC and clang do at least de-facto define the behaviour of `volatile` for multi-threading use-cases (to something like `_Atomic` with `memory_order_relaxed`), to support the Linux kernel and legacy code written before C++11 / C11 stdatomic gave us a portable, well-defined way to do all this stuff. Normally you should just use `stdatomic.h` functions in C, or the GNU C `__atomic` builtins. (Or the obsolete `__sync` builtins if you insist.) But `volatile` does still work on many known implementations, like GCC, clang, ICC, and MSVC, if you know exactly what you're doing and get everything right.
`volatile` makes GCC do the access with a single full-width access if it can. e.g. for the example of a non-atomic `int64_t` store on AArch64 with a constant where high half = low half, plain code can use `stp` of two 32-bit halves, but with `volatile` GCC generates the full 64-bit value in one register, so you get atomicity when it can happen for free with a type of that width. (It doesn't go out of its way to do 64-bit atomicity on 32-bit targets, though, unlike `__atomic_load_n`, which will use SSE2 or MMX to do a 64-bit load on 32-bit x86.)
ISO C doesn't guarantee anything about `volatile`; data-race UB still applies to it. But all real-world systems where we run multiple threads of the same process on multiple cores have cache-coherent shared memory between those cores, so a `volatile` load or store that actually happens in the asm does give visibility. No run-time ordering, though.
BTW, `__sync_synchronize` is a very expensive definition for a read-memory-barrier or write-memory-barrier (acquire/release fences that don't need to block StoreLoad reordering). `__sync_synchronize` is actually a full barrier, like `atomic_thread_fence(memory_order_seq_cst);`.