
Hi, I had a general question regarding the use of volatile and memory barriers in C when making changes to shared memory that is concurrently accessed by multiple threads without locks. As I understand it, volatile and memory barriers serve the following general purposes:

  1. Memory barriers

A) Make sure that all pending memory accesses (reads and/or writes, depending on the barrier) have properly completed before the barrier, and only then are the memory accesses following the barrier executed.

B) Make sure that the compiler does not reorder load/store instructions (depending on the barrier) across the barrier.

Basically, the purpose of point A is to handle out-of-order execution and store-buffer flush delays, where the processor itself reorders either the instructions generated by the compiler or the memory accesses those instructions make. The purpose of point B is that, when C code is translated to machine code, the compiler does not itself move those accesses around in the generated assembly. (See the sketch after this list.)

  2. Now for volatile: volatile is loosely meant to keep the compiler from applying its usual optimisations to code that uses volatile variables. It serves the following broad purposes:

A) Memory accesses are not cached in CPU registers when translating C code to machine code; every read in the source is converted into a load instruction that actually goes through memory.

B) The relative order of accesses to volatile variables is preserved in the generated assembly when the compiler transforms C code to machine code, while accesses to non-volatile variables may be interleaved with them.
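To make these two layers concrete, here is a minimal sketch (the function and variable names are made up for illustration) using the C11 fences: `atomic_signal_fence` is a compiler-only barrier in the sense of 1.B, `atomic_thread_fence` is a full barrier in the sense of 1.A plus 1.B, and the volatile loop shows the register-caching behaviour of 2.A:

```c
#include <stdatomic.h>

/* Compiler-only vs. full barrier (point 1) */
void barrier_demo(void)
{
    atomic_signal_fence(memory_order_seq_cst); /* blocks only compile-time
                                                  reordering (1.B); emits no
                                                  fence instruction */
    atomic_thread_fence(memory_order_seq_cst); /* also orders the CPU's
                                                  run-time memory accesses
                                                  (1.A), emitting a real
                                                  fence where needed */
}

/* Register caching (point 2.A) */
int plain_flag;
volatile int vol_flag;

void spin_plain(void)
{
    while (plain_flag == 0) { } /* compiler may load once, cache the value
                                   in a register, and spin forever */
}

void spin_vol(void)
{
    while (vol_flag == 0) { }   /* every iteration performs a fresh load
                                   from memory */
}
```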

I have the following questions:

  1. Is my understanding correct and complete? Are there cases I am missing, or anything I am saying that is incorrect?

  2. Then, whenever we write code making changes to shared memory that is concurrently accessed by multiple threads, we need barriers so that the behaviour in points 1.A and 1.B doesn't happen. The behaviour in point 2.B is handled by 1.B, and for 2.A we need to cast our pointer to a volatile pointer for the access. Basically, I am trying to understand: should we always cast the pointer to a volatile pointer before making the memory access, to be sure 2.A doesn't happen, or are there cases where using only barriers suffices?

Rohan Aggarwal

2 Answers

> 1. Is my understanding correct and complete?

Yeah, it looks that way, except for not mentioning that C11 <stdatomic.h> made all this obsolete for almost all purposes.

There are more bad/weird things that can happen without volatile (or better, _Atomic) that you didn't list: the LWN article *Who's afraid of a big bad optimizing compiler?* goes into detail about things like the compiler inventing extra loads (and expecting them both to read the same value). It's aimed at Linux kernel code, where C11 _Atomic isn't how they do things.
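As a hedged illustration of the invented-load problem (the names `lookup`, `table`, and `shared` are hypothetical), the compiler is allowed to replace the local `v` with a second load of `*shared`, so a bounds check and the use it guards can see different values:

```c
static int table[64];

/* Without volatile/_Atomic, the compiler may rematerialize the read of
 * *shared instead of keeping v in a register: one load for the range
 * check, a second load for the index. If another thread writes *shared
 * in between, the index can be out of bounds despite the check. */
int lookup(int *shared)
{
    int v = *shared;        /* may be compiled as two separate loads */
    if (v < 0 || v >= 64)
        return -1;
    return table[v];        /* out-of-bounds if *shared changed meanwhile */
}
```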

Other than in the Linux kernel, new code should pretty much always use <stdatomic.h> instead of rolling your own atomics with volatile plus inline asm for RMWs and barriers. But rolling your own does continue to work, because all real-world CPUs that we run threads across have coherent shared memory, so making a memory access happen in the asm is enough for inter-thread visibility, like memory_order_relaxed. See When to use volatile with multi threading? (basically never, except in the Linux kernel or maybe a handful of other codebases that already have good implementations of hand-rolled stuff).
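For example, here is a minimal sketch (the variable and function names are made up) of the classic publish pattern with <stdatomic.h>, doing the job that a hand-rolled volatile store plus barrier used to do:

```c
#include <stdatomic.h>

int payload;          /* plain data, written before the flag is set */
atomic_int ready;     /* _Atomic synchronization flag, zero-initialized */

void produce(void)
{
    payload = 42;
    atomic_store_explicit(&ready, 1, memory_order_release);
    /* release: all earlier stores are visible to any thread whose
       acquire load sees ready == 1 */
}

int consume(void)
{
    if (atomic_load_explicit(&ready, memory_order_acquire))
        return payload;   /* guaranteed to see 42 */
    return -1;            /* flag not set yet */
}
```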

In ISO C11, it's data-race undefined behaviour for two threads to do unsynchronized read+write on the same object, but mainstream compilers do define that behaviour, simply compiling the access the way you'd expect, so hardware guarantees (or the lack thereof) come into play.


Other than that, yeah, it looks accurate, except for your final question 2: there are use-cases for memory_order_relaxed atomics, which are like volatile with no barriers, e.g. an exit_now flag.
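A minimal sketch of such a flag (the function names are made up): it synchronizes nothing except itself, so relaxed ordering is sufficient.

```c
#include <stdatomic.h>
#include <stdbool.h>

atomic_bool exit_now;   /* standalone flag: no other data is published
                           through it, so relaxed ordering is enough */

void worker(void)
{
    while (!atomic_load_explicit(&exit_now, memory_order_relaxed)) {
        /* ... do work ... */
    }
}

void request_exit(void)   /* called from any thread */
{
    atomic_store_explicit(&exit_now, true, memory_order_relaxed);
}
```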

> or are there cases where using only barriers suffices?

No, unless you get lucky and the compiler happens to generate correct asm anyway.

Or unless other synchronization means this code only runs while no other threads are reading/writing the object. (C++20 has std::atomic_ref<T> to handle the case where some parts of the code need atomic access to data but other parts of your program don't, and you want to let those parts auto-vectorize or whatever. C doesn't have any such thing yet, other than using plain variables with or without GNU C __atomic_load_n() and the other builtins, which is how C++ headers implement std::atomic<T>, and which is the same underlying support that C11 _Atomic compiles to. Probably also the C11 functions like atomic_load_explicit defined in stdatomic.h; but unlike C++, _Atomic is a true keyword not defined in any header.)
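For completeness, a short sketch of those GNU C builtins on a plain variable (the wrapper function names are made up; the builtins themselves are real GCC/Clang ones):

```c
/* GNU C (GCC/Clang) builtins operating on a plain, non-_Atomic object */
int shared;   /* plain int: some code accesses it atomically, other code
                 (e.g. while holding a lock, or in single-threaded phases)
                 accesses it normally */

int read_acquire(void)
{
    return __atomic_load_n(&shared, __ATOMIC_ACQUIRE);
}

void write_release(int v)
{
    __atomic_store_n(&shared, v, __ATOMIC_RELEASE);
}
```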

Peter Cordes
  • Thank you Peter for the great detailed answer. – Rohan Aggarwal Feb 16 '22 at 06:35
  • @RohanAggarwal: Forgot to mention: you can get ordering on some ISAs without separate barriers, by using an acquire-load instruction or store-release. (e.g. ARM64 `ldapr` / `stlr`). This is significantly more efficient than a plain (relaxed) load + barrier on ISAs that provide such instructions. You can't get that with `volatile`, only with C11 or with inline asm for the actual load or store itself. (Or of course with GNU C `__atomic` builtins; https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html) – Peter Cordes Feb 16 '22 at 10:36
  • If reading the C standard strictly ([as done here](https://stackoverflow.com/a/58697222/584518)), then volatile access must act as a memory barrier and re-ordering across a volatile access is never allowed. That's not how it's done in practice on all systems though. As for atomic access, volatile doesn't guarantee anything - that's a separate issue. – Lundin Feb 16 '22 at 14:40

As far as the Standard is concerned, the semantics of volatile-qualified memory accesses are explicitly characterized as implementation-defined. They are characterized in this fashion on the presumption that people seeking to sell compilers will seek to understand and satisfy their customers' needs far better than the Committee ever could.

Implementations that seek to be maximally compatible with low-level code written for other implementations will treat volatile-qualified accesses as though they are preceded and followed by calls to functions the compiler knows nothing about, which might modify any storage that such a function would be able to modify. Depending upon the configuration of the execution environment, such treatment may or may not be sufficient to resolve race conditions. It would be adequate on most single-core (generally embedded) environments, or in an environment configured so that all threads associated with a particular program run on only one core at a time and are not migrated between cores without first flushing the cache. If there are enough independent tasks to keep all cores busy, code designed for such an environment may be more efficient than code which uses multi-processor synchronization primitives.

Unfortunately, even though every compiler needs to be capable of processing a volatile access that is preceded and followed by a call to a function whose behavior the implementation knows nothing about, there is no standard, mandatory way of indicating that all accesses to an object should be processed in a manner consistent with such semantics. The best one can probably do is define a compiler-vendor-specific macro which can be used before and after volatile accesses that may trigger actions affecting the abstract machine state. On some compilers these macros wouldn't need to do anything, but on others they could use compiler-specific syntax to force a "memory clobber", as sketched below.
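A minimal sketch of such a macro, assuming GCC/Clang (the macro name `COMPILER_FENCE` is made up, but the empty asm statement with a "memory" clobber is the standard GNU C idiom): it emits no instructions, yet forces the compiler to assume all memory may have been read or written across it.

```c
#if defined(__GNUC__)
#define COMPILER_FENCE() __asm__ __volatile__("" ::: "memory")
#else
#define COMPILER_FENCE() /* vendor-specific equivalent goes here */
#endif

extern volatile int device_reg;  /* example volatile object */

void poke_device(void)
{
    COMPILER_FENCE();   /* compiler may not cache memory values across... */
    device_reg = 1;     /* ...the volatile access... */
    COMPILER_FENCE();   /* ...in either direction */
}
```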

supercat