168

If two threads are accessing a global variable, many tutorials say to make the variable volatile to prevent the compiler from caching it in a register, where it would never get updated. But two threads both accessing a shared variable is something that calls for protection via a mutex, isn't it? And in that case, between the thread locking and releasing the mutex, the code is in a critical section where only that one thread can access the variable; so the variable doesn't need to be volatile?

So what is the use/purpose of volatile in a multi-threaded program?

teroi
David Preston
  • In some cases, you don't want/need protection by the mutex. – Stefan Mai Dec 29 '10 at 21:26
  • Sometimes it's fine to have a race condition, sometimes it isn't. How are you using this variable? – David Heffernan Dec 29 '10 at 21:28
  • @David: An example of when it is "fine" to have a race, please? – John Dibling Dec 29 '10 at 21:38
  • @John Here goes. Imagine you have a worker thread which is processing a number of tasks. The worker thread increments a counter whenever it finishes a task. The master thread periodically reads this counter and updates the user with news of the progress. So long as the counter is properly aligned to avoid tearing, there is no need to synchronise access. Although there is a race, it is benign. – David Heffernan Dec 29 '10 at 21:44
  • @David: It would be difficult to evaluate the safety of such a device without a complete examination of the code. Even if an examination concluded that the writes were atomic (questionable) and fully written through the cache (difficult to tell), I would still reject this as "bad code." Its safety would be extremely tenuous, and easily broken by the smallest changes to the code. Maintenance programmers would break this device easily, and the problems might not show up in testing. – John Dibling Dec 29 '10 at 21:49
  • @John The hardware on which this code runs guarantees that aligned variables cannot suffer from tearing. If the worker is updating n to n+1 as the reader reads, the reader doesn't care whether they get n or n+1. No important decisions will be taken, since it is only used for progress reporting. – David Heffernan Dec 29 '10 at 21:52
  • @David: I guess I don't know what you mean by "tearing." – John Dibling Dec 29 '10 at 21:58
  • @John Re tearing, I offer you the following from Joe Duffy: http://msdn.microsoft.com/en-us/magazine/cc817398.aspx – David Heffernan Dec 29 '10 at 22:07
  • @David: Wow, wall of text. :) But thanks, I haven't read this yet. I will when I get a chance. – John Dibling Dec 29 '10 at 22:09
  • @John It's all excellent stuff, but the bit on tearing is only a couple of paragraphs. – David Heffernan Dec 29 '10 at 22:10
  • http://isvolatileusefulwiththreads.com/ (also @DavidHeffernan, https://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong is the must-read piece on "benign" data races) – Jonathan Wakely Mar 18 '15 at 10:57
  • When to use: _never_. – alecov Jan 14 '17 at 18:39
  • @JohnDibling "_An example of when it is "fine" to have a race_": whenever it's fine to use a mutex. Or an atomic. Pretty much all non-trivial MT programs have harmless race conditions. – curiousguy Jul 15 '18 at 03:59
  • Related: [Electrical Engineering Stack Exchange: Using volatile in embedded C development](https://electronics.stackexchange.com/a/409570/26234). This says `volatile` is required in 2 places: 1) for memory-mapped registers, 2) when sharing global variables between an ISR context and your main context. – Gabriel Staples Apr 14 '22 at 23:14

5 Answers

200

Short & quick answer: volatile is (nearly) useless for platform-agnostic, multithreaded application programming. It does not provide any synchronization, it does not create memory fences, nor does it ensure the order of execution of operations. It does not make operations atomic. It does not make your code magically thread-safe. volatile may be the single most misunderstood facility in all of C++. See this, this and this for more information about volatile.

On the other hand, volatile does have a use that may not be so obvious. It can be used in much the same way one would use const: to help the compiler show you where you might be making a mistake by accessing a shared resource in an unprotected way. This use is discussed by Alexandrescu in this article. However, it is basically using the C++ type system in a way that is often viewed as a contrivance and can evoke Undefined Behavior.
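
(Editor's note: a rough sketch of the flavor of that technique - not Alexandrescu's full LockingPtr, just the idea; Widget and widget_mutex are made-up names:)

    #include <mutex>

    struct Widget {
        void work() {}              // not volatile-qualified
    };

    volatile Widget shared_widget;  // direct, unlocked use won't compile
    std::mutex widget_mutex;

    void safe_use() {
        std::lock_guard<std::mutex> lock(widget_mutex);
        // Cast volatile away only while the mutex is held:
        const_cast<Widget&>(shared_widget).work();
        // Strictly, accessing an object defined volatile through a
        // non-volatile lvalue is where the UB mentioned above comes in.
    }

    // shared_widget.work();  // error: work() is not volatile-qualified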

volatile was specifically intended to be used when interfacing with memory-mapped hardware, signal handlers, and the setjmp library facility. This makes volatile directly applicable to systems-level programming rather than normal applications-level programming.
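
(Editor's note: a minimal sketch of that intended use; the device and register address below are invented, a real one comes from the hardware datasheet:)

    #include <cstdint>

    // Hypothetical memory-mapped UART status register.
    volatile std::uint32_t* const UART_STATUS =
        reinterpret_cast<volatile std::uint32_t*>(0x40001000);

    bool uart_ready() {
        // volatile forces a real load on every call: the hardware can
        // change the register behind the compiler's back.
        return (*UART_STATUS & 0x1u) != 0;
    }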

The 2003 C++ Standard does not say that volatile applies any kind of Acquire or Release semantics on variables. In fact, the Standard is completely silent on all matters of multithreading. However, specific platforms do apply Acquire and Release semantics on volatile variables.

[Update for C++11]

The C++11 Standard now acknowledges multithreading directly in the memory model and the language, and it provides library facilities to deal with it in a platform-independent way. However, the semantics of volatile still have not changed. volatile is still not a synchronization mechanism. Bjarne Stroustrup says as much in The C++ Programming Language, 4th edition (TCPPPL4E):

Do not use volatile except in low-level code that deals directly with hardware.

Do not assume volatile has special meaning in the memory model. It does not. It is not -- as in some later languages -- a synchronization mechanism. To get synchronization, use atomic, a mutex, or a condition_variable.

[/End update]

The above all applies to the C++ language itself, as defined by the 2003 Standard (and now the 2011 Standard). Some specific platforms, however, do add additional functionality or restrictions to what volatile does. For example, in MSVC 2010 (at least), Acquire and Release semantics do apply to certain operations on volatile variables. From the MSDN:

When optimizing, the compiler must maintain ordering among references to volatile objects as well as references to other global objects. In particular,

A write to a volatile object (volatile write) has Release semantics; a reference to a global or static object that occurs before a write to a volatile object in the instruction sequence will occur before that volatile write in the compiled binary.

A read of a volatile object (volatile read) has Acquire semantics; a reference to a global or static object that occurs after a read of volatile memory in the instruction sequence will occur after that volatile read in the compiled binary.

However, note that if you follow the above link, there is some debate in the comments as to whether or not Acquire/Release semantics actually apply in this case.

Masoud Rahimi
John Dibling
  • +1 Very good links. The semantics of `volatile` should not be confused among languages, including what might be found in a dictionary :-) (It carries very strict memory-model semantics in Java, for instance, but that's a *different* language/environment). – Dec 29 '10 at 21:39
  • Some compilers do provide additional semantics for volatile that are useful for multithreaded development, but this is definitely not part of the standard. – Michael Dec 29 '10 at 21:41
  • Part of me wants to downvote this because of the condescending tone of the answer and the first comment. "volatile is useless" is akin to "manual memory allocation is useless". If you can write a multithreaded program without `volatile`, it is because you stood on the shoulders of people who used `volatile` to implement threading libraries. – Ben Jackson Dec 29 '10 at 22:19
  • @Ben just because something challenges your beliefs doesn't make it condescending – David Heffernan Dec 29 '10 at 22:25
  • @John: I think the first comment stating "if you downvote you don't understand" affected my interpretation of the tone. Other than that, I read it as a strong statement against the use of `volatile`, which is probably correct for most people writing application-level code. – Ben Jackson Dec 29 '10 at 22:42
  • @Ben: no, read up on what `volatile` actually **does** in C++. What @John said is *correct*, end of story. It has nothing to do with application code vs library code, or "ordinary" vs "god-like omniscient programmers" for that matter. `volatile` is unnecessary and useless for synchronization between threads. Threading libraries can't be implemented in terms of `volatile`; they have to rely on platform-specific details anyway, and when you rely on those, you no longer need `volatile`. – jalf Dec 29 '10 at 23:40
  • @jalf: "volatile is unnecessary and useless for synchronization between threads" (which is what you said) is not the same thing as "volatile is useless for multithreaded programming" (which is what John said in the answer). You are 100% correct, but I disagree with John (partially) - volatile can still be used for multithreaded programming (for a very limited set of tasks) – Feb 12 '11 at 19:31
  • E.g., see the tenth response (by "Spud") in the comments here for a legitimate use of volatile: http://software.intel.com/en-us/blogs/2007/11/30/volatile-almost-useless-for-multi-threaded-programming/ (although this IS somewhat x86-specific, as other platforms _may_ require cores to flush data to each other, which volatile obviously won't do) – Feb 12 '11 at 19:31
  • @Dan: So you've basically said "it is useful! ...assuming this, and ignoring this and that, and...". Anything can be true if you make enough assumptions; and you've certainly ditched C++ as a language with those. – GManNickG May 21 '11 at 06:50
  • @John: well, you never said it was useless for everything, and I never said that you said that either. You still said that it's useless for multithreaded programming, which is what I'm disputing. It's not particularly useful, but it's not entirely useless either. – May 21 '11 at 19:30
  • @GMan: Everything that is useful is only useful under a certain set of requirements or conditions. Volatile is useful for multithreaded programming under a strict set of conditions (and in some cases, may even be better (for some definition of better) than alternatives). You say "ignoring this, that, and...", but the case when volatile is useful for multithreading doesn't ignore anything. You made up something which I never claimed. Yes, the usefulness of volatile is limited, but it does exist - but we can all agree that it is NOT useful for synchronization. – May 21 '11 at 19:34
  • @Dan: My point was that `volatile` is a C++ language concept, not an implementation concept. Yet you said it was useful because of some implementation details. That has no implications on its usefulness in C++. – GManNickG May 21 '11 at 19:37
  • @JohnDib: I think you should consider adding to your answer that `volatile` *has* release & acquire [semantics on Visual C++](http://msdn.microsoft.com/en-us/library/12a04hfd%28v=vs.80%29.aspx) (also see my answer here: http://stackoverflow.com/questions/6995310/is-volatile-bool-for-thread-control-considered-wrong/6995486#6995486) – Martin Ba Aug 09 '11 at 11:42
  • @Martin: I have recently made an extensive elaborative edit to this answer. I included some platform-specific details. – John Dibling Jul 12 '12 at 19:40
  • I can hardly think of a real-world example of using volatile in normal applications-level programming; is there one? – Baiyan Huang Aug 29 '12 at 08:18
  • What is "Windows 2010"? – fredoverflow Jan 18 '15 at 22:18
  • @FredOverflow: A typo. – John Dibling Jan 20 '15 at 13:55
  • This needs an update to mention [`std::atomic`](http://en.cppreference.com/w/cpp/header/atomic) – Mgetz Mar 17 '15 at 20:13
  • @Mgetz: the Jan 20 edit included mention of atomic and other devices. – John Dibling Mar 18 '15 at 09:56
  • "_Do not use volatile except in low-level code that deals directly with hardware._" Strongly disagree. Volatile is useful for signal handlers, for some cases of multithreaded code, for testing... It just isn't a replacement for atomics in most cases. – curiousguy Jun 28 '18 at 01:40
  • @curiousguy The Unix self-pipe trick is more useful in signal handlers than volatile and allows for much more robust code. – Maxim Egorushkin Jul 25 '18 at 11:25
50

In C++11, don't use volatile for threading, only for MMIO

But TL:DR: it does "work" sort of like atomic with mo_relaxed on hardware with coherent caches (i.e. everything); it is sufficient to stop compilers from keeping vars in registers. atomic doesn't need memory barriers to create atomicity or inter-thread visibility, only to make the current thread wait before/after an operation to create ordering between this thread's accesses to different variables. mo_relaxed never needs any barriers, just load, store, or RMW.

In the bad old days before C++11 std::atomic, roll-your-own atomics built with volatile (and inline asm for barriers) were the only good way to get some things to work. But they depended on a lot of assumptions about how implementations worked and were never guaranteed by any standard.

For example the Linux kernel still uses its own hand-rolled atomics with volatile, but only supports a few specific C implementations (GNU C, clang, and maybe ICC). Partly that's because of GNU C extensions and inline asm syntax and semantics, but also because it depends on some assumptions about how compilers work.

It's almost always the wrong choice for new projects; you can use std::atomic (with std::memory_order_relaxed) to get a compiler to emit the same efficient machine code you could with volatile. std::atomic with mo_relaxed obsoletes volatile for threading purposes. (except maybe to work around missed-optimization bugs with atomic<double> on some compilers.)

The internal implementation of std::atomic on mainstream compilers (like gcc and clang) does not just use volatile internally; compilers directly expose atomic load, store and RMW builtin functions. (e.g. GNU C __atomic builtins which operate on "plain" objects.)


Volatile is usable in practice (but don't do it)

That said, volatile is usable in practice for things like an exit_now flag on all(?) existing C++ implementations on real CPUs, because of how CPUs work (coherent caches) and shared assumptions about how volatile should work. But not much else, and it is not recommended. The purpose of this answer is to explain how existing CPUs and C++ implementations actually work. If you don't care about that, all you need to know is that std::atomic with mo_relaxed obsoletes volatile for threading.

(The ISO C++ standard is pretty vague on it, just saying that volatile accesses should be evaluated strictly according to the rules of the C++ abstract machine, not optimized away. Given that real implementations use the machine's memory address-space to model C++ address space, this means volatile reads and assignments have to compile to load/store instructions to access the object-representation in memory.)


As another answer points out, an exit_now flag is a simple case of inter-thread communication that doesn't need any synchronization: it's not publishing that array contents are ready or anything like that. Just a store that's noticed promptly by a not-optimized-away load in another thread.

    // global
    bool exit_now = false;

    // in one thread
    while (!exit_now) { do_stuff(); }

    // in another thread, or signal handler in this thread
    exit_now = true;

Without volatile or atomic, the as-if rule and the assumption of no data-race UB allow a compiler to optimize it into asm that only checks the flag once, before entering (or not) an infinite loop. This is exactly what happens in real life with real compilers. (And they usually optimize away much of do_stuff, because the loop never exits, so any later code that might have used the result is unreachable if we enter the loop.)

    // Optimizing compilers transform the loop into asm like this:
    if (!exit_now) {          // check once before entering loop
        while(1) do_stuff();  // infinite loop
    }

Multithreading program stuck in optimized mode but runs normally in -O0 is an example (with a description of GCC's asm output) of how exactly this happens with GCC on x86-64. Also MCU programming - C++ O2 optimization breaks while loop on electronics.SE shows another example.

We normally want aggressive optimizations that CSE and hoist loads out of loops, including for global variables.

Before C++11, volatile bool exit_now was one way to make this work as intended (on normal C++ implementations). But in C++11, data-race UB still applies to volatile so it's not actually guaranteed by the ISO standard to work everywhere, even assuming HW coherent caches.

Note that for wider types, volatile gives no guarantee of lack of tearing. I ignored that distinction here for bool because it's a non-issue on normal implementations. But that's also part of why volatile is still subject to data-race UB instead of being equivalent to relaxed atomic.

Note that "as intended" doesn't mean the thread doing exit_now waits for the other thread to actually exit. Or even that it waits for the volatile exit_now = true store to be globally visible before continuing to later operations in this thread. (atomic<bool> with the default mo_seq_cst would make it wait before any later seq_cst loads, at least. On many ISAs you'd just get a full barrier after the store.)

C++11 provides a non-UB way that compiles the same

A "keep running" or "exit now" flag should use std::atomic<bool> flag with mo_relaxed

Using

  • flag.store(true, std::memory_order_relaxed)
  • while( !flag.load(std::memory_order_relaxed) ) { ... }

will give you the exact same asm (with no expensive barrier instructions) that you'd get from volatile flag.
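
(Editor's note: put together as a complete program, this looks like the sketch below; the worker body is elided:)

    #include <atomic>
    #include <thread>

    std::atomic<bool> exit_now{false};

    void worker() {
        while (!exit_now.load(std::memory_order_relaxed)) {
            // do_stuff();
        }
    }

    int main() {
        std::thread t(worker);
        exit_now.store(true, std::memory_order_relaxed);  // e.g. from another thread
        t.join();
    }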

As well as no-tearing, atomic also gives you the ability to store in one thread and load in another without UB, so the compiler can't hoist the load out of a loop. (The assumption of no data-race UB is what allows the aggressive optimizations we want for non-atomic non-volatile objects.) This feature of atomic<T> is pretty much the same as what volatile does for pure loads and pure stores.

atomic<T> also makes += and so on into atomic RMW operations (significantly more expensive than an atomic load into a temporary, an operation on it, then a separate atomic store). If you don't want an atomic RMW, write your code with a local temporary.
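
(Editor's note: a sketch of that trade-off; `hits` is a hypothetical counter with a single writer thread:)

    #include <atomic>

    std::atomic<int> hits{0};

    void tally(int n) {
        hits += n;  // atomic RMW: lock add / LL-SC retry loop; relatively expensive

        // If only this thread ever writes `hits`, a separate load and store
        // avoid the RMW cost (but the sequence as a whole is NOT atomic):
        int tmp = hits.load(std::memory_order_relaxed);
        hits.store(tmp + n, std::memory_order_relaxed);
    }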

With the default seq_cst ordering you'd get from while(!flag), it also adds ordering guarantees wrt. non-atomic accesses, and to other atomic accesses.

(In theory, the ISO C++ standard doesn't rule out compile-time optimization of atomics. But in practice compilers don't because there's no way to control when that wouldn't be ok. There are a few cases where even volatile atomic<T> might not be enough control over optimization of atomics if compilers did optimize, so for now compilers don't. See Why don't compilers merge redundant std::atomic writes? Note that wg21/p0062 recommends against using volatile atomic in current code to guard against optimization of atomics.)
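
(Editor's note: for instance, this is the kind of code the standard would allow compilers to optimize but which they currently leave alone:)

    #include <atomic>

    std::atomic<int> progress{0};

    void two_updates() {
        progress.store(1, std::memory_order_relaxed);
        progress.store(2, std::memory_order_relaxed);
        // The as-if rule would permit collapsing these into a single
        // store of 2, but current gcc/clang emit both stores.
    }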


volatile does actually work for this on real CPUs (but still don't use it)

even with weakly-ordered memory models (non-x86). But don't actually use it, use atomic<T> with mo_relaxed instead!! The point of this section is to address misconceptions about how real CPUs work, not to justify volatile. If you're writing lockless code, you probably care about performance. Understanding caches and the costs of inter-thread communication is usually important for good performance.

Real CPUs have coherent caches / shared memory: after a store from one core becomes globally visible, no other core can load a stale value. (See also Myths Programmers Believe about CPU Caches which talks some about Java volatiles, equivalent to C++ atomic<T> with seq_cst memory order.)

When I say load, I mean an asm instruction that accesses memory. That's what a volatile access ensures, and is not the same thing as lvalue-to-rvalue conversion of a non-atomic / non-volatile C++ variable. (e.g. local_tmp = flag or while(!flag)).

The only thing you need to defeat is compile-time optimizations that don't reload at all after the first check. Any load+check on each iteration is sufficient, without any ordering. Without synchronization between this thread and the main thread, it's not meaningful to talk about when exactly the store happened, or ordering of the load wrt. other operations in the loop. Only when it's visible to this thread is what matters. When you see the exit_now flag set, you exit. Inter-core latency on a typical x86 Xeon can be something like 40ns between separate physical cores.


In theory: C++ threads on hardware without coherent caches

I don't see any way this could be remotely efficient, with just pure ISO C++ without requiring the programmer to do explicit flushes in the source code.

In theory you could have a C++ implementation on a machine that wasn't like this, requiring compiler-generated explicit flushes to make things visible to other threads on other cores. (Or for reads to not use a maybe-stale copy). The C++ standard doesn't make this impossible, but C++'s memory model is designed around being efficient on coherent shared-memory machines. E.g. the C++ standard even talks about "read-read coherence", "write-read coherence", etc. One note in the standard even points out the connection to hardware:

http://eel.is/c++draft/intro.races#19

[ Note: The four preceding coherence requirements effectively disallow compiler reordering of atomic operations to a single object, even if both operations are relaxed loads. This effectively makes the cache coherence guarantee provided by most hardware available to C++ atomic operations. — end note ]

There's no mechanism for a release store to only flush itself and a few select address-ranges: it would have to sync everything because it wouldn't know what other threads might want to read if their acquire-load saw this release-store (forming a release-sequence that establishes a happens-before relationship across threads, guaranteeing that earlier non-atomic operations done by the writing thread are now safe to read. Unless it did further writes to them after the release store...) Or compilers would have to be really smart to prove that only a few cache lines needed flushing.

Related: my answer on Is mov + mfence safe on NUMA? goes into detail about the non-existence of x86 systems without coherent shared memory. Also related: Loads and stores reordering on ARM for more about loads/stores to the same location.

There are I think clusters with non-coherent shared memory, but they're not single-system-image machines. Each coherency domain runs a separate kernel, so you can't run threads of a single C++ program across it. Instead you run separate instances of the program (each with their own address space: pointers in one instance aren't valid in the other).

To get them to communicate with each other via explicit flushes, you'd typically use MPI or other message-passing API to make the program specify which address ranges need flushing.


Real hardware doesn't run std::thread across cache coherency boundaries:

Some asymmetric ARM chips exist, with a shared physical address space but not inner-shareable cache domains. So not coherent. (e.g. an A8 core and a Cortex-M3 core on the same chip, like the TI Sitara AM335x; see the comment thread).

But different kernels would run on those cores, not a single system image that could run threads across both cores. I'm not aware of any C++ implementations that run std::thread threads across CPU cores without coherent caches.

For ARM specifically, GCC and clang generate code assuming all threads run in the same inner-shareable domain. In fact, the ARMv7 ISA manual says

This architecture (ARMv7) is written with an expectation that all processors using the same operating system or hypervisor are in the same Inner Shareable shareability domain

So non-coherent shared memory between separate domains is only a thing for explicit system-specific use of shared memory regions for communication between different processes under different kernels.

See also this CoreCLR discussion about code-gen using dmb ish (Inner Shareable barrier) vs. dmb sy (System) memory barriers in that compiler.

I make the assertion that no C++ implementation for any other ISA runs std::thread across cores with non-coherent caches. I don't have proof that no such implementation exists, but it seems highly unlikely. Unless you're targeting a specific exotic piece of HW that works that way, your thinking about performance should assume MESI-like cache coherency between all threads. (Preferably use atomic<T> in ways that guarantee correctness, though!)


Coherent caches make it simple

But on a multi-core system with coherent caches, implementing a release-store just means ordering commit into cache for this thread's stores, not doing any explicit flushing. (https://preshing.com/20120913/acquire-and-release-semantics/ and https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/). (And an acquire-load means ordering access to cache in the other core).

A memory barrier instruction just blocks the current thread's loads and/or stores until the store buffer drains; that always happens as fast as possible on its own. (Or for LoadLoad / LoadStore barriers, block until previous loads have completed.) (Does a memory barrier ensure that the cache coherence has been completed? addresses this misconception). So if you don't need ordering, just prompt visibility in other threads, mo_relaxed is fine. (And so is volatile, but don't do that.)

See also C/C++11 mappings to processors
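
(Editor's note: a few representative entries from that mapping, written as code; exact instruction choices vary by compiler and version:)

    #include <atomic>

    std::atomic<int> x{0};

    int  load_seq_cst()       { return x.load(); }  // x86: plain mov; ARMv8: ldar
    void store_seq_cst(int v) { x.store(v); }       // x86: mov+mfence or xchg; ARMv8: stlr
    void store_release(int v) {                     // x86: plain mov; ARMv8: stlr
        x.store(v, std::memory_order_release);
    }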

Fun fact: on x86, every asm store is a release-store because the x86 memory model is basically seq-cst plus a store buffer (with store forwarding).


Semi-related re: store buffer, global visibility, and coherency: C++11 guarantees very little. Most real ISAs (except PowerPC) do guarantee that all threads can agree on the order of appearance of two stores by two other threads. (In formal computer-architecture memory-model terminology, they're "multi-copy atomic".)

Another misconception is that memory fence asm instructions are needed to flush the store buffer for other cores to see our stores at all. Actually the store buffer is always trying to drain itself (commit to L1d cache) as fast as possible, otherwise it would fill up and stall execution. What a full barrier / fence does is stall the current thread until the store buffer is drained, so our later loads appear in the global order after our earlier stores.

(x86's strongly ordered asm memory model means that volatile on x86 may end up giving you closer to mo_acq_rel, except that compile-time reordering with non-atomic variables can still happen. But most non-x86 have weakly-ordered memory models so volatile and atomic<> with relaxed are about as weak as relaxed allows.)


Atomicity

Some compilers (GCC for example) do maintain atomicity for volatile accesses where they don't for plain accesses, for types of register-width or narrower on the target architecture. The Linux kernel relies on this to implement its own atomics using volatile and inline asm() statements for memory ordering, like barriers or AArch64 acquire-loads. See also Who's afraid of a big bad optimizing compiler? for more about why plain non-volatile variables wouldn't work even with memory barriers that stop the compiler from keeping things in registers.
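
(Editor's note: the core of that technique is a cast to volatile to force a real access. A simplified user-space rendering of the kernel's READ_ONCE/WRITE_ONCE idea; the real macros are C and handle more cases:)

    template <typename T>
    T read_once(const T& x) {
        return *const_cast<const volatile T*>(&x);  // forces an actual load
    }

    template <typename T>
    void write_once(T& x, T v) {
        *const_cast<volatile T*>(&x) = v;           // forces an actual store
    }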

See Which types on a 64-bit computer are naturally atomic in gnu C and gnu C++? (meaning they have atomic reads and atomic writes) for an example of a plain uint64_t assignment not being guaranteed atomic on AArch64, even though it's not optimized away. With a constant that has two identical 32-bit halves, GCC uses stp to store the same register twice; early AArch64 revisions didn't guarantee atomicity for that. But with volatile, it constructs the full 64-bit constant in a register for one plain store, which is guaranteed atomic if naturally aligned.
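
(Editor's note: a sketch of the kind of test that shows this; the variable names are invented and the exact codegen depends on the GCC version:)

    #include <cstdint>

    std::uint64_t plain;
    volatile std::uint64_t vol;

    void store_plain() {
        // Two identical 32-bit halves: AArch64 GCC has used stp of the same
        // 32-bit register twice, which early AArch64 revisions didn't
        // guarantee to be a single atomic 64-bit store.
        plain = 0xdeadbeefdeadbeefULL;
    }

    void store_volatile() {
        // With volatile, the full constant is built in one register and
        // stored with a single str: atomic if naturally aligned.
        vol = 0xdeadbeefdeadbeefULL;
    }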

Peter Cordes
  • Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackoverflow.com/rooms/202045/discussion-on-answer-by-peter-cordes-when-to-use-volatile-with-multi-threading). – Samuel Liew Nov 08 '19 at 10:43
  • Great write-up. This is exactly what I was looking for (giving *all* the facts) instead of a blanket statement that just says "use atomic instead of volatile for a single global shared boolean flag". – bernie Nov 28 '19 at 14:42
  • @bernie: I wrote this after getting frustrated by repeated claims that not using `atomic` could lead to different threads having different values for the same variable *in cache*. /facepalm. In cache, no; in CPU *registers*, yes (with non-atomic variables); CPUs use coherent cache. I wish other questions on SO weren't full of explanations for `atomic` that spread misconceptions about how CPUs work. (Because that's a useful thing to understand for performance reasons, and also helps to explain why the ISO C++ atomic rules are written as they are.) – Peter Cordes Nov 28 '19 at 14:47
  • @PeterCordes _With the default seq_cst ordering you'd get from while(!flag), it also adds ordering guarantees wrt. non-atomic accesses_ - are you saying that mo_seq_cst forbids reordering of non-mo_seq_cst around mo_seq_cst? – Daniel Nitzan Jan 29 '21 at 22:18
  • @DanielNitzan: yes, a seq_cst load can synchronize-with a release or seq-cst store in another thread, so any loads in the source after that spin-wait had better be after it in the asm as well. Because ISO C++ says it's safe to read non-atomic variables that were written before that release-store (as long as they aren't still being written by other later stores). It's not a 2-way barrier, though; in theory a seq_cst load could happen earlier than it appears in source order. In practice IDK if gcc/clang will combine earlier accesses with later across a seq_cst load. (rough descriptions...) – Peter Cordes Jan 29 '21 at 22:55
  • @PeterCordes oh, right, seq_cst operations have strel/ldacq semantics. – Daniel Nitzan Jan 30 '21 at 12:36
  • @PeterCordes _And an acquire-load means ordering access to cache in the other core_; [here](https://stackoverflow.com/a/58071051/3525027) you claimed that the LoadLoad part of the read-acquire semantics is guaranteed by the memory-order machine clear. This is cheaper than blindly ordering load accesses to the cache. – Daniel Nitzan Jan 30 '21 at 12:43
  • @PeterCordes _A memory barrier instruction just blocks the current thread's loads and/or stores until the store buffer drains;_ should that be a _full_ memory barrier? Or at least specifically a StoreLoad barrier? – Daniel Nitzan Jan 30 '21 at 16:33
  • _What a full barrier / fence does is stall the current thread until the store buffer is drained_ - not a complete stall; OoO execution can still proceed, i.e. arithmetic instructions on registers, speculative execution of stores and loads, etc. – Daniel Nitzan Jan 30 '21 at 16:46
  • @DanielNitzan: yes, sometimes I simplify to get the basic concept across, glossing over details like that. Note that MFENCE on Skylake unfortunately *does* stall the whole thread, as [an implementation detail post microcode update](https://stackoverflow.com/a/50496379). Apparently Intel wanted MFENCE to order NT loads from WC memory, but it previously didn't, and the only way they could do that with just a microcode update was to basically add LFENCE to it. But fortunately they didn't slow down [`lock`ed instructions](https://stackoverflow.com/q/40409297), so they're the preferred barrier. – Peter Cordes Jan 30 '21 at 16:52
  • @DanielNitzan: "*blocks the current thread's loads and/or stores until the store buffer drains*" - you're right that doesn't apply to a LoadLoad or LoadStore barrier. For StoreStore, it blocks *commit* of younger stores until older stores have committed, i.e. the order in which the store buffer drains. That's effectively like putting a divider on a grocery-store conveyor belt, and does amount to draining the SB before any younger stores (can become visible). Anyway, edited to fix that, thanks; that was specific enough to actually be a problem. – Peter Cordes Jan 30 '21 at 18:00
  • What about non-cache-coherent architectures like Intel SCC (Single-chip Cloud)? I can imagine that a language running on such a platform will stick to the _strictest and minimal guarantees_ that are allowed in its memory model. The guarantees provided by (cross-thread) cache coherence are not a part of that memory model. That would mean that relying on cache coherence is _strictly_ writing non-portable code. **Please tell me where I am wrong** (the same might apply for C++ code that runs as bytecode on a JVM; need research here) – Emmef Sep 15 '21 at 10:24
  • @Emmef: You don't run threads of a single process (instance of a C++ program) across the cores of such a machine. Instead, each coherency domain runs a separate instance of the program, with message passing such as MPI potentially using shared memory with explicit flushing of those regions, like I mentioned in my answer. As the [SCC wikipedia](https://en.wikipedia.org/wiki/Single-chip_Cloud_Computer) says, "you get a functional processor that is fast ... with a framework *resembling a network of cloud computers*". Notably *not* resembling an SMP system. – Peter Cordes Sep 15 '21 at 13:09
  • @Emmef: I think this answer does mention the hypothetical possibility of a C++ implementation that *does* run `std::thread` across cores that aren't cache-coherent. It's not impossible; the data-race UB rules make that possible, as long as every release operation does explicit flushing of all previous non-atomic stores, or some more sophisticated mechanism to just sync the actually-shared data. That was the point of the "*In theory: C++ threads on hardware without coherent caches*" section of my answer: that it's unlikely but possible (and would break any assumptions based on cache coherence). – Peter Cordes Sep 15 '21 at 13:12
  • @Emmef: If you have a suggestion to make that part of my answer clearer, let me know. – Peter Cordes Sep 15 '21 at 13:15
31

(Editor's note: in C++11 volatile is not the right tool for this job and still has data-race UB. Use std::atomic<bool> with std::memory_order_relaxed loads/stores to do this without UB. On real implementations it will compile to the same asm as volatile. I added an answer with more detail, and also addressing the misconceptions in comments that weakly-ordered memory might be a problem for this use-case: all real-world CPUs have coherent shared memory so volatile will work for this on real C++ implementations. But still don't do it.

Some discussion in comments seems to be talking about other use-cases where you would need something stronger than relaxed atomics. This answer already points out that volatile gives you no ordering.)


Volatile is occasionally useful for the following reason: this code:

/* global */ bool flag = false;

while (!flag) {}

is optimized by gcc to:

if (!flag) { while (true) {} }

This is obviously incorrect if the flag is written to by the other thread. Note that without this optimization the synchronization mechanism probably works (depending on the other code, some memory barriers may be needed): there is no need for a mutex in a 1-producer, 1-consumer scenario.

Otherwise the volatile keyword is too weird to be usable: it does not provide any memory-ordering guarantees wrt. either volatile or non-volatile accesses, and it does not provide any atomic operations - i.e. you get no help from the compiler with the volatile keyword except disabled register caching.
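
(Editor's note: a sketch of that ordering problem, with invented payload/ready names; a volatile flag does not order the non-volatile data it is meant to publish:)

int payload = 0;               // non-volatile data to publish
volatile bool ready = false;

void producer() {
    payload = 42;   // can be reordered after the flag store by the
    ready = true;   // compiler, or by the CPU on weakly-ordered ISAs
}

void consumer() {
    while (!ready) {}
    int x = payload;  // may still observe 0: volatile gives no ordering
    (void)x;
}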

Peter Cordes
zeuxcg
  • If I recall, C++0x atomic is meant to do properly what a lot of people believe (incorrectly) is done by volatile. – David Heffernan Dec 29 '10 at 21:33
  • Yeah. It's funny, I'd added a line about C++0x atomic facilities, but then removed it since it seemed not related to the exact question :) C++0x atomic is as it should have been - concrete load/store semantics on each access are very useful. – zeuxcg Dec 29 '10 at 21:35
  • `volatile` doesn't prevent memory accesses from being reordered. `volatile` accesses won't be reordered with respect to each other, but they provide *no* guarantee about reordering with respect to non-`volatile` objects, and so, they're basically useless as flags as well. – jalf Dec 29 '10 at 23:42
  • I think most of the "volatile is useless" crowd are relying on the fact that more likely code such as `while (!global_flag) { sleep(1); }` will work without `volatile`, because their compilers don't look into `sleep()` when optimizing and thus assume that `sleep()` may modify `global_flag`, and thus the right code is produced for the wrong reasons. Perhaps a suitable example could be constructed with LLVM link-time optimizations? – Ben Jackson Dec 30 '10 at 01:00
  • But that code is using busy waiting - something to be avoided. – David Preston Jan 03 '11 at 17:23
  • @Ben: I think you've got it upside down. The "volatile is useless" crowd relies on the simple fact that *volatile does not protect against reordering*, which means it is utterly useless for synchronization. Other approaches might be equally useless (as you mention, link-time code optimization might allow the compiler to peek into code you assumed the compiler would treat as a black box), but that doesn't fix the deficiencies of `volatile`. – jalf Jan 05 '11 at 20:02
  • @jalf: some uses of flags do not require that they not be reordered (though most do, so while there is a very, very niche use case where using volatile is perfectly fine, there are lots and lots of people who misuse volatile in unsafe ways) – Feb 12 '11 at 19:35
  • @Dan: name a situation where reordering is not a problem. The flag is used to indicate that some event has occurred, and that only works if the event *has* actually occurred when the flag is set. Which other cases do you have in mind? – jalf Feb 13 '11 at 04:56
  • @jalf: See the article by Arch Robinson (linked elsewhere on this page), 10th comment (by "Spud"). Basically, the reordering does not change the logic of the code. The posted code uses the flag to cancel a task (rather than to signal the task is done), so it doesn't matter if the task is cancelled before or after the code (e.g. `while (work_left) { do_piece_of_work(); if (cancel) break;}`); if the cancel is reordered within the loop, the logic is still valid. I had a piece of code which worked similarly: if the main thread wants to terminate, it sets the flag for other threads, but it doesn't... – Feb 13 '11 at 14:03
  • ...matter if the other threads do an extra few iterations of their work loops before they terminate, as long as it happens reasonably soon after the flag is set. Of course, this is the ONLY use that I can think of, and it's rather niche (and may not work on platforms where writing to a volatile variable does not make the change visible to other threads, though on at least x86 and x86-64 this works). I certainly wouldn't advise anybody to actually do that without a very good reason; I'm just saying that a blanket statement like "volatile is NEVER useful in multithreaded code" is not 100% correct. – Feb 13 '11 at 14:06
  • Since I was looking for a good answer I could link to about why volatile is useless, I can't let these comments stand. What Dan is saying here only works on x86 because of its strong underlying memory model. What he's proposing is equally broken as every other use case of volatile on many other platforms (e.g. there's no guarantee that you don't read a stale value from the cache). So yes, if you want a program that doesn't just work under x86, volatile is really *never* useful. – Voo Mar 02 '14 at 16:46
  • My summary of all of this is: (1.) there exist examples where volatile is correct/useful for multithreading (@Dan's example: use a volatile bool to stop a thread's loop. [another example](http://stackoverflow.com/a/246392/52074)) (1.addendum) the examples where volatile is correct/useful for multithreading are compiler/implementation-dependent (one compiler may have atomic-like behaviour for volatile while another may not) AND hardware-dependent (arm vs x86) (3.) the statement "volatile is NEVER correct/useful in multithreaded code" is wrong because there exists at least one counterexample – Trevor Boyd Smith May 06 '15 at 15:22
  • @Voo: There are *many* platforms where it's possible to cheaply ensure that a write to a volatile write-once variable will *eventually* be seen by all other threads (within a few seconds); indeed, optimization could be assisted if there were a qualifier which were looser than `volatile`, so that a compiler would be free to optimize out a *bounded* number of consecutive accesses [e.g. if a compiler unrolls a loop 8 times, it would only have to check the variable once in the unrolled loop, rather than checking eight times]. The overhead imposed by a mechanism with loose semantics... – supercat Jul 16 '15 at 17:53
  • ...could be much less than would be necessary when using a mechanism with more rigid semantics. – supercat Jul 16 '15 at 17:53
  • @supercat which is exactly what std::atomic does with the memory orders weaker than acquire/release. – Voo Jul 16 '15 at 18:30
  • @Trevor So you also think that "referencing one past the end of an array" is not always wrong because there's one example where it "works"? That's not how C works, and it's particularly dangerous thinking if concurrency is involved. – Voo Jul 16 '15 at 19:11
  • @Voo if by _"referencing one past the end of an array"_ you mean "_point_ one past the end of an array", then you're being sloppy with terminology, as that _doesn't_ reference the memory it points at. – Johann Gerell Oct 09 '15 at 06:42
  • @Johann Are you just nitpicking or really confused? I would have thought any C++ programmer would know that accessing memory outside an array is undefined behavior, but it still works often enough - which is the whole point of the argument there. But yes, I lost "memory" there, and no, I didn't mean pointing one past the end of an array, because that's not undefined behavior. – Voo Oct 09 '15 at 07:54
  • @Voo No, I genuinely did not know if you meant _pointing at_ when you said _referencing_ - I don't consider it nitpicking to point (sic!) out when those two terms are misused, since they are so completely different. Now that you've clarified that you meant that a memory access by _referencing_ "one past the array end" can _seemingly_ succeed, I perfectly understand your statement. – Johann Gerell Oct 09 '15 at 08:00
  • @Voo Just to clarify why I replied in the first place; you wrote _"I would have thought any C++ programmer would know that accessing memory outside an array is undefined behavior"_, but I would argue that the vast majority of C++ developers _don't_ know what UB means and _don't_ know the difference between _point at_ and _reference_. That majority is like dark matter, and unlikely to be found on SO or at conferences or reading C++ blogs - they are still the majority. – Johann Gerell Oct 09 '15 at 08:28
  • @Voo, re [your earlier comment](https://stackoverflow.com/questions/4557979/when-to-use-volatile-with-multi-threading/4558031#comment33577468_4558024) about reading stale values from cache: Everything uses coherent data caches (MESI) https://software.rajivprab.com/2018/04/29/myths-programmers-believe-about-cpu-caches/; an asm store in one core will quickly be seen by asm loads in other cores. `volatile` does work in practice for a `keep_running` flag that doesn't need ordering wrt. anything else, but `atomic` with `mo_relaxed` will give you the same asm as volatile and doesn't have UB. – Peter Cordes Oct 24 '19 at 04:54
  • @Voo: posted [my own answer](https://stackoverflow.com/questions/4557979/when-to-use-volatile-with-multi-threading/58535118#58535118), and updated this one with a disclaimer. This comment thread seems to be a mess, with some people arguing against volatile because it gives no ordering, even though the use case in this answer doesn't need any. (Assuming it's a keep_running flag, not a data_ready flag.) – Peter Cordes Oct 24 '19 at 06:33
  • @Peter "Everything uses coherent data caches" - everything? I mean, sure, your run-of-the-mill x64 CPU will, but every ISA ever invented, now and in the future? There are some *very* weird architectures out there. I can't imagine that all those DSPs, or mixed-CPU SoCs, all guarantee anything close to MESI coherency. – Voo Oct 24 '19 at 10:50
  • @Voo: I don't just mean x86-64. I mean PowerPC, MIPS, ARM, RISC-V, etc. etc. All SMP systems that a C++ implementation would want to start `std::thread`s across, or that can run a single-system-image kernel. Mixed SoCs with non-coherent shared memory apparently exist, but not AFAIK as compiler targets for ISO C++ compliant compilers. ([Discussion in comments about them](https://stackoverflow.com/questions/58516052/multithreading-program-stuck-in-optimized-mode-but-runs-normally-in-o0?noredirect=1#comment103396268_58516052), which I did link in my answer on this question.) – Peter Cordes Oct 24 '19 at 10:55
  • @Peter But that's my point: now we've weakened an already weak claim to "no compiler I currently know of exists for these architectures I didn't know 4 hours ago existed". But why go through all this trouble when there's a simple solution that is actually guaranteed to work perfectly fine? (Granted, I'm reasonably sure all those DSPs do break several rules of the standard, but do all future ones too?) – Voo Oct 24 '19 at 11:00
  • *But why go through all this trouble when there's a simple solution that is actually guaranteed to work perfectly fine?* FFS, why do people keep assuming I'm advocating for the use of `volatile`??? This is really frustrating. I'm trying to promote understanding of caches in mainstream CPUs so people can make correct decisions about performance when *using* `atomic` correctly (perhaps with `mo_relaxed`). Did you read my edit to this answer? Or any of the bolded or ##Heading text in my answer? It's all carefully worded to say *don't* actually use `volatile`, but here's how CPUs work. – Peter Cordes Oct 24 '19 at 11:08
  • @Voo You are missing the point. It isn't that weird CPUs don't exist, or that there aren't uncommon ways to make common CPUs work on the same motherboard. It's that **no sane thread implementation strategy is imaginable on these.** Any C or C++ compiler tries to have predictable performance for at least common code patterns. *That you can emulate a Turing machine on some CPU does not mean you can have a reasonable compiler.* You are like people arguing that vtables are not standard. All compilers use them. – curiousguy Nov 10 '19 at 08:46
  • @curiousguy "It's that no sane thread implementation strategy is imaginable on these". You are aware that those weird CPUs do have threading implementations (just not standard-conforming ones), right? Lack of imagination is not the same as impossibility. – Voo Nov 10 '19 at 09:55
  • @Voo Of course weird CPUs might have some extremely weird threads. But it doesn't matter from the POV of portable code, **as these special compilers don't try to support normal code**; they need specific barriers and specific synchronization devices, etc. Any system where you need to purge caches, at great cost, to get memory visibility isn't going to run portable MT code. Code is going to be written specifically under those constraints. – curiousguy Nov 10 '19 at 10:02
  • Just imagine a `shared_ptr` on such a system w/o coherent caches. On any destruction of an instance where `count>1` you would have to write your modified cache content to main memory, and when `count==1` you would even need to write back the cache and then purge it to get fresh data? Could you even have dynamic allocation in a shared domain on that system? – curiousguy Nov 10 '19 at 10:34
-1

You need volatile and possibly locking.

volatile tells the optimiser that the value can change asynchronously, thus

volatile bool flag = false;

while (!flag) {
    /*do something*/
}

will read flag every time around the loop.

If you turn optimisation off, or make every variable volatile, a program will behave the same but slower. volatile just means 'I know you may have just read it and know what it says, but if I say read it, then read it.'

Locking is a part of the program. So, by the way, if you are implementing semaphores then, among other things, they must be volatile. (Don't try it; it is hard, will probably need a little assembler or the new atomic stuff, and it has already been done.)

ctrl-alt-delor
  • But isn't this, and the same example in the other response, busy waiting, and thus something that should be avoided? If this is a contrived example, are there any real-life examples that aren't contrived? – David Preston Jan 03 '11 at 17:19
  • OK, yes. You really need to do something in the braces (I will edit the post). Busy waiting is usually a bad idea. You may be processing something (a list) until another thread signals you to stop. Without the volatile it will continue forever, and for this example no lock is needed; bool is atomic. – ctrl-alt-delor Jan 05 '11 at 17:54
  • @Chris: Busy waiting is occasionally a good solution. In particular, if you expect to only have to wait for a couple of clock cycles, it carries far less overhead than the much more heavyweight approach of suspending the thread. Of course, as I've mentioned in other comments, examples such as this one are flawed because they assume reads/writes to the flag won't be reordered with respect to the code it protects, and no such guarantee is given, so `volatile` isn't really useful even in this case. But busy waiting is an occasionally useful technique. – jalf Jan 05 '11 at 20:05
  • @jalf my understanding of things is that volatile tells the compiler that the variable can be read/written asynchronously to the program (by another thread or by hardware); it is supposed to give the semantics of "if I say read or write, then read or write, and do it when I tell you to". If the CPU is re-ordering instructions in a way that changes these semantics, then the compiler has to defeat that optimisation as well. Am I missing something? – ctrl-alt-delor Jan 10 '15 at 10:48
  • @richard Yes and no. The first half is correct. But this only means that the CPU and compiler are not allowed to reorder volatile variables with respect to each other. If I read a volatile variable A, and then read a volatile variable B, then the compiler must emit code that is guaranteed (even with CPU reordering) to read A before B. But it makes no guarantees about all the non-volatile variable accesses. They can be reordered around your volatile read/write just fine. So unless you make *every* variable in your program volatile, it won't give you the guarantee you're interested in. – jalf Jan 10 '15 at 11:45
  • @jalf That is not true. There is no requirement that `volatile` prevent CPU reordering, and on most modern platforms it does not actually do so. – David Schwartz Jun 27 '16 at 19:57
  • @DavidSchwartz Do you just go around posting false things about `volatile`? Do you have a different Standard than these people? http://stackoverflow.com/questions/2535148/volatile-qualifier-and-compiler-reorderings – underscore_d Jul 07 '16 at 02:07
  • @underscore_d Read the comments to the two answers that appear to disagree with me and you'll see that they do not. You will *not* find accesses to `volatile`s emitting memory barriers, so reordering is *not* prevented. – David Schwartz Jul 07 '16 at 04:09
  • @DavidSchwartz We mean reordering of reads/writes to that specific `volatile`, which is guaranteed not to occur - not reordering of any said read/write relative to other statements (whether or not they're also `volatile`), which is allowed to happen. Is the latter what you mean? – underscore_d Jul 07 '16 at 09:33
  • @underscore_d Again, you will see that on pretty much every compiler, reads and writes to `volatile` variables do *not* emit memory barriers, even on platforms that require them to prevent reads and writes from being reordered and/or coalesced. Try it. (Do you agree that the CPU can coalesce writes to ordinary memory? Do you see the compiler doing anything to stop it when two writes occur to the same `volatile`? What about writes to two adjacent `volatile short`s, or an "ABA" write to them? Do you see anything emitted to stop write combining?) – David Schwartz Jul 07 '16 at 09:42
  • @DavidSchwartz "_Do you agree that the CPU can coalesce writes to ordinary memory?_" Who cares? – curiousguy Jul 15 '18 at 04:07
  • @curiousguy Well, because if it can coalesce two writes, it can't possibly preserve the ordering of them. The point is that there are absolutely no restrictions on the CPU's ability to reorder, modify, or merge the writes. Nothing imposes any such restriction, and actual CPUs do that. – David Schwartz Jul 16 '18 at 00:53
  • Should the C compiler not issue barrier instructions between volatile reads/writes, to ensure that volatile is honoured? Or are there CPUs that do not allow volatile to be implemented? – ctrl-alt-delor Jul 16 '18 at 08:48
  • @DavidSchwartz What is the definition of "coalesce two writes"? – curiousguy Jul 16 '18 at 14:44
  • @curiousguy The code says to make two or three writes and the CPU instead does one or two. For example, you do `i=1; j=2; k=3;` and the CPU does `i=1; k=3;` in a single, atomic operation. Nothing prevents a CPU from doing this even if `i`, `j`, and `k` are `volatile`, and real CPUs do this. CPUs can even optimize out writes to `volatile`s, and real-world CPUs actually do this. So `i=1; j=2; i=3;` can result in the `i=1;` write getting optimized out, again, even if `i` is `volatile`. This is why many CPUs have memory barrier operations, and `volatile` doesn't use them nor is it required to. – David Schwartz Jul 17 '18 at 19:49
  • Indeed, with all variables starting at 0, if `i=1; j=2; i=3;` is turned into `j=2; i=3;`, that would break another thread expecting (incorrectly) that `j==2` can only happen if `i>0`. – curiousguy Jul 18 '18 at 19:19
  • @ctrl-alt-delor: That's not what `volatile`'s "no reordering" means. You're hoping it means that the stores will become *globally* visible (to other threads) in program order. That's what `atomic` with `memory_order_release` or `seq_cst` gives you. But `volatile` *only* gives you a guarantee of no *compile-time* reordering: each access will appear in the asm in program order. Useful for a device driver. And useful for interaction with an interrupt handler, debugger, or signal handler on the current core/thread, but not for interacting with other cores. – Peter Cordes Oct 24 '19 at 04:44
  • `volatile` in practice is sufficient for checking a `keep_running` flag like you're doing here: real CPUs always have coherent caches that don't require manual flushing. But there's no reason to recommend `volatile` over `atomic` with `mo_relaxed`; you'll get the same asm. – Peter Cordes Oct 24 '19 at 04:46
-1
#include <iostream>
#include <thread>
#include <unistd.h>
using namespace std;

bool checkValue = false; // plain global: shared without synchronization (data race, UB)

int main()
{
    std::thread writer([&](){
            sleep(2);
            checkValue = true;
            std::cout << "Value of checkValue set to " << checkValue << std::endl;
        });

    std::thread reader([&](){
            while(!checkValue); // under -O3 the load may be hoisted: infinite loop
        });

    writer.join();
    reader.join();
}

Once an interviewer who also believed that volatile is useless argued with me that optimisation wouldn't cause any issues, and was referring to different cores having separate cache lines and all that (didn't really understand what he was exactly referring to). But this piece of code, when compiled with -O3 on g++ (g++ -O3 thread.cpp -lpthread), shows undefined behaviour. Basically, if the value gets set before the while check, it works fine; if not, it goes into a loop without bothering to fetch the value (which was actually changed by the other thread). Basically, I believe the value of checkValue only gets fetched once into a register and never gets checked again under the highest level of optimisation. If it's set to true before the fetch, it works fine; if not, it goes into a loop. Please correct me if I am wrong.
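
(Editor's note: as the other answers explain, std::atomic fixes this without extra barrier cost; a sketch of the changed program:)

#include <atomic>
#include <thread>
#include <unistd.h>

std::atomic<bool> checkValue{false};

int main()
{
    std::thread writer([](){
            sleep(2);
            checkValue.store(true, std::memory_order_relaxed);
        });

    std::thread reader([](){
            // The load can no longer be hoisted out of the loop,
            // and there is no data-race UB.
            while(!checkValue.load(std::memory_order_relaxed));
        });

    writer.join();
    reader.join();
}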

Anu Siril