
What is the difference between reading the value of an `atomic_uint` with `memory_order_relaxed`, and reading the value of a `volatile unsigned int` (assuming the volatile operations are atomic)?

Specifically, let's define:

Solution 1

  1. The "writer" thread writes to an atomic_uint (with any memory order qualifier, from memory_order_relaxed to memory_order_seq_cst)
  2. The "reader" thread does an atomic relaxed read on the same atomic_uint

Solution 2

  1. The "writer" thread writes to a volatile unsigned int
  2. The "reader" thread reads that value

As written, I know that neither case offers any guarantee that the reader will actually observe the value written by the writer. What I'm trying to understand is the difference between the volatile read and the relaxed atomic read: what does one provide that the other doesn't, when considering read-after-write consistency?

The only difference I see is:

  • volatile operations cannot be reordered with respect to each other, while the relaxed atomic load can be reordered with other atomic operations

Is there something else?

kiv
  • One is undefined behavior (assuming no external synchronization), the other isn't. – T.C. Aug 05 '15 at 22:39
  • By definition volatile operations cannot be optimized, but atomic operations sometimes can be. – curiousguy Dec 10 '18 at 02:15
  • @T.C. Behavior that's defined by the ABI and CPU. – curiousguy Dec 10 '18 at 02:31
  • "*By definition volatile operations cannot be optimized*" Except they are by almost all modern platforms. For example, reads and writes to and from memory are done to and from cache instead which is an optimization. They are re-ordered (between cache and memory) on many platforms which is, again, an optimization. – David Schwartz Dec 10 '18 at 02:56
  • @DavidSchwartz They cannot be optimized away by the compiler. They are executed as written in the program. You can check that by pausing the program and reading memory, from the POV of the program (from the CPU). The POV of the CPU is the correct one, the POV of the RAM is not. You can use `ptrace` or the local equivalent to do that. – curiousguy Dec 22 '18 at 18:25
  • @curiousguy Are you claiming this is required by some standard? Or are you saying this is what happens to be the case on platforms you have experience with? If the former, what is the evidence of this? What standard? If the latter, it's not "by definition", it's just what some platforms happen to do because that's what works best on them. (I've never seen any standard that says the compiler can't optimize something but the CPU can. How weird would it be to distinguish since the standard specifies what the compiler commands the CPU to do.) – David Schwartz Dec 22 '18 at 18:28
  • @DavidSchwartz Yes the definition of `volatile` is what I just described. There is no conceivable way this wouldn't be the case. The cache is a transparent optimization so of course it doesn't change anything here. The correct POV is CPU POV. – curiousguy Dec 22 '18 at 18:29
  • @curiousguy The definition of `volatile` is that the compiler can't reorder them but the CPU can? The compiler's job is to tell the CPU what to do. There is no way a language standard could prohibit the compiler from doing something but allow the compiler to allow the CPU to do it. That's not even coherent -- all the compiler does is tell the CPU what to do and what not to do. (Your statement that the correct POV is the CPU POV may be what you've seen on some systems. But it's not the definition of anything nor required by any standard.) – David Schwartz Dec 22 '18 at 18:31
  • @DavidSchwartz Exactly. The CPU never reorders stuff in a way that break `volatile` semantic. – curiousguy Dec 22 '18 at 18:34
  • @curiousguy Anything can be optimized so long as its semantics aren't violated and can't be optimized in a way that violates its semantics. Lots of optimizations are possible on `volatile` without violating its semantics, and those optimizations are made. So "*By definition volatile operations cannot be optimized*" is misleading and false. That was my original point. (Also, the CPU/compiler distinction is misleading and false too. All compilers do is tell CPUs what to do. To say the compiler can't do something but the CPU can is bizarre.) – David Schwartz Dec 22 '18 at 18:39
  • @DavidSchwartz It isn't misleading nor false. Volatile operations are only guaranteed to be visible from the `ptrace` POV or equivalent. You mentioned the order of changes in RAM, but RAM isn't the place where you can observe the value of volatile objects. The compiler must emit exactly the loads and stores prescribed by the volatile variable operations; the CPU then preserves the order of these operations, from the POV of that CPU. – curiousguy Dec 22 '18 at 18:45
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/185660/discussion-between-david-schwartz-and-curiousguy). – David Schwartz Dec 22 '18 at 18:52

2 Answers


The volatile read isn't guaranteed to be atomic. That means you could read a value that was never written to the variable (and that could never be written by any part of your program). E.g. if your application only ever writes 0xAAAAAAAA or 0xBBBBBBBB to a variable, a torn volatile read could yield 0xAAAABBBB. Or really anything else, since the standard doesn't specify the behavior when volatile reads and writes appear in different threads without other means of synchronization.

I don't know whether the standard says it's UB or implementation-defined, though. I can only say that there are implementations (e.g. MSVC 2005) that define the behavior of unsynchronized volatile reads/writes as an extension.

Paul Groke
  • Ah, yeah - I actually specified atomic_uint thinking that a write to an unsigned int would always be atomic, but obviously that's architecture-dependent. Thanks for the reminder! :) I'm specifically more interested in the read-after-write consistency guarantees of both approaches. I'll edit my post to reflect that. Thanks! – kiv Aug 06 '15 at 03:11
  • A volatile read is an observable of the execution: it must really be performed as written in the source code. That means that *if* the CPU provides an atomic way to do it, the compiler *should* use it. All CPUs that I have heard of provide a way to read and write an unsigned int atomically. – curiousguy Dec 17 '19 at 02:23
  • @curiousguy No, volatile does not give you atomicity, and should not be relied upon to do so. The correct solution is to use std::atomic , see https://stackoverflow.com/a/2485177/ , https://web.archive.org/web/20190219170904/https://software.intel.com/en-us/blogs/2007/11/30/volatile-almost-useless-for-multi-threaded-programming/ – Max Barraclough May 05 '20 at 21:35
  • @MaxBarraclough "_No_" implies you disagree w/ a statement I made. I don't see which: 1)a) What words written implied that "volatile does not give you atomicity"? b) What operation that is naturally atomic when performed in the most obvious and efficient way on a CPU, is not atomic when performed as a volatile operation in C/C++? 2)a) What words written implied that volatile use is a good idea for MT programming in C/C++? b) If you have `atomic` support in your compiler (or some equiv library), why bother w/ `volatile`? – curiousguy May 05 '20 at 23:49
  • "_Hans Boehm points out that there are only three portable uses for volatile._" If he wrote that, that isn't the brightest thing he wrote, as it's patently wrong, as I explained many times. "_"Declaring your variables volatile will have no useful effect, and will simply cause your code to run a *lot* slower when you turn on optimisation in your compiler."_" Compared to what, operations on atomics? LOL, that's patently wrong. Slower is always meaningless when you don't say what is being compared; lack of clear code to code comparison makes the statement silly. – curiousguy May 05 '20 at 23:52
  • "_A volatile read or write of a 32-bit int is atomic on most modern hardware, but volatile has nothing to do with it_" LOL, how does that contradict the comment above, which you tried to refute by citing that source? It almost confirms what I wrote, except the "almost" is not needed. If you think not all "modern hardware" make 32-bit int reads or writes atomic, please cite one exception. By qualifying his statement with "most" when the correct word is all, the writer doesn't show a good grasp on HW. – curiousguy May 05 '20 at 23:55
  • "_a volatile read or write of a 129-bit structure is not going to be atomic_" Again that exactly goes on the direction of what I explained: only operations on scalar types can be used. Not on composite types. Nowhere did I mention using a volatile qualified composite type. Later in your own reference: "_Even if the compiler does not reorder the references,_" That's exactly what I wrote: **w/ a volatile access, the compiler must produce asm code that exactly match source code.** That's what I asserted, no more no less. I'm in 100% agreement with that part of your own reference. – curiousguy May 05 '20 at 23:59
  • I'm sure there are many programmers who 1) have not taken any course in MT programming; 2) want to write their own MT primitives to avoid the cost of mutex or sem synchronization; 3) feel that compiler optimizations can defeat their home made primitives. Then they might randomly sprinkle their code w/ volatile hoping to discourage the compiler from optimizing their code. Just don't do that. It isn't even MT or volatile specific: that approach to programming by ppl who don't understand the job occurs when ppl sprinkle their code w/ `shared_ptr` because they have no resource ownership design. – curiousguy May 06 '20 at 00:05
  • And finally: "_Hans Boehm points out that there are only three portable uses for volatile. I'll summarize them here:_" ... "_memory that is modified by an external agent or appears to be because of a screwy memory mapping_" How is another concurrently running thread (say on another CPU) **not** "an external agent" from the POV of the compiler? It fits exactly. The source **you** referenced proves me right. Even the title does: the author took care to qualify the claim with **almost**. "Volatile: Almost Useless for Multi-Threaded Programming" How you could imagine that would PROVE me wrong? – curiousguy May 06 '20 at 00:09
  • @curiousguy `volatile` and atomic operations are two completely distinct concepts that only happen to be often useful in combination. `volatile` just forces memory access (=disallows optimizing it away) and atomic just forces a memory access that may or may not happen to be atomic (in case it's not optimized away). – Paul Groke May 06 '20 at 11:10
  • @PaulGroke In fact volatile guarantees that you get the guarantees of the underlying arch in the high level C/C++ loads/stores. If the CPU guarantees atomicity, you get that. It's rarely useful alone, but still, it shouldn't be vilified. Disallowing optimization is sometimes exactly what you want. Most of the time, it isn't, and you should use atomics. Atomics are almost never optimized, but might be in the future. – curiousguy May 06 '20 at 11:27
  • 1
    @curiousguy No, it doesn't _guarantee_ that at all, that's the whole point. It _gets_ you atomicity in many cases, but that's no _guarantee_. Also it's entirely non-trivial when "the CPU guarantees" atomicity. E.g. 64 bit accesses on 32 bit x86: the CPUs _has_ instructions to make those atomic (without expensive memory barriers), but `volatile` will not result in atomic accesses. See https://godbolt.org/z/mhbaRr And since using relaxed atomic loads/stores has no performance penalty with modern compilers, there's simply no point in using the wrong tool here. – Paul Groke May 06 '20 at 11:54
  • ps: And if one really needs a volatile, relaxed atomic load/store, then one should simply use one. The standard library allows for `volatile std::atomic`s. – Paul Groke May 06 '20 at 12:15
  • @PaulGroke Those atomic instructions are not used by default by the compiler when they are not the most efficient choice. But when a single instruction is atomic and efficient, for a scalar type, the compiler will never split the operation and write the two halves separately. – curiousguy May 06 '20 at 13:05
  • @curiousguy I know. But I'm trying to make you see that this is _in no way a guarantee_. Calling this a _guarantee_ is just plain _wrong_. If you disagree, please link/refer to the place where this guarantee is given. And even if there was such a guarantee, a guarantee that is platform dependent and dependent on what the compiler considers to be efficient - that wouldn't be a very good guarantee. And depending on such a (non-existent) bad guarantee makes no sense at all, if there are tools like true atomics that can be used instead - without any penalty. – Paul Groke May 06 '20 at 13:14
  • @curiousguy BTW: What would you expect the outcome to be if you changed the type of the variable to `double`. Same structure as in my example, again x86 32 bit on a common POSIX system (e.g. Linux). The CPU has an instruction to load a `double` atomically, it's even the same instruction that is being emitted for the volatile load. So will the volatile load be atomic? No, it won't. Because you're hit by the same strange alignment issue where a double can be placed on an address that's not sufficiently aligned. See now why this "guarantee" that you are trying to invent isn't worth much? – Paul Groke May 06 '20 at 13:31
  • @PaulGroke I think I made myself clear. The guarantee w/ volatile load/store is that you get the exact guarantees of the scalar operation on the given CPU. All practically used systems guarantee atomic load/stores on things like `int` and object pointers, and anything word sized and naturally aligned. So of course it's "arch dependent", but it's practically portable. The "place" the guarantee is given is in the CPU doc. Now I could return the pointless questions to you. Where is the behavior of threads and atomics defined in C++? Nowhere. So don't go there! – curiousguy May 06 '20 at 15:02
  • @PaulGroke "_the same strange alignment issue where a double can be placed on an address that's not sufficiently aligned_" I have no idea what "issue" you have seen. Which compiler misaligned your variables? – curiousguy May 06 '20 at 15:03
  • @curiousguy Every compiler that wants to be compatible with Linux' x86 32 bit ABI. Check the godbolt link that I provided earlier and look at the disassembly of the "offsetof" functions. Or the simpler example here https://godbolt.org/z/sGSpKG The issue is that while `__alignof(long long)` is 8, `__alignof(struct_that_contains_long_long)` is 4. And the same with `double`. If you use `std::atomic` instead the issue disappears. – Paul Groke May 06 '20 at 15:56
  • @curiousguy The behavior of threads and atomics is defined in the C++ standard since C++11. And for C in the C standard since C11. And since those are the _only_ things that count when talking about the C++ or C memory model... – Paul Groke May 06 '20 at 16:04
  • @PaulGroke Then explain that MT semantic to me. When are programs allowed to progress non sequentially? Can any program be proven not to have UB? I have asked these as standalone Q, but those were closed, as they were too non PC for that site where you can't really criticize a std. – curiousguy May 07 '20 at 02:32
  • @curiousguy If you want to full picture you have to read the standard yourself - which may take you weeks (I'm not implying your stupid, just that the standard is very hard to read with its language and all the different rules interacting in often non-trivial ways etc.) Aside from that no, not every program can be shown to have UB. There are some simple (in comparison) rules that one can follow that are good for 99% of all cases. – Paul Groke May 07 '20 at 11:06
  • @PaulGroke The standard is easy to read (on that matter) because **it says exactly nothing about what is sequential and what isn't**. Compare w/ the abject complexity of the Java spec about possible reordering and inevitable stores, which apparently common runtime compilers don't even implement and that are not even reasonable expectations according to experts. The way the C++ std gets away w/ not defining anything re: threads is that most ppl believe it's very complicated and give up. That's how they get away w/ essentially a language definition scam. – curiousguy May 07 '20 at 12:34
  • @curiousguy I'm sorry but this is getting ridiculous. Read "4.7 Multi-threaded execution" (+ all referenced locations like section 32) of n4659 (which is the final working draft for C++17). Or the equivalent sections of another C++ standard >= C++11. – Paul Groke May 07 '20 at 17:51

The use of `memory_order_relaxed` essentially gives you a well-behaved volatile variable for the purpose of inter-thread communication:

  • volatile is ultimately specified in terms of the ABI (Application Binary Interface); it's an external interface: volatiles are interfaces with the external world, so an access to such an object is part of the observable behavior and cannot ever be optimized away;
  • atomics are specified purely in terms of internal semantics; the representation of the objects isn't part of the specification, so an access to such an object is part of the abstract machine and can sometimes be optimized away, or reasoned about internally (by the compiler, in terms of C/C++ semantics).
curiousguy