64

I'm wondering why no compilers are prepared to merge consecutive writes of the same value to a single atomic variable, e.g.:

#include <atomic>
std::atomic<int> y(0);
void f() {
  auto order = std::memory_order_relaxed;
  y.store(1, order);
  y.store(1, order);
  y.store(1, order);
}

Every compiler I've tried will issue the above write three times. What legitimate, race-free observer could see a difference between the above code and an optimized version with a single write (i.e. doesn't the 'as-if' rule apply)?

If the variable had been volatile, then obviously no optimization is applicable. What's preventing it in my case?

Here's the code in compiler explorer.

curiousguy
PeteC
  • This optimization step probably doesn't lead to a great speed-up in real applications compared to the cost of running the optimization step, especially when the code is not trivial. [This talk](https://youtu.be/IB57wIf9W1k) is somewhat related. – nwp Aug 30 '17 at 12:29
  • 21
    And what if `f` is only one thread of many writing to `y`, while there are others reading from `y`? If the compiler coalesces the writes into a single write, then the behavior of the program might change unexpectedly. – Some programmer dude Aug 30 '17 at 12:30
  • 24
    @Someprogrammerdude That behavior wasn't guaranteed before, so it wouldn't make the optimization invalid. – nwp Aug 30 '17 at 12:31
  • 4
    @Someprogrammerdude I'm assuming that situation, and still don't understand. 'f() runs really fast' is always a scheduling possibility, so no valid program could assume it could see each of those distinct writes. – PeteC Aug 30 '17 at 12:33
  • 1
    But you *don't know*, and that's one of the problems. If we don't know all possible use-cases, how would the compiler be able to? – Some programmer dude Aug 30 '17 at 12:36
  • 10
    a very practical argument is: for a compiler it would be hard to reason about the redundancy of the stores in the general case, while for the one writing the code it should be trivial to avoid such redundant writes, so why should compiler writers bother to add such an optimization? – 463035818_is_not_an_ai Aug 30 '17 at 12:40
  • 1
    Looks like the answer [here](https://stackoverflow.com/questions/45885048/is-this-compiler-transformation-allowed) might cover it. – NathanOliver Aug 30 '17 at 12:40
  • 3
    @NathanOliver How is that related? A compiler optimization that adds a write which potentially introduces a data race is not at all the same as an optimization that removes redundant thread-safe writes. – nwp Aug 30 '17 at 12:43
  • 3
    @NathanOliver Thanks, but removing the two redundant stores would not "introduce assignments to a potentially shared memory location that would not be modified by the abstract machine", so I don't think that part of the standard helps. – PeteC Aug 30 '17 at 12:44
  • 3
    The issue here is that it is impossible to prove that the stores are redundant. Assume another thread is running that sets `y` to `42` between the 2nd and 3rd stores, `y` would still be `1` at the end of `f`. If the "redundant" stores were removed `y` would be `42` at the end of `f`. – Richard Critten Aug 30 '17 at 12:50
  • 17
    @RichardCritten There is no way to write a C++ program that sets `y` to `42` between the 2nd and 3rd stores. You can write a program that just does the store and maybe you get lucky, but there is no way to guarantee it. It is impossible to tell if it never happened because redundant writes were removed or because you just got unlucky timing, hence the optimization is valid. Even if it *does* happen you have no way to know because it could have been before the first, second or third. – nwp Aug 30 '17 at 12:52
  • 4
    I'm actually relieved to hear no compiler optimizes that. – Michaël Roy Aug 30 '17 at 12:54
  • 6
    The standard committee isn't sure whether they're always ok with aggressive atomic optimizations, so compilers probably just avoid them. See [P0062](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0062r1.html) for some discussion about atomics and aggressive optimizations. The paper does explain that the kind of optimization you're expecting is indeed perfectly allowed, but not always what the user expects. – Morwenn Aug 30 '17 at 13:10
  • 22
    Prosaic answer is that there's probably never been enough code seen that looks like that to make any optimiser-writer decide to be bothered writing an optimisation for it. – TripeHound Aug 30 '17 at 13:21
  • @Morwenn interesting read, thank you. (Good candidate for an answer!) – PeteC Aug 30 '17 at 13:28
  • @TripeHound : Yes. [Duff's device](https://en.wikipedia.org/wiki/Duff%27s_device) is so obscure no one's ever seen it. – Eric Towers Aug 31 '17 at 15:50
  • @TripeHound There isn't a lot of code that creates redundant `shared_ptr` instances in inline functions? – curiousguy Dec 14 '18 at 00:34
  • @Morwenn The std committee isn't sure that it's OK to move atomic operations over very long loops. They don't seem to have given the issue any serious thought, as the same could be said of volatile operations or even regular I/O. (Or even of moving code around timing operations.) – curiousguy Jun 09 '19 at 00:52

9 Answers

53

The C++11 / C++14 standards as written do allow the three stores to be folded/coalesced into one store of the final value. Even in a case like this:

  y.store(1, order);
  y.store(2, order);
  y.store(3, order); // inlining + constant-folding could produce this in real code

The standard does not guarantee that an observer spinning on y (with an atomic load or CAS) will ever see y == 2. A program that depended on this would have a race-condition bug, but only the garden-variety kind, not the C++ Undefined Behaviour kind of data race (that's UB only with non-atomic variables). A program that expects to sometimes see it is not necessarily even buggy. (See below re: progress bars.)
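
For example, a minimal sketch (a hypothetical observer, not from the standard) of the kind of spinning reader in question; nothing guarantees this loop ever exits:

#include <atomic>
extern std::atomic<int> y;   // the variable from the example above

void observer() {
    // May legally spin forever: the compiler is allowed to fold the
    // writer's three stores into a single y.store(3), in which case
    // y == 2 is never globally visible.
    while (y.load(std::memory_order_relaxed) != 2) {
    }
}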

Any ordering that's possible on the C++ abstract machine can be picked (at compile time) as the ordering that will always happen. This is the as-if rule in action. In this case, it's as if all three stores happened back-to-back in the global order, with no loads or stores from other threads happening between the y=1 and y=3.

It doesn't depend on the target architecture or hardware; just as compile-time reordering of relaxed atomic operations is allowed even when targeting strongly-ordered x86. The compiler doesn't have to preserve anything you might expect from thinking about the hardware you're compiling for, so you need barriers. The barriers may compile into zero asm instructions.


So why don't compilers do this optimization?

It's a quality-of-implementation issue, and can change observed performance / behaviour on real hardware.

The most obvious case where it's a problem is a progress bar. Sinking the stores out of a loop (that contains no other atomic operations) and folding them all into one would result in a progress bar staying at 0 and then going to 100% right at the end.
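
Concretely, the problematic shape looks something like this (a sketch; do_chunk_of_work is a hypothetical stand-in for real work containing no atomic operations):

#include <atomic>
void do_chunk_of_work(int);           // hypothetical; contains no atomics

std::atomic<int> progress(0);

void worker() {
    for (int i = 0; i < 100; ++i) {
        do_chunk_of_work(i);
        progress.store(i + 1, std::memory_order_relaxed);
    }
}
// Sinking the store out of the loop would leave a single
// progress.store(100, ...): a GUI thread polling `progress` would
// see 0 and then 100, never anything in between.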

There's no C++11 std::atomic way to stop them from doing it in cases where you don't want it, so for now compilers simply choose never to coalesce multiple atomic operations into one. (Coalescing them all into one operation doesn't change their order relative to each other.)

Compiler-writers have correctly noticed that programmers expect an atomic store to actually happen to memory every time the source does y.store(); coalescing stores away would violate the principle of least surprise. (See most of the other answers to this question, which claim the stores are required to happen separately because of possible readers waiting to see an intermediate value.)

However, there are cases where it would be very helpful, for example avoiding useless shared_ptr ref count inc/dec in a loop.
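
For instance, something like this (a sketch; consume is a hypothetical stand-in for per-element work):

#include <memory>
void consume(int);   // hypothetical

void loop_over(const std::shared_ptr<int>& sp, int n) {
    for (int i = 0; i < n; ++i) {
        std::shared_ptr<int> local = sp;   // atomic ref-count increment
        consume(*local);
    }                                      // atomic ref-count decrement
    // A compiler could in principle prove the inc/dec pairs redundant
    // and hoist or remove them, but today none will.
}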

Obviously any reordering or coalescing can't violate any other ordering rules. For example, num++; num--; would still have to be a full barrier to runtime and compile-time reordering, even if it no longer touched the memory at num.
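
That is, even the reduced form keeps its ordering semantics (a sketch of the constraint, not of any compiler's actual output):

#include <atomic>
std::atomic<int> num(0);

void g() {
    num++;   // seq_cst atomic RMW
    num--;   // seq_cst atomic RMW
    // Even if this pair were optimized so the memory at `num` is never
    // modified, it would still have to act as a full barrier: no
    // surrounding loads or stores may be reordered across it, at
    // compile time or at run time.
}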


Discussion is underway to extend the std::atomic API to give programmers control of such optimizations, at which point compilers will be able to optimize when useful, which can happen even in carefully-written code that isn't intentionally inefficient. Some examples of useful cases for optimization are mentioned in the working-group discussion / proposal papers N4455 (No Sane Compiler Would Optimize Atomics) and WG21/P0062 (http://wg21.link/p0062).

See also discussion about this same topic on Richard Hodges' answer to Can num++ be atomic for 'int num'? (see the comments). See also the last section of my answer to the same question, where I argue in more detail that this optimization is allowed. (Leaving it short here, because those C++ working-group links already acknowledge that the current standard as written does allow it, and that current compilers just don't optimize on purpose.)


Within the current standard, volatile atomic<int> y would be one way to ensure that stores to it are not allowed to be optimized away. (As Herb Sutter points out in an SO answer, volatile and atomic already share some requirements, but they are different). See also std::memory_order's relationship with volatile on cppreference.

Accesses to volatile objects are not allowed to be optimized away (because they could be memory-mapped IO registers, for example).

Using volatile atomic<T> mostly fixes the progress-bar problem, but it's kind of ugly and might look silly in a few years if/when C++ decides on different syntax for controlling optimization so compilers can start doing it in practice.
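
A sketch of that workaround, reusing the progress-bar shape from above (hypothetical; note the volatile qualifier on the atomic):

#include <atomic>
volatile std::atomic<int> progress(0);   // volatile: stores can't be elided

void worker() {
    for (int i = 0; i < 100; ++i) {
        // ... a chunk of work, no other atomics ...
        progress.store(i + 1, std::memory_order_relaxed);
        // Each store is now an observable side-effect, so the compiler
        // must emit all 100 of them.
    }
}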

I think we can be confident that compilers won't start doing this optimization until there's a way to control it. Hopefully it will be some kind of opt-in (like a memory_order_release_coalesce) that doesn't change the behaviour of existing C++11/14 code when compiled as C++whatever. But it could be like the proposal in wg21/p0062: tag don't-optimize cases with [[brittle_atomic]].

wg21/p0062 warns that even volatile atomic doesn't solve everything, and discourages its use for this purpose. It gives this example:

if(x) {
    foo();
    y.store(0);
} else {
    bar();
    y.store(0);  // release a lock before a long-running loop
    for (;;) {...}  // loop contains no atomics or volatiles
}
// A compiler can merge the stores into a single y.store(0) here.

Even with volatile atomic<int> y, a compiler is allowed to sink the y.store() out of the if/else and just do it once, because it's still doing exactly 1 store with the same value. (Which would be after the long loop in the else branch). Especially if the store is only relaxed or release instead of seq_cst.

volatile does stop the coalescing discussed in the question, but this points out that other optimizations on atomic<> can also be problematic for real performance.


Other reasons for not optimizing include: nobody's written the complicated code that would allow the compiler to do these optimizations safely (without ever getting it wrong). That alone is not a sufficient explanation, because N4455 says LLVM already implements, or could easily implement, several of the optimizations it mentions.

The confusing-for-programmers reason is certainly plausible, though. Lock-free code is hard enough to write correctly in the first place.

Don't be casual in your use of atomic weapons: they aren't cheap and don't optimize much (currently not at all). It's not always easy to avoid redundant atomic operations with std::shared_ptr<T>, though, since there's no non-atomic version of it (although one of the answers here gives an easy way to define a shared_ptr_unsynchronized<T> for gcc).

Peter Cordes
  • So the optimization is (a) valid, (b) hard and (c) surprising. WG21/P0062R1 is particularly interesting. I personally don't feel that the "principle of least surprise" is a golden rule here; at this level of the language you need to know the rules before you play the game. Thanks for helping me understand. – PeteC Aug 31 '17 at 15:36
  • 2
    @PeteC: Yeah, I think it's important to realize that the optimization is allowed, and not doing it is a QOI issue, not a standards-compliance issue, and that something may change in a future standard. – Peter Cordes Aug 31 '17 at 15:38
  • So, C++11/14 are permitted to interact incorrectly with programmed I/O written in the style of [Duff's Device](https://en.wikipedia.org/wiki/Duff%27s_device). Great. – Eric Towers Aug 31 '17 at 15:51
  • 3
    @EricTowers no, in Duff's Device the output register would certainly be declared volatile (this is a textbook case for volatile) and the output would be as expected. – PeteC Aug 31 '17 at 16:17
  • 1
    @PeteC: Given the range of purposes for which languages like C and C++ are used, programs for some targets and application fields will often need semantics that aren't supportable everywhere; the language itself punts the question of when they should be supported as a QoI issue, but if programmers in a particular field would find a behavior surprising, that's a pretty good sign that quality implementations in that field should not behave in such fashion unless explicitly requested. The language rules themselves aren't complete enough to make the language useful for all purposes without POLA. – supercat Jul 17 '18 at 22:33
  • "_and abuses the meaning of volatile_" so does `volatile int w; w=0; delay_loop(1000); w=0;` so lol – curiousguy Dec 08 '18 at 03:24
  • @curiousguy: I decided to reword that, but you'd probably only use a `delay_loop` in non-portable code targeting a specific embedded system. In a lot of cases, `volatile` in embedded programming is used where relaxed or acq/rel atomic would be arguably more appropriate. (At least if you also include static_assert that it's lock-free.) – Peter Cordes Dec 08 '18 at 06:26
  • Either way it doesn't *feel* like any compiler should try to aggressively optimize code by pushing computations across volatile accesses even if it's strictly conforming to the letter of the specification: volatile should be vaguely ordered WRT the pure code around it. And least people who claim optimizing Java volatile/C++ atomic accesses is bad taste should find that optimizing across C/C++ volatile accesses is at least as bad. – curiousguy Dec 08 '18 at 06:38
  • 1
    @curiousguy: agreed, quality implementations probably won't reorder `volatile` with an expensive computation, even if they're tempted to do so by a common tail in both branches. But the standard allows behaviour we don't want, hence it's an issue for at least the standards committee to try to improve. You could just leave it at that and say it's already possible to make a strictly conforming C++ implementation that's near-useless for low-level systems programming, but a lot of that is by violating assumptions that most code makes, like that integer types don't have padding. Not optimization. – Peter Cordes Dec 08 '18 at 06:43
  • 1
    "_allow the compiler to do these optimizations safely (without ever getting it wrong)_" Detecting bounded cost computation is trivial (any code w/o loop or goto and no outline fun call is trivial); coalescence redundant atomic op occurring with only trivial cost code in between seem trivial. That would handle some `shared_ptr` style relaxed incr followed by release decr I believe. – curiousguy Apr 07 '19 at 21:05
  • > _The stores are required to happen separately because of possible readers waiting to see an intermediate value._ Even though the stores actually happen, there is no guarantee that other readers will see the intermediate value. If a program depends on this, then it has to state that intention explicitly with some kind of `yield`, at which point the compiler is legally not allowed to move the stores around. Isn't this how it's supposed to be? – Sourav Kannantha B Jul 07 '22 at 06:05
  • 1
    @SouravKannanthaB: Correct. Separate stores means they *might* see the intermediate value on real implementations, but there's no *guarantee* that it happens every time, or even that it's observable at all. Note that that partial sentence in my answer is describing a *claim* that other answers make. My answer is not agreeing that it's a *requirement* of the C++ standard. The words right before the part you quoted are "which claim". – Peter Cordes Jul 07 '22 at 09:27
  • For progress bar example, making progress increments in `memory_order_acq_rel` instead of `memory_order_relaxed` achieves the intent right? – Sourav Kannantha B Jul 08 '22 at 04:33
  • @SouravKannanthaB: The acq part of that is only meaningful for an atomic RMW, not just a store. A store with `acq_rel` ordering is like `release`, so all the stores can sink out to the bottom of a loop, or at least the ISO C++ standard doesn't forbid it. (Real compilers almost certainly wouldn't.) – Peter Cordes Jul 08 '22 at 04:42
  • @PeterCordes isn't increment a RMW operation? – Sourav Kannantha B Jul 08 '22 at 04:45
  • If you use `shared.fetch_add(1)` or `shared++`, then yes. But if you're the only writer, that's very inefficient vs. `shared.store(++local_tmp, memory_order_release)`. Especially on x86 where it needs a full barrier, but a plain store can sit in the store buffer while waiting for an RFO, if it's not exclusively owned because another core recently read it. – Peter Cordes Jul 08 '22 at 04:51
  • Would the answer change if we added some (atomic or not) other store in between. I.e. could `y.store(1); y.store(2); a = 1; y.store(3);` be optimized to `a = 1; y.store(3);` Would it depend on the memory_order? – Freek Jul 26 '23 at 21:15
  • @Freek: That could still be optimized to `a=1; y.store(3)`, with the first two stores removed by dead-store elimination. `y.store(1)` has seq_cst semantics, which for a store includes `release` and being part of the total order of SC ops. (In terms of a memory model with local reordering of accesses to coherent cache, no StoreLoad or StoreStore reordering with later SC loads or stores.) `y.store(2); a=1;` is already allowed to reorder to `a=1; y.store(2);`, and can in practice on AArch64 where `stlr` only prevents StoreLoad reordering with `ldar`, not plain stores or `ldapr`. – Peter Cordes Jul 26 '23 at 21:24
  • @PeterCordes right, thanks. If `a` is also an atomic, then this is not allowed right? I.e. the optimization would be `y.store(2); a.store(1); y.store(3)`? – Freek Jul 27 '23 at 06:47
  • @Freek: Right, if you use `=` or `.store` without overriding the default `seq_cst`, then optimization to `a=1; y=2; y=3` isn't allowed. It would be if you used `a.store(1, relaxed)`, but not `release`. Dropping the `y.store(2)` entirely leaving `a.store(1); y.store(3)` is I think not allowed. It's fine in theory that it's impossible for readers to ever observe `if(a==2) y==2;` with seq_cst loads, but the problem is that load of `y` could observe an older value that should have been overwritten by `2` before `a==2` became visible. – Peter Cordes Jul 27 '23 at 07:02
  • @PeterCordes I think in the first comment you mean `a==1` where you say `a==2` now right? Otherwise makes sense to me. Not fully following your second comment. What cases can the reader exactly not distinguish with the newly added `y.store(something)`? Either way, let's say if `a` is _not_ atomic, is there any way to make a reader observer y=2, a=1, y=3 _in order_ (i.e. prevent the reordering of a=1)? Maybe by using atomic_thread_fence? – Freek Jul 27 '23 at 07:54
  • @Freek: yes, `a==1`. Re: your final question about non-atomic `a` - not without data-race UB, which the compiler can assume doesn't happen. Also, non-atomic is about equal in strength to `relaxed` atomic in terms of (lack of) ordering guarantees, and `a=1; y=2, y=3` is a valid reordering (*only* SC ops are part of the total order), then just dead-store elimination. – Peter Cordes Jul 27 '23 at 08:09
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/254690/discussion-between-peter-cordes-and-freek). (re: my hypothetical optimization about maybe omitting the `y.store(2)`; I did think of a UB-free reader that could tell the difference by re-reading `y` multiple times after, so that doesn't work even roughly. Deleted my comments relating to that to de-clutter, see chat if interested.) – Peter Cordes Jul 27 '23 at 08:11
46

You are referring to dead store elimination.

It is not forbidden to eliminate an atomic dead store but it is harder to prove that an atomic store qualifies as such.

Traditional compiler optimizations, such as dead store elimination, can be performed on atomic operations, even sequentially consistent ones.
Optimizers have to be careful to avoid doing so across synchronization points because another thread of execution can observe or modify memory, which means that the traditional optimizations have to consider more intervening instructions than they usually would when considering optimizations to atomic operations.
In the case of dead store elimination it isn’t sufficient to prove that an atomic store post-dominates and aliases another to eliminate the other store.

from N4455 No Sane Compiler Would Optimize Atomics

The problem of atomic DSE, in the general case, is that it involves looking for synchronization points. In my understanding, this term means points in the code where there is a happens-before relationship between an instruction on a thread A and an instruction on another thread B.

Consider this code executed by a thread A:

y.store(1, std::memory_order_seq_cst);
y.store(2, std::memory_order_seq_cst);
y.store(3, std::memory_order_seq_cst);

Can it be optimised as y.store(3, std::memory_order_seq_cst)?

If a thread B were waiting to see y = 2 (e.g. with a CAS), it would never observe that value if the code were optimised.
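
For concreteness, such a thread B might look like this (a sketch, not from the question):

#include <atomic>
extern std::atomic<int> y;

void thread_B() {
    int expected = 2;
    // Spin until y == 2 is observed, then replace it with 4.
    // Nothing guarantees this ever succeeds.
    while (!y.compare_exchange_weak(expected, 4,
                                    std::memory_order_seq_cst)) {
        expected = 2;   // compare_exchange_weak overwrote it on failure
    }
}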

However, in my understanding, having B loop and CAS on y = 2 is a race condition (though, since y is atomic, not a data race), as there is no enforced ordering between the two threads' instructions.
An execution where all of A's instructions are executed before B's loop is observable (i.e. allowed), and thus the compiler can optimise to y.store(3, std::memory_order_seq_cst).

If threads A and B were synchronized, somehow, between the stores in thread A, then the optimisation would not be allowed (a partial order would be induced, possibly guaranteeing that B observes y = 2).
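
For example (a sketch of what such a synchronization point could look like), a release/acquire pair between the stores would make the optimisation invalid:

#include <atomic>

std::atomic<int>  y2(0);
std::atomic<bool> ready(false);

void thread_A2() {
    y2.store(2, std::memory_order_seq_cst);
    ready.store(true, std::memory_order_release);   // synchronization point
    y2.store(3, std::memory_order_seq_cst);
}

void thread_B2() {
    while (!ready.load(std::memory_order_acquire)) { }
    // The release/acquire pair guarantees this load sees y2 == 2 (or a
    // later value in y2's modification order), so y2.store(2) cannot
    // be folded into y2.store(3).
    int v = y2.load(std::memory_order_seq_cst);
    (void)v;
}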

Proving that no such synchronization exists is hard, as it involves considering a broader scope and taking into account all the quirks of an architecture.

To my understanding, due to the relative youth of atomic operations and the difficulty of reasoning about memory ordering, visibility and synchronization, compilers don't perform all the possible optimisations on atomics until a more robust framework for detecting and understanding the necessary conditions is built.

I believe your example is a simplification of the counting thread given above, as it doesn't have any other thread or any synchronization point; for what I can see, I suppose the compiler could have optimised the three stores.

Margaret Bloom
  • 2
    You refer to N4455, but seem to have an entirely different interpretation of N4455 than I do. Even the first example in N4455 is more complex than your example (adds instead of outright stores), and that example is described as "non-contentious" (that optimizations are possible). And given that N4455 also states LLVM implements some of the optimizations mentioned, it's safe to assume that the easiest one is certainly implemented. – MSalters Aug 30 '17 at 20:55
  • @MSalters I thought N4455 was a draft honestly, only one optimisation is listed as implemented ([I wasn't able to reproduce it](https://godbolt.org/g/qJ6F3V)). I believe the first example is not really different from mine: both should be optimizable, but are not. However, while I have an understanding of how this works under the hood, I'm not well-founded in C++ standardese. Surely your understanding is better than mine! I'd never want to spread misinformation, if you see an unfixable flaw in this answer please let me know! – Margaret Bloom Aug 30 '17 at 22:14
  • Hmm, might need a bit of reading up what's happening there. As for N4455 being a draft: that's not really the point; it gives us an inside view from the perspective of compiler developers. That also means they're playing with a code base we don't have yet ;) – MSalters Aug 30 '17 at 23:00
  • Having a thread looping on a CAS or atomic load waiting to see `y=2` is a race condition, but just the ordinary bug kind, not the Undefined Behaviour kind (because it's on an `atomic` type). – Peter Cordes Aug 30 '17 at 23:03
  • 3
    @MSalters: As I understand it, compilers could optimize but for now are choosing not to, because that would violate programmer expectations for things like a progress bar. New syntax is needed to allow programmers to choose. The standard as written allows any possible reordering that could happen on the C++ abstract machine to be picked (at compile time) as the ordering that *always* happens, but this is undesirable. See also http://wg21.link/p0062. – Peter Cordes Aug 30 '17 at 23:04
  • @MSalters: posted [my own answer](https://stackoverflow.com/questions/45960387/why-dont-compilers-merge-redundant-stdatomic-writes/45971285#45971285), since there are so many wrong answers to this question. (Unless *I'm* wrong.) This answer is correct about the central question, but the reason for not optimizing is that compilers on purpose don't as a quality-of-implementation issue, even in cases where they could easily spot the optimization. – Peter Cordes Aug 31 '17 at 00:08
  • 3
    @MargaretBloom: 1) sequentially consistent vs. relaxed doesn't matter here (the difference is only relevant when *other* memory locations come into play). 2) In your ``y==2`` check example, there is what I call a logical race, but no data race. This is a very important distinction. Think "unspecified" vs. "undefined" behavior: might ever see ``y==2``, or might not, but no nasal demons. 3) There is *always* a total order on the operations on a single atomic (even with ``relaxed``). The order may just not be predictable. 4) I agree that atomics can be very confusing. ;-) – Arne Vogel Aug 31 '17 at 10:43
  • That's a lot of words and no mention of the fact that the standard guarantees that every single write will be seen on other threads. Since the hardware cannot guess what is going on in other threads, that precludes any write optimization. – Michaël Roy Sep 05 '17 at 12:35
  • @MichaëlRoy I believe the program has to behave *as if* every single write is visible to another thread if the standard says so. This makes room for optimisations. – Margaret Bloom Sep 05 '17 at 12:50
  • 'As if' is not the same as a guarantee. The write has to be done every single time, as ultimately, the guarantee lies with the hardware. Threads and cores do not do ESP. – Michaël Roy Sep 05 '17 at 13:24
  • @MichaëlRoy "_the standard guarantees that every single write will be seen on other threads_" That "guarantee" makes no sense whatsoever. – curiousguy Dec 08 '18 at 03:07
  • atomic variables are declared volatile, so the compiler cannot do the optimization discussed in the question, as defined in the C++ standard. That's one part of the guarantee, the other part of the guarantee is provided by the cpu through its atomic instructions. – Michaël Roy Dec 29 '18 at 06:19
7

While you are changing the value of an atomic in one thread, some other thread may be checking it and performing an operation based on the value of the atomic. The example you gave is so specific that compiler developers don't see it worth optimizing. However, if one thread is setting e.g. consecutive values for an atomic: 0, 1, 2, etc., the other thread may be putting something in the slots indicated by the value of the atomic.
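
A sketch of that pattern (hypothetical, and inherently timing-dependent: nothing guarantees the other thread observes every intermediate index, which is why the question's example remains legally optimizable):

#include <atomic>

std::atomic<int> slot_index(0);
int slots[16];

void writer_thread() {
    for (int i = 0; i < 16; ++i)
        slot_index.store(i, std::memory_order_release);
}

void other_thread() {
    // Reads the current index and fills the slot it names; it may
    // miss some intermediate values of slot_index entirely.
    int s = slot_index.load(std::memory_order_acquire);
    slots[s] = 42;
}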

Serge Rogatch
  • 3
    An example of this would be a progress bar that gets the current state from an `atomic` while the worker thread does some work and updates the `atomic` without other synchronization. The optimization would allow a compiler to just write 100% once and not do redundant writes which makes the progress bar not show progress. It is debatable whether such an optimization should be allowed. – nwp Aug 30 '17 at 13:29
  • Maybe the example did not occur verbatim, but only after loads of optimizations like inlining and constant-propagation. Anyway, you are saying it can be coalesced, but that it's not worth the bother? – Deduplicator Aug 30 '17 at 16:30
  • 5
    @nwp: The standard as written *does* allow it. Any reordering that's possible on the C++ abstract machine can be chosen at compile time as what *always* happens. This violates programmer expectations for things like progress bars (sinking atomic stores out of a loop that doesn't touch any other atomic variables, because concurrent access to non-atomic vars is UB). For now, compilers choose not to optimize, even though they could. Hopefully there will be new syntax to control when this is allowed. http://wg21.link/p0062 and http://wg21.link/n4455. – Peter Cordes Aug 30 '17 at 22:43
5

NB: I was going to post this as a comment, but it's a bit too wordy.

One interesting fact is that this behavior isn't, in C++ terms, a data race.

Note 21 on p.14 is interesting: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3690.pdf (my emphasis):

The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which *is not atomic*

Also on p.11, note 5:

“Relaxed” atomic operations are not synchronization operations even though, like synchronization operations, they cannot contribute to data races.

So a conflicting action on an atomic is never a data race - in terms of the C++ standard.

These operations are all atomic (and specifically relaxed) but no data race here folks!

I agree there's no reliable/predictable difference between these two on any (reasonable) platform:

#include <atomic>
std::atomic<int> y(0);
void f() {
  auto order = std::memory_order_relaxed;
  y.store(1, order);
  y.store(1, order);
  y.store(1, order);
}

and

#include <atomic>
std::atomic<int> y(0);
void f() {
  auto order = std::memory_order_relaxed;
  y.store(1, order);
}

But within the definition provided by the C++ memory model, it isn't a data race.

I can't easily understand why that definition is provided, but it does hand the developer a few cards to engage in haphazard communication between threads that they may know (on their platform) will statistically work.

For example, setting a value 3 times then reading it back will show some degree of contention for that location. Such approaches aren't deterministic but many effective concurrent algorithms aren't deterministic. For example, a timed-out try_lock_until() is always a race condition but remains a useful technique.
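
For instance (a sketch), a timed lock attempt whose outcome depends on scheduling, yet is perfectly well-defined:

#include <chrono>
#include <mutex>

std::timed_mutex m;

bool try_do_work() {
    using namespace std::chrono;
    // Whether this acquires the lock depends on timing: a race
    // condition, but a well-defined, useful one; not a data race.
    if (m.try_lock_until(steady_clock::now() + milliseconds(10))) {
        // ... critical section ...
        m.unlock();
        return true;
    }
    return false;
}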

It appears, then, that the C++ Standard provides you with certainty around 'data races', while permitting certain fun-and-games with race conditions, which are on final analysis different things.

In short, the standard appears to specify that where other threads may see the 'hammering' effect of a value being set 3 times, other threads must remain able to see that effect (even if they sometimes may not!). And on pretty much all modern platforms, another thread may indeed, under some circumstances, see the hammering.

Persixty
  • 4
    Nobody said it was a data race – LWimsey Aug 30 '17 at 14:09
  • 1
    @LWimsey Indeed and it isn't a data race. That's the point. It's data races that the C++ standard concerns itself with. So the reasoning about race-free observers in the OP is irrelevant. C++ has no problem with race-exposed observers and indeed things like `try_lock_for` invite racing! The answer as to why compilers don't optimize that is because it has defined semantics (raceful or otherwise) and the standard wants those to happen (whatever they may be). – Persixty Aug 30 '17 at 14:16
  • 1
    Spinning on an atomic load of `y` looking for `y==2` is a race condition (and is probably what the OP had in mind when talking about a race-free observer). It's only the garden-variety bug kind of race, not the C++ Undefined Behaviour kind, though. – Peter Cordes Aug 30 '17 at 23:11
2

In short, because the standard (for example the paragraphs around and below 20 in [intro.multithread]) disallows it.

There are happens-before guarantees which must be fulfilled, and which among other things rule out reordering or coalescing writes (paragraph 19 even says so explicitly about reordering).

If your thread writes three values to memory (let's say 1, 2, and 3) one after another, a different thread may read the value. If, for example, your thread is interrupted (or even if it runs concurrently) and another thread also writes to that location, then the observing thread must see the operations in exactly the same order as they happen (either by scheduling or coincidence, or whatever reason). That's a guarantee.

How is this possible if you only do half of the writes (or even only a single one)? It isn't.

What if your thread instead writes out 1, -1, -1, but another one sporadically writes out 2 or 3? What if a third thread observes the location and waits for a particular value that just never appears because it's optimized out?

It is impossible to provide the guarantees that are given if stores (and loads, too) aren't performed as requested. All of them, and in the same order.

Damon
  • 9
    The happens-before guarantees are not violated by the optimization. In a different example they might be, but not in this one. It is clearly possible to provide guarantees for the OP's example. Nothing is being reordered so that part is not relevant to the question. – nwp Aug 30 '17 at 13:37
  • 4
    @Damon Can you be more specific about what parts in the text disallow this optimization? – LWimsey Aug 30 '17 at 14:06
  • @nwp in general it is not possible to work that out. This example is stupidly trivial. If the programmer knows that writing it three times is exactly the same as writing it once, then they should have just written it once. – OrangeDog Aug 30 '17 at 14:46
  • 2
    @OrangeDog So it is unlikely to appear verbatim. Though it could result from constant-propagation, inlining, and any number of other optimizations. – Deduplicator Aug 30 '17 at 16:17
  • 7
    You are saying there is something disallowing coalescing the write in \[intro.multithread]. **Please quote it**. I cannot find it. – Deduplicator Aug 30 '17 at 16:27
  • 1
    This is simply making stuff up. – T.C. Aug 30 '17 at 21:16
  • 3
    @Deduplicator: There is no such language that guarantees that other threads must sometimes see intermediate values from a sequence of writes in another thread. The fact that compilers avoid such optimizations is a quality-of-implementation issue, until the C++ standards committee adds a way to allow it selectively, because it can be a problem. See [my answer](https://stackoverflow.com/questions/45960387/why-dont-compilers-merge-redundant-stdatomic-writes/45971285#45971285) for some links to standards working-group proposals that back up this interpretation that it's allowed. – Peter Cordes Aug 31 '17 at 00:29
  • @Deduplicator: The paragraphs that I quoted do that in their entirety. Reordering is _explicitly_ disallowed (19) as stated, whereas coalescing is not addressed explicitly. It is however _factually_ disallowed because e.g. 15 and 17 cannot be satisfied except in the most contrived trivial examples which do not contain write-write or read-write, or write-read at all (or, well, no concurrency). There is for example no way of A being earlier in the modification order (in respect to e.g. B) if you optimize out A. – Damon Aug 31 '17 at 09:05
  • @Deduplicator All of which would result from equally trivial and pointless code. – OrangeDog Aug 31 '17 at 09:46
  • 1
    @PeterCordes "There is no such language that guarantees that other threads must sometimes see intermediate values from a sequence of writes in another thread." Huh? Loads of languages have that, usually via `volatile` or explicit memory synchronisation points. – OrangeDog Aug 31 '17 at 10:41
  • 2
    @OrangeDog: I meant for non-`volatile` `atomic`. I think you're right that `volatile atomic` would portably disable such optimizations, which would make it sometimes possible on typical current SMP hardware. (So would a compiler memory barrier like GNU C `asm(""::: "memory")`.) But the C++ standard still doesn't guarantee that a thread spinning on `y` ever sees `y==2`. e.g. on a uniprocessor machine it's *very* unlikely, and deterministic context switches or something could make it impossible. – Peter Cordes Aug 31 '17 at 15:50
  • @PeterCordes I've not looked, but I imagine that various of these standard library items (such as `std::atomic`) mandate memory barriers (or equivalent semantics) if supported by the target system. They certainly do in other languages. – OrangeDog Aug 31 '17 at 15:54
  • 1
    @OrangeDog: normally atomics are only ordered with respect to other atomics, because concurrent access to non-`atomic` variables is still UB (data race). `atomic_signal_fence` orders even vanilla variables on gcc, but IDK if that's an implementation-detail or required by the standard. See https://stackoverflow.com/questions/40579342/is-there-any-compiler-barrier-which-is-equal-to-asm-memory-in-c11 for `atomic_thread_fence` vs. `atomic_signal_fence` on gcc. – Peter Cordes Aug 31 '17 at 16:28
  • @PeterCordes that doesn't change anything. The only way to read a write to an atomic is via the atomic. – OrangeDog Aug 31 '17 at 16:30
  • @OrangeDog: It explains why `atomic_thread_fence` or a release-store is *not* equivalent to `asm("" ::: "memory")`: because a compiler-barrier for non-atomics would block optimizations (of non-atomic operations in the same loop or whatever) that the standard doesn't require it to block. – Peter Cordes Aug 31 '17 at 16:33
  • @PeterCordes "There is no such language that guarantees that other threads must sometimes see intermediate values from a sequence of writes in another thread." is what I'm refuting. Everything else is irrelevant detail. – OrangeDog Aug 31 '17 at 16:34
  • @OrangeDog: Ah, I see what you're saying now. All I was trying to claim is that `y.store(1, mo_relaxed)` doesn't imply any kind of barrier, not that you *can't* get a barrier (especially with language extensions like GNU C++). Just to be clear, did you mean a reordering barrier like x86 `mfence`? `std::atomic` lets you get that kind of barrier. But `std::atomic` doesn't have "memory contents must be consistent" optimization barriers; that's an implementation detail that C++11 doesn't specify, because it only matters for stuff like inline asm that goes beyond pure ISO C++. – Peter Cordes Aug 31 '17 at 16:42
  • If `std::atomic` does not, then it's as broken as `volatile` was in Java 1.3. – OrangeDog Aug 31 '17 at 16:43
  • @OrangeDog How is atomic broken? For which use? – curiousguy Dec 08 '18 at 20:53
2

A practical use case for the pattern, if the thread does something important between updates that does not depend on or modify y, might be: *Thread 2 reads the value of y to check how much progress Thread 1 has made.*

So, maybe Thread 1 is supposed to load the configuration file as step 1, put its parsed contents into a data structure as step 2, and display the main window as step 3, while Thread 2 is waiting on step 2 to complete so it can perform another task in parallel that depends on the data structure. (Granted, this example calls for acquire/release semantics, not relaxed ordering.)
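
A sketch of that scenario, with the acquire/release semantics it calls for (the helper functions are hypothetical):

#include <atomic>

void load_config_file();       // step 1 (hypothetical)
void build_data_structure();   // step 2 (hypothetical)
void display_main_window();    // step 3 (hypothetical)
void use_data_structure();     // Thread 2's dependent task (hypothetical)

std::atomic<int> y(0);

void thread_1() {
    load_config_file();
    y.store(1, std::memory_order_release);
    build_data_structure();
    y.store(2, std::memory_order_release);   // Thread 2 waits on this
    display_main_window();
    y.store(3, std::memory_order_release);
}

void thread_2() {
    while (y.load(std::memory_order_acquire) < 2) { /* wait */ }
    use_data_structure();   // safe: the acquire load synchronized with
                            // the release store of 2
}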

I’m pretty sure a conforming implementation allows Thread 1 not to update y at any intermediate step—while I haven’t pored over the language standard, I would be shocked if it does not support hardware on which another thread polling y might never see the value 2.

However, that is a hypothetical instance where it might be pessimal to optimize away the status updates. Maybe a compiler dev will come here and say why that compiler chose not to, but one possible reason is letting you shoot yourself in the foot, or at least stub your toe.

Davislor
  • 2
    Yes the standard allows this, but real compilers don't do these optimization, because there's no syntax for *stopping* them in cases like a progress-bar update, so it's a quality-of-implementation issue. See [my answer](https://stackoverflow.com/questions/45960387/why-dont-compilers-merge-redundant-stdatomic-writes/45971285#45971285) – Peter Cordes Aug 31 '17 at 00:19
  • @PeterCordes Nice answer, especially the links to the actual WG discussions. – Davislor Aug 31 '17 at 00:51
0

The compiler writer cannot just perform the optimisation. They must also convince themselves that the optimisation is valid in the situations where the compiler writer intends to apply it, that it will not be applied in situations where it is not valid, and that it doesn't break code that is in fact broken but "works" on other implementations. This is probably more work than the optimisation itself.

On the other hand, I could imagine that in practice (that is in programs that are supposed to do a job, and not benchmarks), this optimisation will save very little in execution time.

So a compiler writer will look at the cost, then look at the benefit and the risks, and probably will decide against it.

gnasher729
  • While the answer might not be totally true, a compiler might still need to ensure that it won't break its own library or other products... Take for example the Microsoft compiler, their OS and Office. Ensuring that such products do not depend on the fact that such optimizations are not made might not be trivial. – Phil1970 Jul 12 '22 at 02:06
-2

Let's walk a little further away from the pathological case of the three stores being immediately next to each other. Let's assume there's some non-trivial work being done between the stores, that such work does not involve y at all (so that data-path analysis can determine that the three stores are in fact redundant, at least within this thread), and that it does not itself introduce any memory barriers (so that nothing else forces the stores to be visible to other threads). Now it is quite possible that other threads have an opportunity to get work done between the stores, and perhaps those other threads manipulate y, and this thread has some reason to need to reset it to 1 (the 2nd store). If the first two stores were dropped, that would change the behaviour.

Andre Kostur
  • 2
    Is the changed behavior guaranteed? Optimizations change behavior all the time, they tend to make execution faster, which can have a huge impact on timing-sensitive code, yet that is considered valid. – nwp Aug 30 '17 at 14:00
  • The atomic part changes things. That forces the store to be visible to other threads. There's three stores to `y` that must be visible to other threads. If `y` were not atomic, then sure, the optimizer can drop the first two assignments since nothing in this thread could see that they'd been dropped, and nothing guaranteed that the assignments would be visible to other threads. But since it's atomic, and does guarantee the change is visible to other threads, the optimizer cannot drop that code. (Not without somehow validating that _everywhere_ else doesn't use it either.) – Andre Kostur Aug 30 '17 at 14:10
  • But 1 write already makes it visible to other threads. How would the other threads figure out the difference between 1 and 3 writes? – nwp Aug 30 '17 at 14:12
  • See my full answer. Step away from the pathological case where the three stores are literally next to each other. Put some non-trivial code in there to make it more likely that other threads get some time slices between those stores. Have a second thread that all it does is read `y`, if `y` == 1, do something (which is much faster than the extra work we added to the first thread) and set it to 2. This second thread should be triggered 3 times, not just once at the end. – Andre Kostur Aug 30 '17 at 14:20
  • 3
    @AndreKostur 'should be'? If you're relying on that, your program logic is broken. An optimizer's job is to produce a valid output for less effort. 'thread 2 gets no time slices between the stores' is a perfectly valid outcome. – PeteC Aug 30 '17 at 14:40
  • @PeteC So is "thread 2 (and 3, and 4...) gets lots of time slices between the stores", and because it's an atomic store, that must be visible to the other threads. The optimizer is not allowed to optimize away that visible side-effect as it has to assume that something can get between those two stores and may go and do its own store. – Andre Kostur Aug 30 '17 at 15:30
  • 2
    The standard as written *does* allow compilers to optimize away the window for another thread to do something. Your reasoning for that (and stuff like a progress bar), are why real compilers choose not to do such optimizations. See [my answer](https://stackoverflow.com/questions/45960387/why-dont-compilers-merge-redundant-stdatomic-writes/45971285#45971285) for some links to C++ standards discussions about allowing giving programmers control so optimizations can be done where helpful and avoided where harmful. – Peter Cordes Aug 31 '17 at 00:23
  • @AndreKostur "_Put some non-trivial code in there_" what code? Code doing number crunching? doing thread synchronisation? doing I/O? – curiousguy Dec 08 '18 at 18:21
-5

Since variables contained within an std::atomic object are expected to be accessed from multiple threads, one should expect that they behave, at a minimum, as if they were declared with the volatile keyword.

That was the standard and recommended practice before CPU architectures introduced cache lines, etc.

[EDIT2] One could argue that std::atomic<> are the volatile variables of the multicore age. As defined in C/C++, volatile is only good enough to synchronize atomic reads from a single thread, with an ISR modifying the variable (which in this case is effectively an atomic write as seen from the main thread).

I personally am relieved that no compiler would optimize away writes to an atomic variable. If the write is optimized away, how can you guarantee that each of these writes could potentially be seen by readers in other threads? Don't forget that that is also part of the std::atomic<> contract.

Consider this piece of code, where the result would be greatly affected by wild optimization by the compiler.

#include <atomic>
#include <thread>

static const int N{ 1000000 };
std::atomic<int> flag{1};
std::atomic<bool> do_run { true };

void write_1()
{
    while (do_run.load())
    {
        flag = 1; flag = 1; flag = 1; flag = 1;
        flag = 1; flag = 1; flag = 1; flag = 1;
        flag = 1; flag = 1; flag = 1; flag = 1;
        flag = 1; flag = 1; flag = 1; flag = 1;
    }
}

void write_0()
{
    while (do_run.load())
    {
        flag = -1; flag = -1; flag = -1; flag = -1;
    }
}


int main(int argc, char** argv) 
{
    int counter{};
    std::thread t0(&write_0);
    std::thread t1(&write_1);

    for (int i = 0; i < N; ++i)
    {
        counter += flag;
        std::this_thread::yield();
    }

    do_run = false;

    t0.join();
    t1.join();

    return counter;
}

[EDIT] At first, I was not claiming that volatile was central to the implementation of atomics, but...

Since there seemed to be doubts as to whether volatile had anything to do with atomics, I investigated the matter. Here's the atomic implementation from the VS2017 stl. As I surmised, the volatile keyword is everywhere.

// from file atomic, line 264...

        // TEMPLATE CLASS _Atomic_impl
template<unsigned _Bytes>
    struct _Atomic_impl
    {   // struct for managing locks around operations on atomic types
    typedef _Uint1_t _My_int;   // "1 byte" means "no alignment required"

    constexpr _Atomic_impl() _NOEXCEPT
        : _My_flag(0)
        {   // default constructor
        }

    bool _Is_lock_free() const volatile
        {   // operations that use locks are not lock-free
        return (false);
        }

    void _Store(void *_Tgt, const void *_Src, memory_order _Order) volatile
        {   // lock and store
        _Atomic_copy(&_My_flag, _Bytes, _Tgt, _Src, _Order);
        }

    void _Load(void *_Tgt, const void *_Src,
        memory_order _Order) const volatile
        {   // lock and load
        _Atomic_copy(&_My_flag, _Bytes, _Tgt, _Src, _Order);
        }

    void _Exchange(void *_Left, void *_Right, memory_order _Order) volatile
        {   // lock and exchange
        _Atomic_exchange(&_My_flag, _Bytes, _Left, _Right, _Order);
        }

    bool _Compare_exchange_weak(
        void *_Tgt, void *_Exp, const void *_Value,
        memory_order _Order1, memory_order _Order2) volatile
        {   // lock and compare/exchange
        return (_Atomic_compare_exchange_weak(
            &_My_flag, _Bytes, _Tgt, _Exp, _Value, _Order1, _Order2));
        }

    bool _Compare_exchange_strong(
        void *_Tgt, void *_Exp, const void *_Value,
        memory_order _Order1, memory_order _Order2) volatile
        {   // lock and compare/exchange
        return (_Atomic_compare_exchange_strong(
            &_My_flag, _Bytes, _Tgt, _Exp, _Value, _Order1, _Order2));
        }

private:
    mutable _Atomic_flag_t _My_flag;
    };

All of the specializations in the MS stl use volatile on the key functions.

Here's the declaration of one of such key function:

 inline int _Atomic_compare_exchange_strong_8(volatile _Uint8_t *_Tgt, _Uint8_t *_Exp, _Uint8_t _Value, memory_order _Order1, memory_order _Order2)

You will notice the required volatile uint8_t* holding the value contained in the std::atomic. This pattern can be observed throughout the MS std::atomic<> implementation. There is no reason for the gcc team, nor any other stl provider, to have done it differently.

Michaël Roy
  • 10
    `volatile` has nothing to do with atomics – login_not_failed Aug 30 '17 at 13:18
  • 2
    @login_not_failed But `volatile` has a lot to do with not optimizing away memory accesses, which is one effect of using atomics. Atomics add some really important guarantees on top of that (atomicity, and ordering), but the "don't optimize this away!" semantics apply to both. – cmaster - reinstate monica Aug 30 '17 at 13:29
  • 1
    I didn't say they were declared with volatile - I suspect some implementations actually do. But atomics should at the very least behave similarly. – Michaël Roy Aug 30 '17 at 13:37
  • @login_not_failed: I had the same initial reaction, but I believe mentioning `volatile` is correct here. It's no surprise that compilers handle `std::atomic` variables differently from regular variables, **much like** they handle `volatile` variables differently, and for similar reasons: in both cases, something unusual makes it possible for the variable's value to change unexpectedly, so the compiler must actually read from and write to the variable instead of caching previous reads/writes. – Max Lybbert Aug 30 '17 at 13:45
  • 3
    It is wrong though. `volatile` does things that `atomic`s don't, specifically `volatile` assumes you do not talk to memory, but to devices, where writing 1, 2, 3 might be a startup sequence that must arrive exactly like that and reading that location might give you the current temperature. `atomic` assumes you are using regular memory where you read what you last wrote. – nwp Aug 30 '17 at 13:56
  • volatile doesn't do anything, apart from disabling some optimizations. When the volatile modifier is attached to a variable, it does indeed apply to reads and writes to memory. As a language, C++ doesn't know about devices, and there are no mentions of devices in the standard. The volatile modifier can also be attached to a piece of assembly code, which is how at least some implementations of atomic<> are defined. – Michaël Roy Aug 30 '17 at 14:01
  • @MichaëlRoy you have misunderstood the use of the volatile method qualifier. See https://stackoverflow.com/questions/16746070/ – PeteC Aug 30 '17 at 14:19
  • @PeteC. Of course I didn't. And neither did the MS stl team. Here is how atomic operations are typically specialized: `inline int _Atomic_compare_exchange_strong_8(volatile _Uint8_t *_Tgt, _Uint8_t *_Exp, _Uint8_t _Value, memory_order _Order1, memory_order _Order2)` You will notice that this function requires a pointer to a volatile. All specializations are like this. – Michaël Roy Aug 30 '17 at 14:31
  • 1
    @MichaëlRoy that quote wasn't part of the STL snippet in your answer. Sure, you can atomically operate on a volatile object. That doesn't mean that atomics are volatile or that volatiles are atomic. Your assertion that "one should expect that [atomics] behave, at a minimum, as if they were declared with the volatile keyword" is flat-out wrong. – PeteC Aug 30 '17 at 14:36
  • That quote is straight from the MS VS2017 stl, version 14.10.25017, file 'xatomic.h', line 2028. I sure am not going to paste the entire stl atomic implementation. Look at the code in the stl you use and you will see that the atomic variable will get volatile status at some point. – Michaël Roy Aug 30 '17 at 14:44
  • @PeteC The volatile keyword was introduced in C to designate variables that could be modified from within an ISR, and read from the main program thread. That behaviour is very similar to a variable that could be modified from a different thread. This is the very same trait expected from atomic<>. They have no other practical use. Atomics _are_ the volatile variables of the multi-core age. – Michaël Roy Aug 30 '17 at 14:56
  • @MichaëlRoy please read https://stackoverflow.com/questions/8819095/ , I can't improve on the answers by Anthony Williams and James Kanze. – PeteC Aug 30 '17 at 15:09
  • I cannot improve mine. std::atomic<> _are_ volatile variables. And their only reason to exist is to provide the functionality of `volatile` on multi-core CPUs. – Michaël Roy Aug 30 '17 at 15:12
  • 1
    `volatile std::atomic<>` is different from regular `std::atomic<>`. However, they both imply possible asynchronous modification, @nwp. https://stackoverflow.com/a/2479474/224132 – Peter Cordes Aug 30 '17 at 23:18
  • 1
    @MichaëlRoy: In MSVC, `volatile` implies some ordering, so it goes beyond what `volatile` means in C++11. Quoting examples from MSVC is totally pointless. So is talking about the meaning of something like `asm volatile()` in GNU C. Those are both language extensions, and don't tell us anything about what `volatile` means in ISO C++11. See also **[std::memory_order's relationship with volatile](http://en.cppreference.com/w/cpp/atomic/memory_order#Relationship_with_volatile) on cppreference**. – Peter Cordes Aug 30 '17 at 23:20
  • 2
    `volatile atomic y` would actually disallow this optimization, because it implies the store could have a side-effect. (The standard doesn't mention "IO devices", but IIRC it does describe `volatile` accesses as ones that may have side-effects.) – Peter Cordes Aug 31 '17 at 00:36
  • @Peter The very same applies to gcc and any other compilers. The guarantees offered by volatile are actually a subset of the guarantees offered by std::atomic. And that's the reason why you will find that all implementations, of std::atomic, not just msvc's, but boost's, gcc's, and clang as well do use the keyword. std::atomic would simply not work without the variable held being promoted to a volatile variable at some point. The keyword is there to tell the compiler to maintain the ordering. Otherwise, we'd have chaos. – Michaël Roy Aug 31 '17 at 03:23
  • 1
    There is overlap, but `volatile` is not a strict subset of `atomic`. The compiler headers have `volatile atomic` all over the place for the same reason other headers have `const` all over the place: so you *can* use them on `volatile atomic` types, as well as on regular `atomic` types. https://stackoverflow.com/questions/2479067/why-is-the-volatile-qualifier-used-through-out-stdatomic. – Peter Cordes Aug 31 '17 at 04:01
  • 1
    `volatile` provides no ordering guarantees for the order other threads will see your stores. On compilers other than MSVC, it has no more ordering than `mo_relaxed`. Atomic ordering stuff is exposed directly by compilers. For gcc, see https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html. `__atomic_load_n(ptr, __ATOMIC_ACQUIRE);` works without including any headers (https://godbolt.org/g/nrW8ak for x86 and powerpc asm output), it's pure compiler built-in stuff because gcc has those different orderings built-in. `volatile` isn't a necessary part of ``. – Peter Cordes Aug 31 '17 at 04:05
  • Who said `volatile atomic` ?? Certainly not me. You obviously haven't read the stl, nor boost's code. – Michaël Roy Aug 31 '17 at 04:19
  • And `__atomic_load_n` is compiler specific, it is also not a library function, but an intrinsic built in the compiler. – Michaël Roy Aug 31 '17 at 04:29
  • 3
    And you think VS2017's headers *aren't* compiler-specific? /facepalm. Also, the functions that you quote in your answer use `volatile` or `const volatile` on the functions in exactly the way I was talking about: to allow those member functions to be used on `volatile atomic` objects. e.g. `bool _Is_lock_free() const volatile`. If they didn't care about `volatile atomic`, they wouldn't use the `volatile` keyword at all. – Peter Cordes Aug 31 '17 at 15:34
  • After 25 years, I sure can tell the difference between ansi-compliant c++ and a compiler intrinsic. https://gcc.gnu.org/onlinedocs/gcc-4.1.0/gcc/Atomic-Builtins.html ... I'm speechless. – Michaël Roy Aug 31 '17 at 21:09
  • @MichaëlRoy The link you provided is for GCC extensions to C in gcc 4.1.0 from over 11 years ago. Things have changed in modern C++. – janm Sep 14 '17 at 15:33
  • gcc still supports this **intrinsic**. – Michaël Roy Sep 14 '17 at 15:40
  • After 25 years you have no understanding of volatile in C/C++. volatile cannot possibly help implementing atomics. – curiousguy Dec 08 '18 at 03:15
  • @nwp "_specifically volatile assumes you do not talk to memory, but to devices,_" volatile doesn't assume anything about the semantics of object access; the only assumption is that there is no assumption. Object access matters when dealing with a volatile object: any volatile read must look at the stored value and a volatile write must write a new value. This is useful if you intend to use a debugger and change the value in that object while the program is paused on a breakpoint on a volatile access: changing the value in the debugger will be equivalent with a C/C++ assignment. – curiousguy Dec 08 '18 at 03:20
  • So volatile really guarantees something about stepping through the program. (That implies the program is paused.) Most uses of volatile probably don't deal with external devices. volatile can also be used to implement consume memory ordering in an intelligible and implementable way, unlike the pathetic mess that is the current consume C++ specification. – curiousguy Dec 08 '18 at 03:23
  • @curiousguy: No. atomic is a software-only feature. It basically prevents the compiler from optimizing the variable away in a register. That's the only effect of the volatile keyword. No effect on the processor itself. – Michaël Roy Dec 29 '18 at 06:12
  • @MichaëlRoy "_atomic is a software-only feature_" Did you mean: **volatile** is a software-only feature? – curiousguy Oct 25 '19 at 01:42
  • @MaxLybbert "_so the compiler must actually read from and write to the variable instead of caching previous reads/writes_" What in the std text prevents caching for atomic objects? – curiousguy Oct 25 '19 at 01:52
  • 1
    @curiousguy That's what I meant. `volatile` is a software feature, while `atomic` is a hardware feature. Sorry for the confusion. – Michaël Roy Oct 25 '19 at 17:13