32

In C++, can atomics suffer spurious stores?

For example, suppose that m and n are atomics and that m = 5 initially. In thread 1,

    m += 2;

In thread 2,

    n = m;

Result: the final value of n should be either 5 or 7, right? But could it spuriously be 6? Could it spuriously be 4 or 8, or even something else?

In other words, does the C++ memory model forbid thread 1 from behaving as though it did this?

    ++m;
    ++m;

Or, more weirdly, as though it did this?

    tmp  = m;
    m    = 4;
    tmp += 2;
    m    = tmp;

Reference: H.-J. Boehm & S. V. Adve, 2008, Figure 1. (If you follow the link, then, in the paper's section 1, see the first bulleted item: "The informal specifications provided by ...")

THE QUESTION IN ALTERNATE FORM

One answer (appreciated) shows that the question above can be misunderstood. If helpful, then here is the question in alternate form.

Suppose that the programmer tried to tell thread 1 to skip the operation:

    bool a = false;
    if (a) m += 2;

Does the C++ memory model forbid thread 1 from behaving, at run time, as though it did this?

    m += 2; // speculatively alter m
    m -= 2; // oops, should not have altered! reverse the alteration

I ask because Boehm and Adve, earlier linked, seem to explain that a multithreaded execution can

  • speculatively alter a variable, but then
  • later change the variable back to its original value when the speculative alteration turns out to have been unnecessary.

COMPILABLE SAMPLE CODE

Here is some code you can actually compile, if you wish.

#include <iostream>
#include <atomic>
#include <thread>

// For the original question, do_alter = true.
// For the question in alternate form, do_alter = false.
constexpr bool do_alter = true;

void f1(std::atomic_int *const p, const bool do_alter_)
{
    if (do_alter_) p->fetch_add(2, std::memory_order_relaxed);
}

void f2(const std::atomic_int *const p, std::atomic_int *const q)
{
    q->store(
        p->load(std::memory_order_relaxed),
        std::memory_order_relaxed
    );
}

int main()
{
    std::atomic_int m(5);
    std::atomic_int n(0);
    std::thread t1(f1, &m, do_alter);
    std::thread t2(f2, &m, &n);
    t2.join();
    t1.join();
    std::cout << n << "\n";
    return 0;
}

This code always prints 5 or 7 when I run it. (In fact, as far as I can tell, it always prints 7 when I run it.) However, I see nothing in the semantics that would prevent it from printing 6, 4 or 8.

The excellent Cppreference.com states, "Atomic objects are free of data races," which is nice, but in such a context as this, what does it mean?

Undoubtedly, all this means that I do not understand the semantics very well. Any illumination you can shed on the question would be appreciated.

ANSWERS

@Christophe, @ZalmanStern and @BenVoigt each illuminate the question with skill. Their answers cooperate rather than compete. In my opinion, readers should heed all three answers: @Christophe first; @ZalmanStern second; and @BenVoigt last to sum up.

thb
  • Even for non-atomic addition and assignment, an integer variable can't get "spurious" values like you're worried about. And (in practice) that goes even for possible data races like the one you could have. An operation such as `m += 2` is equal to `m = m + 2`, not two consecutive increments by one. – Some programmer dude Nov 05 '17 at 13:16
  • And why do you ask? Do you have an *actual* problem that leads to this question? Perhaps you should ask about *that* instead? – Some programmer dude Nov 05 '17 at 13:16
  • No they cannot. That's what "atomic" means. Indivisible, cannot observe separate parts. – nwp Nov 05 '17 at 13:21
  • @Someprogrammerdude: No, I have no actual problem. I am reading Stroustrup's and Williams' chapters on low-level concurrency primitives and lock-free programming. I am also reading Holzmann's book on concurrent model checking, and I have read Boehm's and Adve's paper, as linked. I am trying to understand what I am reading. That's all. – thb Nov 05 '17 at 13:23
  • @nwp: That's what I thought, but then I read Boehm's and Adve's paper, as linked. Now I am not so sure. This is why I ask. – thb Nov 05 '17 at 13:24
  • If atomics allowed that they would be useless. So no. – Jesper Juhl Nov 05 '17 at 13:40
  • @thb by the way, you can observe more results by adding some randomness in the execution ([online demo](https://ideone.com/Fj4Vke)). But you'll find out only 5 and 7 ;-) – Christophe Nov 05 '17 at 14:32
  • Note that the book predates C++ thread support – PlasmaHH Nov 05 '17 at 17:28
  • @Someprogrammerdude: For non-atomic non-volatile operations, variables, including integer variables of all sizes, most certainly can contain intermediate spurious values. Torn writes are the most obvious example, but speculative writes as mentioned in the question are also possible. Atomic and volatile operations avoid these stores of spurious values. (But beware, while read and write are atomic for `volatile`s, read-modify-write operations are not) – Ben Voigt Nov 05 '17 at 18:28
  • @Someprogrammerdude: It is very much a possibility that multithreaded code has bugs that will very, very, very rarely show up in real life. Rarely, but not never. I very much prefer code where I _know_ that it works to code that I haven't seen failing (yet). – gnasher729 Nov 05 '17 at 19:05
  • @BenVoigt Which volatile operations are guaranteed to be atomic? All writes? – curiousguy Nov 22 '17 at 02:04
  • @curiousguy: `volatile` accesses have to generate the exact same sequence of memory accesses as are found in the code. So reads and writes have to be atomic (operations that can't be made atomic, such as overwide ones, have to fail in a fully compliant implementation), and increments, decrements, compound assignments have to perform separate read and write operations, exactly one of each. – Ben Voigt Nov 22 '17 at 04:15
  • @BenVoigt What about volatile struct in C? Should modification be atomic? – curiousguy Nov 22 '17 at 22:26
  • @Someprogrammerdude "_an integer variable can't get "spurious" values like you're worried about_" How would you even *detect* such spurious values? – curiousguy Nov 22 '17 at 22:36
  • @curiousguy Because `int` used to be the "natural" word length, and on Intel processors it still is an atomic operation to write to a "natural" (or *smaller!*) word. Not only on Intel processors by the way. I don't know of *any* architecture that doesn't have atomic writes to the "natural" (or smaller) word size integer. – Some programmer dude Nov 22 '17 at 22:42
  • @Someprogrammerdude I agree that writes to `int` are usually atomic; but because `int` is defined by the compiler and not the CPU, the question is not about physical architecture but about the programming environment. – curiousguy Nov 22 '17 at 23:32
  • @curiousguy: `volatile` on a struct really ends up modifying the member accesses. Each member access should be volatile; there are no entire-struct operations in C. For instance, copying is defined as a sequence of member operations. I'm not sure if the order is specified; if it is, then non-volatile memberwise copy could be reordered but volatile struct copy would preserve order. – Ben Voigt Nov 23 '17 at 05:36
  • @Someprogrammerdude Surely you meant *aligned* write. – Arne Vogel Sep 07 '18 at 11:52

3 Answers

23

Your code makes use of fetch_add() on the atomic, which gives the following guarantee:

Atomically replaces the current value with the result of arithmetic addition of the value and arg. The operation is read-modify-write operation. Memory is affected according to the value of order.

The semantics are crystal clear: before the operation the value is m, after the operation it is m+2, and no thread can observe anything in between these two states, because the operation is atomic.


Edit: additional elements regarding your alternate question

Whatever Boehm and Adve may say, C++ compilers obey the following standard clause:

1.9/5: A conforming implementation executing a well-formed program shall produce the same observable behavior as one of the possible executions of the corresponding instance of the abstract machine with the same program and the same input.

If a C++ compiler generated code that allowed speculative updates to interfere with the observable behavior of the program (i.e., to produce something other than 5 or 7), it would not be standard-compliant, because it would fail to provide the guarantee mentioned in my initial answer.

Christophe
  • If you are still interested, I have edited the question to clarify why I understand (or misunderstand) Boehm and Adve to contradict your answer. To the question, I have added a new section, THE QUESTION IN ALTERNATE FORM. – thb Nov 05 '17 at 14:35
  • @thb: Boehm and Adve are pointing out problems with ordinary (non-atomic, non-volatile) memory access. atomic is the tool for avoiding the problems they point out. – Ben Voigt Nov 05 '17 at 18:31
  • @Christophe: However, one must also consider the meaning of "observable behavior" used by the Standard. It includes only observations made by code in the same program. Both C and C++ expressly allow for the possibility of using an explicit lock for implementing the atomic functions, and in that case observations which lack type-safety, such as OS cross-process memory access, memory-mapped hardware, or debugging tools, may still be able to observe violations of the "observable behavior" model. – Ben Voigt Nov 05 '17 at 18:42
  • @Christophe, atomic reads or writes are not *observable behaviors*; only I/O and volatile writes are *observable behavior*. A very smart compiler could actually transform the `m+=2;` into `m++` if the value of `n` were just used in a test as in `if (n>5) std::cout<<"bingo";` and `m` not used for anything else than initializing `n`. – Oliv Nov 05 '17 at 19:58
  • @Oliv yes, but we are playing with the words here. In OP's sample, n gets printed in the end, so we come back again to the value that n should/could have at this moment according to the rules of the standard. The same applies to the conditional print, which should only be printed depending on n, which should then obey the same rules of the standard. You are however right in that an atomic that would be updated but never used afterwards could be optimized away by the compiler (thus completely eliminating the issue of the spurious change) – Christophe Nov 05 '17 at 23:01
  • Another potentially-interesting question is whether the transformation of `++m; ++m;` to `m += 2;` is allowed. For atomics, it would be. For `volatile`, it is not. – Ben Voigt Nov 05 '17 at 23:09
  • @BenVoigt I was talking about use of atomics in different threads of the same program. These have to obey the multi-thread visibility rules and thus can't infringe the atomicity of an atomic operation. Of course, you are completely right in that CPU-level introspection could reveal intermediary steps (especially if the atomics are not lock free) resulting from optimizations. It's just that from the C++ perspective, the other thread can't see that. – Christophe Nov 05 '17 at 23:10
  • @BenVoigt an atomic `m+=2;` could potentially be implemented with `++m; ++m;` under the condition that the other threads see either the value of m before, or after the two consecutive operations, but not in-between. Note that I would be surprised to find any compiler doing this, although I'd be less surprised of weird arithmetic transformations for some multiplication or division. – Christophe Nov 05 '17 at 23:17
21

The existing answers provide a lot of good explanation, but they fail to give a direct answer to your question. Here we go:

can atomics suffer spurious stores?

Yes, but you cannot observe them from a C++ program which is free from data races.

Only volatile is actually prohibited from performing extra memory accesses.

does the C++ memory model forbid thread 1 from behaving as though it did this?

++m;
++m;

Yes, but this one is allowed:

lock (shared_std_atomic_secret_lock)
{
    ++m;
    ++m;
}

It's allowed but stupid. A more realistic possibility is turning this:

std::atomic<int64_t> m;
++m;

into

memory_bus_lock
{
    ++m.low;
    if (last_operation_did_carry)
       ++m.high;
}

where memory_bus_lock and last_operation_did_carry are features of the hardware platform that can't be expressed in portable C++.

Note that peripherals sitting on the memory bus do see the intermediate value, but can interpret this situation correctly by looking at the memory bus lock. Software debuggers won't be able to see the intermediate value.

In other cases, atomic operations can be implemented by software locks, in which case:

  1. Software debuggers can see intermediate values, and have to be aware of the software lock to avoid misinterpretation
  2. Hardware peripherals will see changes to the software lock, and intermediate values of the atomic object. Some magic may be required for the peripheral to recognize the relationship between the two.
  3. If the atomic object is in shared memory, other processes can see the intermediate values and may not have any way to inspect the software lock / may have a separate copy of said software lock
  4. If other threads in the same C++ program break type safety in a way that causes a data race (For example, using memcpy to read the atomic object) they can observe intermediate values. Formally, that's undefined behavior.

One last important point. The "speculative write" is a very complex scenario. It's easier to see this if we rename the condition:

Thread #1

if (my_mutex.is_held) o += 2; // o is an ordinary variable, not atomic or volatile
return o;

Thread #2

{
    scoped_lock l(my_mutex);
    return o;
}

There's no data race here. If Thread #1 has the mutex locked, the write and read can't occur unordered. If it doesn't have the mutex locked, the threads run unordered but both are performing only reads.

Therefore the compiler cannot allow intermediate values to be seen. This C++ code is not a correct rewrite:

o += 2;
if (!my_mutex.is_held) o -= 2;

because the compiler invented a data race. However, if the hardware platform provides a mechanism for race-free speculative writes (Itanium perhaps?), the compiler can use it. So hardware might see intermediate values, even though C++ code cannot.

If intermediate values shouldn't be seen by hardware, you need to use volatile (possibly in addition to atomics, because volatile read-modify-write is not guaranteed atomic). With volatile, asking for an operation which can't be performed as-written will result in compilation failure, not spurious memory access.

Ben Voigt
  • The only thing I'm aware of that works like `memory_bus_lock{ multiple ops }` is transactional memory. (e.g. Intel's TSX. https://www.realworldtech.com/haswell-tm/). AFAIK, that's the only way you could implement `int64_t m;` `++m` using the algorithm you describe in assembly language on x86 or any typical RISC. (Most RISC CPUs support [LL/SC](https://en.wikipedia.org/wiki/Load-link/store-conditional), but not nested, so you can't atomically operate on a pair of words.) – Peter Cordes Nov 05 '17 at 19:33
  • On x86, you can atomically increment a double-word integer (like `int64_t` on 32-bit x86) using a double-word CAS (which x86 *does* have even without TSX-NI transactional memory: `cmpxchg8b`). Load both words, increment in registers, then try to CAS that back into memory. Else retry. There aren't asm instructions to lock the memory bus. (And of course normally an atomic op only internally locks the one cache line it operates on, i.e. not responding to MESI requests to Invalidate or Share the cache line between the read and the write of an atomic RMW). – Peter Cordes Nov 05 '17 at 19:37
  • Anyway, upvoted because the mechanism you propose is allowed by the C++ standard. No mainstream hardware works that way, but it's an interesting thought experiment. And also for the suggestion to use `volatile atomic` when you care about the internals. – Peter Cordes Nov 05 '17 at 19:38
  • @PeterCordes: Those weren't necessarily meant to correspond to assembly instructions, consider for example x86's #LOCK signal (which that `cmpxchg8b` uses, but can also be controlled from assembly via the `LOCK` prefix) along with multiple memory-bus actions, which can be seen by memory-mapped hardware, cache coherency snoops from other cores, etc. When you take "observable behavior" to the hardware level, it's about the (hierarchical) memory bus cycles not the number of instructions. – Ben Voigt Nov 05 '17 at 19:46
  • @PeterCordes: Anyway, I intended for that example to correspond to x86 `LOCK INC m64` but I don't know what other actions might appear in the uop decode on a current-generation Core CPU. – Ben Voigt Nov 05 '17 at 19:53
  • Right, `cmpxchg8b` is normally used with the `lock` prefix to make it atomic. (It isn't by default.) But for an aligned pointer to cacheable memory, that doesn't do anything externally visible like assert `#LOCK` ([it just does a much more efficient cache-lock of that line in the core running that operation](https://stackoverflow.com/questions/39393850/can-num-be-atomic-for-int-num/39396999#39396999)). On an uncacheable MMIO region you'd get an externally-visible signal, though. – Peter Cordes Nov 05 '17 at 19:55
  • `lock inc m64` is only encodeable in 64-bit mode, and does a single 64-bit read-modify-write with a 64-bit load/store and a 64-bit adder in the ALU. The entire 64-bit chunk of memory is read/rewritten. (I'm not 100% sure you could design an experiment that tells the difference with `lock`. Without `lock` it's easy: store-forwarding to a 64-bit load of the same address works, but it wouldn't for two 32-bit `add`/`adc` operations.) The `.low`/`.high` stuff only makes any sense in 32-bit mode where you can't just ask the hardware to do a single 64-bit RMW. – Peter Cordes Nov 05 '17 at 19:58
  • @PeterCordes: Yeah, on modern x86 I guess the closest to that sequence would be a locked and misaligned read-modify-write operation to uncacheable memory. And then it would probably calculate and store the entire new value, even bytes that don't change, inside the bus lock. But there would be multiple bus cycles within a single lock, visible to memory-mapped peripherals and bus snoopers... because x86 just doesn't provide a way to do unaligned write with a single bus cycle. Whether it is actually possible in C++ to create an unaligned `std::atomic` is another question. – Ben Voigt Nov 05 '17 at 22:53
  • Yeah, that's true, you would get multiple cycles from misaligned. AFAIK, compilers don't have any code-gen for misaligned atomics. If you cast a misaligned pointer to `std::atomic`, or maybe use a `packed` struct, you will still get the regular code-gen, so pure-load and pure-store will just be `mov` and thus not atomic. x86 doesn't have a way to do a misaligned atomic load other than `lock cmpxchg` with desired=expected, but that's really a RMW. gcc uses `lock cmpxchg16b` for `atomic<16B>`, but it makes loads much more expensive. (32-bit mode can use SSE or x87 64-bit loads/stores) – Peter Cordes Nov 05 '17 at 23:02
  • @PeterCordes Literally locking out other processors from accessing main memory is not an inconceivable feature, especially on embedded systems. – user253751 Nov 05 '17 at 23:31
  • @immibis: Alternatively, single processor embedded systems very commonly disable interrupts globally as a way to simulate atomicity. – Ben Voigt Nov 06 '17 at 15:15
  • @BenVoigt Single processor systems generally also can't be interrupted in the middle of an instruction, so the processor doesn't need to take a bus lock. – user253751 Nov 06 '17 at 22:05
  • @immibis: Right, it's a software lock of sorts (not directly known to hardware peripherals) not a hardware lock, but it is still a way of grouping suboperations to prevent them from being pre-empted. – Ben Voigt Nov 06 '17 at 23:30
5

Your revised question differs quite a bit from the first in that we've moved from sequential consistency to relaxed memory order.

Both reasoning about and specifying weak memory orderings can be fairly tricky. E.g. note the difference between the C++11 and C++14 specifications pointed out here: http://en.cppreference.com/w/cpp/atomic/memory_order#Relaxed_ordering . However, the definition of atomicity does prevent the fetch_add call from allowing any other thread to see values other than ones otherwise written to the variable, or one of those plus 2. (A thread can do pretty much anything so long as it guarantees the intermediate values are not observable by other threads.)

(To get dreadfully specific, you likely want to search for "read-modify-write" in the C++ spec, e.g. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/n4659.pdf .)

Perhaps giving a specific reference to the place in the linked paper that you have questions about would help. That paper slightly predates the first C++ concurrent memory model specification (in C++11), and we're now another revision beyond that, so it may be somewhat out of date with respect to what the standard actually says; I expect this is more an issue of it describing things that could happen to non-atomic variables.

EDIT: I'll add a bit more about "the semantics" to perhaps help think about how to analyze this kind of thing.

The goal of memory ordering is to establish a set of possible orders between reads and writes to variables across threads. In weaker orderings, it is not guaranteed that there is any single global ordering that applies to all threads. This alone is already tricky enough that one should make sure it is fully understood before moving on.

Two things involved in specifying an ordering are addresses and synchronization operations. In effect a synchronization operation has two sides and those two sides are connected via sharing an address. (A fence can be thought of as applying to all addresses.) A lot of the confusion in the space comes from figuring out when a synchronization operation on one address guarantees something for other addresses. E.g. mutex lock and unlock operations only establish ordering via the acquire and release operations on the addresses inside the mutex, but that synchronization applies to all reads and writes by the threads locking and unlocking the mutex. An atomic variable accessed using relaxed ordering places few constraints on what happens, but those accesses may have ordering constraints imposed by more strongly ordered operations on other atomic variables or mutexes.

The main synchronization operations are acquire and release. See: http://en.cppreference.com/w/cpp/atomic/memory_order . These names reflect what happens with a mutex. The acquire operation applies to loads and prevents any memory operations on the current thread from being reordered past the point where the acquire happens. It also establishes an ordering with any prior release operations on the same variable. The last bit is governed by the value loaded. I.e. if the load returns a value from a given write with release synchronization, the load is now ordered against that write, and all other memory operations by those threads fall into place according to the ordering rules.

Atomic, or read-modify-write, operations are their own little sequence in the larger ordering. It is guaranteed that the read, the operation, and the write happen atomically. Any other ordering is given by the memory order parameter to the operation. E.g. specifying relaxed ordering says no constraints otherwise apply to any other variables. I.e. there is no acquire or release implied by the operation. Specifying memory_order_acq_rel says that not only is the operation atomic, but that the read is an acquire and the write is a release -- if the thread reads a value from another write with release semantics, all other atomics now have the appropriate ordering constraint in this thread.

A fetch_add with relaxed memory order might be used for a statistics counter in profiling. At the end of the run, all threads will have done something else to ensure all those counter increments are now visible to the final reader, but in the intermediate state we don't care, so long as the final total adds up. However, this does not imply that intermediate reads can sample values that were never part of the count. E.g. if we're always adding even values to a counter starting at 0, no thread should ever read an odd value, regardless of ordering.

I am a bit put off by not being able to point to a specific piece of text in the standard which says there can be no side effects on atomic variables other than those explicitly encoded in the program somehow. Lots of things mention side effects, but it seems to be taken for granted that the side effects are those specified by the source and not anything made up by the compiler. I don't have time to track this down right now, but there is a lot of stuff that would not work if this were not guaranteed, and part of the point of std::atomic is to get this constraint, as it is not guaranteed for other variables.

(It is somewhat provided by volatile, or at least is intended to be. Part of the reason we have this degree of specification for memory ordering around std::atomic is that volatile never became well enough specified to reason about in detail, and no one set of constraints met all needs.)

Zalman Stern
  • +1. I have edited the question to give a specific reference to the place in the linked paper. Good advice. Good answer. You say that reasoning can be fairly tricky. This is just what I am trying to do now: I am trying to learn the trick! – thb Nov 05 '17 at 15:47
  • That text is indeed describing things that can happen in the absence of the constraints of a concurrency model. This paper was part of a lead-up to specifying all this rather complex stuff in the standard. However, those specific possibilities do not seem to apply to `std::atomic` at all. I'll add a bit to my answer on how to think about "the semantics." – Zalman Stern Nov 05 '17 at 15:58
  • @Zalman: The reason that the Standard doesn't contain the statement you are looking for is that it isn't true. On some architectures, atomic variables are implemented using a sequence of sub-operations protected by a lock. Actually, that's true on nearly every architecture, the key differences are whether the lock is a hardware or software lock. The guarantee made by `std::atomic` is that C++ code cannot observe the intermediate values. If you need to prevent hardware from observing sub-operations, you need `volatile` (and operations which aren't implemented atomically in hardware will fail) – Ben Voigt Nov 05 '17 at 18:36
  • @BenVoigt. Um sorry but that is not what I'm talking about. There have to be some limitations on the behavior, even if guaranteed by a software lock. If the compiler implements all atomic accesses via a lock, then I only care about the values when the lock is not held. I see lots and lots of text about the ordering of side effects, but none about what the set of side effects is. I maintain that if one does atomic fetch_adds of 2 on one thread and atomic reads on the other, the reading thread should only ever see even values. If you believe that is not the case, I'd like to know why. – Zalman Stern Nov 05 '17 at 18:58
  • @BenVoigt we could also limit the discussion to types and alignments for which the `std::atomic` is lock free. I don't think it makes any difference if all accesses are via `std::atomic` to start with. – Zalman Stern Nov 05 '17 at 19:06
  • @ZalmanStern: I agree with you. A thread performing C++ atomic reads cannot see the intermediate values. But they might be detectable by other means (hardware DMA, bus sniffer, debugging software, inter-process memory access). If someone were performing memory-mapped I/O, or shared memory, and thinking of using `std::atomic` to do so, they might ask this exact question. So I think it is valuable to distinguish between "it can't happen" and "when all accesses are through `std::atomic` you can't see it happen" – Ben Voigt Nov 05 '17 at 19:21
  • @ZalmanStern: N4140 29.3.3 (3.3) might be what you're looking for. An atomic load will always see "the result of some modification of M", not some other made-up result that doesn't appear in the C++ abstract machine. (If there's a release-sequence, then there are requirements on *which* modification of M you're allowed to see, so the wording gets super complicated). – Peter Cordes Nov 05 '17 at 20:06