
I understand that assignment may not be atomic in C++. I am trying to trigger a race condition to show this.

However, my code below does not seem to trigger any such condition. How can I change it so that it will eventually trigger a race condition?

#include <cstdint>
#include <iostream>
#include <thread>

volatile uint64_t sharedValue = 1;
const uint64_t value1 = 13;
const uint64_t value2 = 1414;

void write() {
    bool even = true;
    for (;;) {
        uint64_t value;
        if (even)
            value = value1;
        else
            value = value2;
        sharedValue = value;
        even = !even;
    }
}

void read() {
    for (;;) {
        uint64_t value = sharedValue;
        if (value != value1 && value != value2) {
            std::cout << "Race condition! Value: " << value << std::endl << std::flush;
        }
    }
}

int main()
{
    std::thread t1(write);
    std::thread t2(read);
    t1.join();
} 

I am using VS 2017 and compiling in Release x86.

Here is the disassembly of the assignment:

sharedValue = value;
00D54AF2  mov         eax,dword ptr [ebp-18h]  
00D54AF5  mov         dword ptr [sharedValue (0D5F000h)],eax  
00D54AFA  mov         ecx,dword ptr [ebp-14h]  
00D54AFD  mov         dword ptr ds:[0D5F004h],ecx 

I guess this means that the assignment is not atomic? It seems 32 of the bits are copied into the general-purpose 32-bit register eax and the other 32 bits into another general-purpose 32-bit register ecx before being copied into sharedValue, which is in the data segment?

I also tried with uint32_t and all data was copied in one go. So I guess that on x86 there is no need to use std::atomic for 32 bit data types?

Peter Cordes
Andy
  • Add some delay before `sharedValue = value;` statement. Anyway, it is UB to read and write to `sharedValue` without any synchronization. – ks1322 Feb 29 '20 at 11:11
  • There is no guarantee that code which has a race condition will trigger symptoms of that race condition. You can try adding small delays (even small random delays) into your loops in one or both threads, but that only increases likelihood of seeing symptoms - it doesn't give a guarantee. Depending on the target machine, machine instructions for copying integral values may actually be atomic - there is no requirement that an implementation not use atomic operations in unsynchronised code. – Peter Feb 29 '20 at 13:19
  • @Peter: does the assembly code I have posted not show that in my case the assignment is not atomic? – Andy Feb 29 '20 at 13:28
  • @Peter: BTW thank you for the suggestion to use random delays. I am trying it out now. I understand that it may take a long time before an error occurs. I tried the same with the += operator however and this caused a race condition in about 10 secs. – Andy Feb 29 '20 at 13:40
  • Given that you have x86, and 32-bit writes are atomic on x86, and both values which can be written have zero for their higher 32 bits, there is no way you will detect a race even if it does happen - taking the high 32 bits of one and the low 32 bits of the other will still give a number equal to value1 or value2. Change value1 or value2 so they differ in both high and low 32 bits. – Pete Kirkham Feb 29 '20 at 16:07
  • @Peter: Delays are not useful here. Stores will commit from the store buffer into coherent L1d cache as fast as the HW can do it, and having the writer spinning on that while the reader spams requests to share that cache line will result in seeing lots of tearing quickly (once you choose values that differ in both halves) – Peter Cordes Mar 02 '20 at 04:06

4 Answers


Some answers/comments suggested sleeping in the writer. This is not useful; hammering away on the cache line changing it as often as possible is what you want. (And what you get with volatile assignments and reads.) An assignment will be torn when a MESI share request for the cache line arrives at the writer core between committing two halves of a store from the store buffer to L1d cache.

If you sleep, you're waiting a long time without creating a window for that to happen. Sleeping between the halves would make tearing even easier to detect, but you can't do that unless you use separate memcpy calls (or similar) to write the two halves of the 64-bit integer.

Tearing between reads in the reader is also possible even if writes are atomic. This may be less likely, but still happens plenty in practice. Modern x86 CPUs can execute two loads per clock cycle (Intel since Sandybridge, AMD since K8). I tested with atomic 64-bit stores but split 32-bit loads on Skylake and tearing is still frequent enough to spew lines of text in a terminal. So the CPU didn't manage to run everything in lock-step with corresponding pairs of reads always executing in the same clock cycle. So there is a window for the reader getting its cache line invalidated between a pair of loads. (However, all the pending cache-miss loads while the cache line is owned by the writer core probably complete all at once when the cache line arrives. And the total number of load buffers available is an even number in existing microarchitectures.)


As you discovered, your test values both had the same upper half of 0, so this made it impossible to observe any tearing; only the 32-bit aligned low half was ever changing, and was changing atomically because your compiler guarantees at least 4-byte alignment for uint64_t, and x86 guarantees that 4-byte aligned loads/stores are atomic.

0 and -1ULL are the obvious choices. I used the same thing in a test-case for [this GCC C11 `_Atomic` bug](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65146#c4) for a 64-bit struct.

For your case, I'd do this. read() and write() are POSIX system call names so I picked something else.

#include <cstdint>
#include <iostream>
volatile uint64_t sharedValue = 0;  // initializer = one of the 2 values!

void writer() {
    for (;;) {
        sharedValue = 0;
        sharedValue = -1ULL;  // unrolling is vastly simpler than an if
    }
}

void reader() {
    for (;;) {
        uint64_t val = sharedValue;
        uint32_t low = val, high = val>>32;
        if (low != high) {
            std::cout << "Tearing! Value: " << std::hex << val << '\n';
        }
    }
}

MSVC 19.24 -O2 compiles the writer to use a movlpd 64-bit store for the = 0, but two separate 32-bit stores of -1 for the = -1. (And the reader to two separate 32-bit loads.) GCC uses a total of four mov dword ptr [mem], imm32 stores in the writer, like you'd expect. (Godbolt compiler explorer)

Terminology: it's always a race condition (even with atomicity you don't know which of the two values you're going to get). With std::atomic<> you'd only have that garden-variety race condition, no Undefined Behaviour.

The question is whether you actually see tearing from the data race Undefined Behaviour on the volatile object, on a specific C++ implementation / set of compile options, for a specific platform. Data race UB is a technical term with a more specific meaning than "race condition". I changed the error message to report the one symptom we check for. Note that data-race UB on a non-volatile object can have way weirder effects, like hoisting the load or store out of loops, or even inventing extra reads leading to code that thinks one read was both true and false at the same time. (https://lwn.net/Articles/793253/)

I removed 2 redundant cout flushes: one from std::endl and one from std::flush. cout is line-buffered by default, or fully buffered if writing to a file, which is fine. And '\n' is just as portable as std::endl as far as DOS line endings are concerned; text vs. binary stream mode handles that. endl is still just \n.

I simplified your check for tearing by checking that high_half == low_half. Then the compiler just has to emit one cmp/jcc instead of two extended-precision compares to see if the value is either 0 or -1 exactly. We know that there's no plausible way for false-negatives like high = low = 0xff00ff00 to happen on x86 (or any other mainstream ISA with any sane compiler).


So I guess that on x86 there is no need to use std::atomic for 32 bit data types?

Incorrect.

Hand-rolled atomics with volatile int can't give you atomic RMW operations (without inline asm or special functions like Windows InterlockedIncrement or GNU C builtin __atomic_fetch_add), and can't give you any ordering guarantees wrt. other code. (Release / acquire semantics)

When to use volatile with multi threading? - pretty much never.

Rolling your own atomics with volatile is still possible and de-facto supported by many mainstream compilers (e.g. the Linux kernel still does that, along with inline asm). Real-world compilers do effectively define the behaviour of data races on volatile objects. But it's generally a bad idea when there's a portable and guaranteed-safe way. Just use std::atomic<T> with std::memory_order_relaxed to get asm that's just as efficient as what you could get with volatile (for the cases where volatile works), but with guarantees of safety and correctness from the ISO C++ standard.

atomic<T> also lets you ask the implementation whether a given type can be cheaply atomic or not, with C++17 std::atomic<T>::is_always_lock_free or the older member function. (In practice, C++11 implementations decided not to let only some instances of a given atomic type be lock-free based on alignment or something; instead they just give atomic<T> the required alignas if there is one. So C++17 added a per-type constant instead of the per-object member function as the way to check lock-freedom.)

std::atomic can also give cheap lock-free atomicity for types wider than a normal register. e.g. on ARM, using ARMv6 strd / ldrd to store/load a pair of registers.

On 32-bit x86, a good compiler can implement std::atomic<uint64_t> by using SSE2 movq to do atomic 64-bit loads and stores, without falling back to the non-lock_free mechanism (a table of locks). In practice, GCC and clang9 do use movq for atomic<uint64_t> load/store. clang8.0 and earlier uses lock cmpxchg8b unfortunately. MSVC uses lock cmpxchg8b in an even more inefficient way. Change the definition of sharedVariable in the Godbolt link to see it. (Or if you use one each of default seq_cst and memory_order_relaxed stores in the loop, MSVC for some reason calls a ?store@?$_Atomic_storage@_K$07@std@@QAEX_KW4memory_order@2@@Z helper function for one of them. But when both stores are the same ordering, it inlines lock cmpxchg8b with much clunkier loops than clang8.0) Note that this inefficient MSVC code-gen is for a case where volatile wasn't atomic; in cases where it is, atomic<T> with mo_relaxed compiles nicely, too.

You generally can't get that wide-atomic code-gen from volatile. Although GCC does actually use movq for your if() bool write function (see the earlier Godbolt compiler explorer link) because it can't see through the alternating or something. It also depends on what values you use. With 0 and -1 it uses separate 32-bit stores, but with 0 and 0x0f0f0f0f0f0f0f0fULL you get movq for a usable pattern. (I used this to verify that you can still get tearing from just the read side, instead of hand-writing some asm.) My simple unrolled version compiles to just use plain mov dword [mem], imm32 stores with GCC. This is a good example of there being zero guarantee of how volatile really compiles in this level of detail.

atomic<uint64_t> will also guarantee 8-byte alignment for the atomic object, even if plain uint64_t might have only been 4-byte aligned.


In ISO C++, a data race on a volatile object is still Undefined Behaviour. (Except for volatile sig_atomic_t racing with a signal handler.)

A "data race" is any time two unsynchronized accesses happen and they're not both reads. ISO C++ allows for the possibility of running on machines with hardware race detection or something; in practice no mainstream systems do that so the result is just tearing if the volatile object isn't "naturally atomic".

ISO C++ also in theory allows for running on machines that don't have coherent shared memory and require manual flushes after atomic stores, but that's not really plausible in practice. No real-world implementations are like that, AFAIK. Systems with cores that have non-coherent shared memory (like some ARM SoCs with DSP cores + microcontroller cores) don't start std::thread across those cores.

See also Why is integer assignment on a naturally aligned variable atomic on x86?

It's still UB even if you don't observe tearing in practice, although as I said real compilers do de-facto define the behaviour of volatile.


Skylake experiments to try to detect store-buffer coalescing

I wondered if store coalescing in the store buffer could maybe create an atomic 64-bit commit to L1d cache out of two separate 32-bit stores. (No useful results so far, leaving this here in case anyone's interested or wants to build on it.)

I used a GNU C __atomic builtin for the reader, so if the stores also ended up being atomic we'd see no tearing.

void reader() {
    for (;;) {
        uint64_t val = __atomic_load_n(&sharedValue, __ATOMIC_ACQUIRE);
        uint32_t low = val, high = val>>32;
        if (low != high) {
            std::cout << "Tearing! Value: " << std::hex << val << '\n';
        }
    }
}

This was one attempt to get the microarchitecture to group the stores.

void writer() {
    volatile int separator;  // in a different cache line, has to commit separately
    for (;;) {
        sharedValue = 0;

        _mm_mfence();
        separator = 1234;
        _mm_mfence();
        sharedValue = -1ULL;  // unrolling is vastly simpler than an if

        _mm_mfence();
        separator = 1234;
        _mm_mfence();
    }
}

I still see tearing with this. (mfence on Skylake with updated microcode is like lfence, and blocks out-of-order exec as well as draining the store buffer. So later stores shouldn't even enter the store buffer before the earlier ones leave. That might actually be a problem, because we need time for merging, not just committing a 32-bit store as soon as it "graduates" when the store uops retire.)

Probably I should try to measure the rate of tearing and see if it's less frequent with anything, because any tearing at all is enough to spam a terminal window with text on a 4GHz machine.

Peter Cordes

Grab the disassembly and then check the documentation for your architecture; on some machines you'll find even standard "non-atomic" operations (in terms of C++) are actually atomic when it hits the hardware (in terms of assembly).

With that said, your compiler will know what is and isn't safe and it is therefore a better idea to use the std::atomic template to make your code more portable across architectures. If you're on a platform that doesn't require anything special, it'll typically get optimised down to a primitive type anyway (putting memory ordering aside).

I don't remember the details of x86 operations off hand, but I would guess that you have a data race if the 64-bit integer is written in 32-bit "chunks" (or less); it is possible to get a torn read in that case.

There are also tools like ThreadSanitizer that can catch it in the act. I don't believe it's supported on Windows with MSVC, but if you can get GCC or Clang working then you might have some luck there. If your code is portable (it looks like it is), then you can run it on a Linux system (or VM) using these tools.

OMGtechy
  • Thank you! I have included the disassembly code and it seems that 32 bit variables are atomic whereas 64 bit variables are not since x86 uses 32 bit registers? I understand that it is anyway better to use atomic since it makes the code more portable. I have also measured this to have zero overhead. In fact it was faster than not atomic!? Mutexes on the other hand was slow. – Andy Feb 29 '20 at 12:16
  • @Andy note that 1 instruction is not the same as atomic (although I believe it may still be for a 32 bit int on x86...). Mutexes will indeed be a LOT slower! It's worth having a read about how they typically work under the hood :) – OMGtechy Feb 29 '20 at 21:15
  • @Andy Portability is obtained by the use of the proper inter thread communication primitive, not volatile tricks. – curiousguy Mar 02 '20 at 03:24

I changed the code to:

volatile uint64_t sharedValue = 0;
const uint64_t value1 = 0;
const uint64_t value2 = ULLONG_MAX;

and now the code triggers the race condition in less than a second. The problem was that both 13 and 1414 have their 32 most significant bits equal to 0.

13=0xd
1414=0x586
0=0x0
ULLONG_MAX=0xffffffffffffffff
Andy
  • Yup, this is the answer I was going to post. 0 and -1ULL are what I used as a test-case for [this GCC C11 _Atomic bug](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65146#c4). You can check for tearing by checking that high_half == low_half of e.g. a 64-bit struct of 2 int members, or `unsigned high = val >> 32`. Then the compiler just has to emit one cmp/jcc instead of two extended-precision compares to see if the value is either 0 or -1 exactly. – Peter Cordes Mar 02 '20 at 03:25
  • *in less than a second.* Were you testing with a sleep in the writer? It triggers basically instantly for me, on i7-6700k Skylake on Arch Linux. (OS shouldn't matter; it's not involved once the threads are started and running on separate cores.) – Peter Cordes Mar 02 '20 at 18:09

First, your code has a data race when reading and writing the sharedValue variable without any synchronization, which is Undefined Behavior in C++. This can be fixed by making sharedValue an atomic variable:

std::atomic<uint64_t> sharedValue{1};

You can trigger a logical race condition by adding an artificial delay before writing to sharedValue (the "Race condition! Value: ..." message will then be printed in the reader thread). You can use std::this_thread::sleep_for for this:

void write() {
    bool even = true;
    for (;;) {
        uint64_t value;
        if (even)
            value = value1;
        else
            value = value2;

        using namespace std::chrono_literals;
        std::this_thread::sleep_for(1ms);

        sharedValue = value;
        even = !even;
    }
}
ks1322
  • Note that a volatile variable has different semantics: it must be accessed according to the ABI. No arch has any prohibition on data races so the code is OK on all existing arch. – curiousguy Mar 02 '20 at 02:59
  • Of course it has data race UB; demonstrating one practical effect of that UB (tearing) is the point of the question. (But yes, the phrasing of the question does seem slightly confused.) In practice `volatile` was (and sometimes still is) used for hand-rolled atomics, but only works for types small enough to be "naturally" atomic. This is de-facto supported by several real compilers, even though `volatile` in ISO C++ is still subject to data-race UB. [When to use volatile with multi threading?](https://stackoverflow.com/a/58535118) (never, use atomic with mo_relaxed to get the same asm) – Peter Cordes Mar 02 '20 at 03:09
  • Also, sleeping is the opposite of useful. Had to downvote this because reducing the frequency of changing the shared value by a factor of thousands to millions is not helpful. (1 per 1ms vs. multiple per nanosecond without contention of a reader, or maybe one per 100 ns even if every store has to wait for 1 share request and one RFO on a typical x86 where that's maybe 40 to 60 ns between separate physical cores.) – Peter Cordes Mar 02 '20 at 04:00
  • @PeterCordes, there were several race conditions in original code. The first one can be fixed by using `std::atomic` which I mentioned in the answer. But the second one stays unfixed. The sleep delay makes the second race condition reproducible, provided that the first one is fixed. – ks1322 Mar 02 '20 at 14:32
  • The only data-race UB I see in the OP's code is the unsynchronized read + write of the `volatile` sharedValue. So although it's UB in ISO C++, the behaviour for a specific platform like x86 MSVC is somewhat defined by that implementation. Using `atomic` would still leave a garden variety "race condition" (not UB) - you simply won't know which of the two possible values you're going to get. But you're guaranteed to get one or the other, and that there's no UB. – Peter Cordes Mar 02 '20 at 14:39
  • Yes, this is what I am talking about. I answered how to trigger a logical (not UB) race condition in UB free code (after using `atomic` to remove UB). – ks1322 Mar 02 '20 at 14:46
  • Ah, ok. Sleeping doesn't help to show that, though. BTW, I did just notice another race condition in the OP's code: the static initializer for `sharedVariable` isn't one of the two values that the reader checks for! So if the reader's first read beats the writer, it will report tearing :P I updated my answer to make the point about terminology (garden variety race condition vs. data-race UB) more explicitly. – Peter Cordes Mar 02 '20 at 16:10