C/C++: relaxed std::atomic vs unlocked bool on X64 architecture

Question

Is there any efficiency benefit to using an unlocked boolean over using a std::atomic<bool> where the operations are always done with relaxed memory order? I would assume that both eventually compile to the same machine code, since a single byte is actually atomic on X64 hardware. Am I wrong?

"since a single byte is actually atomic in hardware" - that's not a given fact. — Jesper Juhl, Nov 11 '18 at 18:30
Not even on X64 architecture? (Note what I wrote in the title) — tohava, Nov 11 '18 at 18:32
@JesperJuhl: I doubt there are any architectures where a byte load or store isn't atomic. (Except rare ISAs like early DEC Alpha that don't *have* byte load/store instructions, only word. Or word-addressable DSPs. But on them, `bool` would be a word wide, not a byte.) — Peter Cordes, Nov 11 '18 at 19:21

Peter Cordes · Accepted Answer · 2018-11-12T18:47:57.330

Yes, there are potentially massive advantages, especially for local variables, or any variable used repeatedly in the same function. An atomic<> variable can't be optimized into a register.

If you compiled without optimization, the code-gen would be similar, but compiling with normal optimization enabled there can be massive differences. Un-optimized code is similar to making every variable volatile.

Current compilers also never combine multiple reads of an atomic variable into one, as if you'd used volatile atomic<T>, because that's what people expect and the dust hasn't settled yet on how to allow useful optimizations while prohibiting ones you don't want. (See "Why don't compilers merge redundant std::atomic writes?" and "Can and does the compiler optimize out two atomic loads?")
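As a small illustration (the names shared_counter and two_loads are made up for this sketch), current compilers typically emit two separate loads for the function below, even though reading the same value twice would be a legal outcome:

```cpp
#include <atomic>

// Hypothetical shared variable; the name is assumed for illustration.
std::atomic<int> shared_counter{0};

int two_loads() {
    // Two relaxed loads of the same object. A compiler would be allowed to
    // merge them, but current compilers perform both.
    int a = shared_counter.load(std::memory_order_relaxed);
    int b = shared_counter.load(std::memory_order_relaxed);
    return a + b;
}
```

With a plain int instead of std::atomic<int>, the second load would normally be folded into the first.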

This isn't a great example, but imagine that checking the boolean is done inside an inlined function, and that there's something else inside the loop. (Otherwise you'd put the if around the loop like a normal person.)

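#include <atomic>

std::atomic<bool> atomic_bool;   // global flag, possibly written by another thread

int sumarr_atomic(int arr[]) {
    int sum = 0;
    for (int i = 0; i < 10000; i++) {
        // relaxed load, but still performed on every iteration
        if (atomic_bool.load(std::memory_order_relaxed)) {
            sum += arr[i];
        }
    }
    return sum;
}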

See the asm output on Godbolt.

But with a non-atomic bool, the compiler can make that transformation for you by hoisting the load, and then auto-vectorize the simple sum loop (or not run it at all).

With atomic_bool, it can't. With atomic_bool, the asm loop is much like the C++ source, actually doing a test and branch on the value of the variable inside every loop iteration. And this of course defeats auto-vectorization.
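For comparison, here is a minimal sketch of the non-atomic counterpart, plus a hand-written version of roughly what the optimizer is allowed to turn it into. The names regular_bool, sumarr_regular and sumarr_hoisted are assumed for illustration and are not taken from the Godbolt link.

```cpp
// Hypothetical plain (non-atomic) global flag.
bool regular_bool = true;

// Same loop as the atomic version, but reading a plain bool.
int sumarr_regular(const int arr[]) {
    int sum = 0;
    for (int i = 0; i < 10000; i++) {
        if (regular_bool) {
            sum += arr[i];
        }
    }
    return sum;
}

// Roughly the shape the compiler may produce on its own: the flag is read
// once before the loop, leaving a simple sum loop that can auto-vectorize.
int sumarr_hoisted(const int arr[]) {
    if (!regular_bool) {
        return 0;   // loop not run at all
    }
    int sum = 0;
    for (int i = 0; i < 10000; i++) {
        sum += arr[i];
    }
    return sum;
}
```

Both functions return the same result in a single-threaded program; the point is only that the compiler is free to pick the second form for a plain bool.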

(The C++ as-if rule would allow the compiler to hoist the load: it's relaxed, so it can reorder with non-atomic accesses. It would also allow merging the loads, because reading the same value every time is one possible result of a global order of operations. But as I said, compilers don't do that.)

Loops over an array of bool can auto-vectorize, but not over atomic<bool> [].
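A sketch of the kind of loop meant here, with hypothetical function names count_regular and count_atomic. The first is a plain reduction that an optimizing compiler may vectorize; in the second, every element is its own relaxed atomic load, which current compilers keep as separate scalar loads.

```cpp
#include <atomic>
#include <cstddef>

// Count the set flags in a plain bool array.
int count_regular(const bool flags[], std::size_t n) {
    int count = 0;
    for (std::size_t i = 0; i < n; i++) {
        count += flags[i];
    }
    return count;
}

// Same computation over an array of atomics, one relaxed load per element.
int count_atomic(const std::atomic<bool> flags[], std::size_t n) {
    int count = 0;
    for (std::size_t i = 0; i < n; i++) {
        count += flags[i].load(std::memory_order_relaxed);
    }
    return count;
}
```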

Also, inverting a boolean with something like b ^= 1; or b = !b; can be just a regular RMW, not an atomic RMW, so it doesn't have to use lock xor or lock btc. (On x86, an atomic RMW always comes with sequential consistency with respect to runtime reordering, because the lock prefix is also a full memory barrier.)
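A minimal sketch of the two kinds of toggle. One caveat that is an assumption of this sketch rather than something from the answer: std::atomic<bool> has no fetch_xor member, so the atomic side uses std::atomic<unsigned char> to hold the 0/1 value. The names regular_flag and atomic_byte are hypothetical.

```cpp
#include <atomic>

bool regular_flag = false;                  // plain flag
std::atomic<unsigned char> atomic_byte{0};  // 0/1 value, stand-in for an atomic flag

// Regular read-modify-write: a plain load, xor and store, which the
// compiler is free to keep in a register or combine with other work.
void toggle_regular() {
    regular_flag ^= 1;
}

// Atomic read-modify-write: even with relaxed ordering, on x86-64 this
// typically becomes a lock-prefixed instruction.
void toggle_atomic() {
    atomic_byte.fetch_xor(1, std::memory_order_relaxed);
}
```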

Code that modifies a non-atomic boolean can optimize away the actual modifications, e.g.

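bool regular_bool;   // plain global, no atomicity

void loop() {
    for (int i = 0; i < 10000; i++) {
        regular_bool ^= 1;   // flip; the compiler may keep the value in a register
    }
}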

compiles to asm that keeps regular_bool in a register. Unfortunately it doesn't optimize away to nothing (which it could because flipping a boolean an even number of times sets it back to its original value). But it could with a smarter compiler.

loop():
    movzx   edx, BYTE PTR regular_bool[rip]   # load into a register
    mov     eax, 10000
.L17:                     # do {
    xor     edx, 1          # flip the boolean
    sub     eax, 1
    jne     .L17          # } while(--i);
    mov     BYTE PTR regular_bool[rip], dl    # store back the result
    ret

Even if written as atomic_b.store( !atomic_b.load(mo_relaxed), mo_relaxed) (separate atomic loads/stores), you'd still get a store/reload in the loop, creating a 6-cycle loop-carried dependency chain through the store/reload (on Intel CPUs with 5-cycle store-forwarding latency) instead of a 1-cycle dep chain through a register.
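Written out in full, the separate load/store variant described above might look like the following sketch. The name loop_atomic and the initial value of atomic_b are assumptions; mo_relaxed in the text is shorthand for std::memory_order_relaxed.

```cpp
#include <atomic>

std::atomic<bool> atomic_b{false};  // initial value assumed for illustration

void loop_atomic() {
    for (int i = 0; i < 10000; i++) {
        // Separate relaxed load and relaxed store. This is not an atomic RMW
        // (another thread's write could be lost in between), but each
        // iteration still stores to memory and the next one reloads it.
        atomic_b.store(!atomic_b.load(std::memory_order_relaxed),
                       std::memory_order_relaxed);
    }
}
```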

Paul Sanders · Answer 2 · 2018-11-11T18:39:56.910

Checking over at Godbolt, loading a regular bool and a std::atomic<bool> generate different code, although not because of synchronisation issues. Instead, the compiler (gcc) seems unwilling to assume that a std::atomic<bool> is guaranteed to be either 0 or 1. Strange, that.

Clang does the same thing, although the code generated is slightly different in detail.

edited Nov 11 '18 at 18:39

answered Nov 11 '18 at 18:36

Using `cout <<` clutters the code a lot. https://godbolt.org/z/hFEQ5f is easier to read with separate functions that return the value of the global, like `bool load_regular() { return regular_bool; }` that compiles to a single movzx. (And the atomic version still booleanizes for no apparent reason.) – Peter Cordes Nov 11 '18 at 18:39
@Peter I did it that way to stop the compiler optimising out the loads. Although I see from your example that moving the load into a separate function generates better code. – Paul Sanders Nov 11 '18 at 18:40
Yeah I know, and my point is that returning a value from a function instead of writing a `main` solves the same problem much more cleanly. See [How to remove "noise" from GCC/clang assembly output?](https://stackoverflow.com/q/38552116). Remember you're just writing code so you can look at the asm, not run it. – Peter Cordes Nov 11 '18 at 18:42
@Peter Ah, I see you never bother to call the functions so that gcc cannot inline them or optimise them away. A useful trick that. – Paul Sanders Nov 11 '18 at 18:44
Even if you did write callers, you can still look at the stand-alone definition *as well*, if you don't make them `static` or `inline`. – Peter Cordes Nov 11 '18 at 18:47
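A minimal sketch of the approach discussed in these comments, assuming two globals named regular_bool and atomic_bool (the exact code behind the Godbolt link is not reproduced here). Each function only returns the global, so its stand-alone definition can be inspected in the asm output without writing a main.

```cpp
#include <atomic>

bool regular_bool = false;              // plain global
std::atomic<bool> atomic_bool{false};   // atomic global

// Neither function is static or inline, so the compiler has to emit a
// stand-alone definition for each, whether or not anything calls it.
bool load_regular() {
    return regular_bool;
}

bool load_atomic() {
    return atomic_bool.load(std::memory_order_relaxed);
}
```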
