
I have been stepping through the function calls that are involved when I assign to an atomic_long type in a 64-bit project on VS2017. I specifically wanted to see what happens when I copy an atomic_long into a non-atomic variable, and whether there is any locking around it.

atomic_long ll = 10;
long t2 = ll;
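
For reference, a self-contained version of what I'm stepping through, reduced to just the load (my own minimal reduction, not the exact project code):

#include <atomic>

std::atomic_long ll{10};    // on Windows, long is 32 bits, hence the "_4" in _Load_seq_cst_4

long copy_out()
{
    return ll;              // plain read of an atomic: a sequentially consistent load
}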

Ultimately it ends up with this call (I've removed some code that was #ifdef'd out):

inline _Uint4_t _Load_seq_cst_4(volatile _Uint4_t *_Tgt)
    {   /* load from *_Tgt atomically with
            sequentially consistent memory order */
    _Uint4_t _Value;

    _Value = *_Tgt;
    _Compiler_barrier();

    return (_Value);
    }

Now, I've read on MSDN that a plain read of a 32-bit value will be atomic:

Simple reads and writes to properly-aligned 32-bit variables are atomic operations.

...which explains why there is no Interlocked function for just reading; only those for changing/comparing. What I'd like to know is what the _Compiler_barrier() bit is doing. This is #defined as

__MACHINE(void _ReadWriteBarrier(void))

...and I've found on MSDN again that this

Limits the compiler optimizations that can reorder memory accesses across the point of the call.

But I don't get this, as there are no other memory accesses apart from the return statement; surely the compiler wouldn't move the assignment below the return, would it?
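
For what it's worth, my understanding is that a compiler-only barrier like this can also be spelled portably with std::atomic_signal_fence, so the whole function would be roughly equivalent to this (my own approximation, not the actual CRT code):

#include <atomic>

// Approximation: a plain load plus a compile-time-only fence.
// std::atomic_signal_fence emits no CPU instruction; it only stops the
// compiler from reordering memory accesses across this point.
inline unsigned load_seq_cst_approx(volatile unsigned *tgt)
{
    unsigned value = *tgt;
    std::atomic_signal_fence(std::memory_order_seq_cst);
    return value;
}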

Can someone please clarify the purpose of this barrier?

  • maybe it could try to `return (*_Tgt);`. Just a guess. – Dan M. Apr 23 '18 at 17:27
  • You underestimate what the optimizer is allowed to do. It most certainly is allowed to reorder memory accesses and routinely does so, if that does not alter the observable behavior of the program. From the point of view of a single thread. The barrier ensures that it is forbidden from doing so from the point of view of another thread. – Hans Passant Apr 23 '18 at 21:18

1 Answer


_Load_seq_cst_4 is an inline function. The compiler barrier is there to block reordering with later code in the calling function it inlines into.

For example, consider reading a SeqLock. (Over-simplified from this actual implementation).

#include <atomic>
std::atomic<unsigned> sequence;
std::atomic_long value;

long seqlock_try_read() {
    // this would normally be the body of a retry-loop;
    unsigned seq1 = sequence;
    long tmpval = value;
    unsigned seq2 = sequence;

    if (seq1 == seq2 && (seq1 & 1) == 0)
        return tmpval;
    else
        // writer was modifying it, we should retry the loop
        return seqlock_try_read();
}

If we didn't block compile-time reordering, the compiler could merge both reads of sequence into a single access, perhaps like this:

    long tmpval = value;
    unsigned seq1 = sequence;
    unsigned seq2 = sequence;

This would defeat the locking mechanism (where the writer increments sequence once before modifying the data, then again when it's done). Readers are entirely lockless, but it's not a "lock-free" algo because if the writer gets stuck mid-update, the readers can't read anything.
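
For context, the writer side of that over-simplified version would look something like this (single writer assumed; a real implementation would use weaker orderings, and a mutex if there can be multiple writers):

void seqlock_write(long newval) {
    unsigned seq = sequence;     // even when no write is in progress
    sequence = seq + 1;          // odd: tells readers an update is in progress
    value = newval;              // modify the protected data
    sequence = seq + 2;          // even again: update finished
}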

The barrier within each load function blocks reordering with other things after inlining.

(The C++11 memory model is very weak, but the x86 memory model is strong, only allowing StoreLoad reordering. Blocking compile-time reordering with later loads/stores is sufficient to give you an acquire / sequential-consistency load at runtime. x86: Are memory barriers needed here?)
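
To illustrate that point (the asm in the comments is typical x86 code-gen, not something any particular compiler guarantees): the load side of seq_cst is cheap on x86, while the store side is the one that has to pay for blocking StoreLoad reordering.

#include <atomic>
std::atomic<int> x;

int load_x() {
    // typically compiles to a plain  mov eax, DWORD PTR [x]
    // plus a compile-time barrier; no fence instruction needed
    return x.load(std::memory_order_seq_cst);
}

void store_x(int v) {
    // the store side pays for seq_cst on x86:
    // typically  xchg DWORD PTR [x], ecx  (or mov + mfence)
    x.store(v, std::memory_order_seq_cst);
}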


BTW, a better example might be something where some non-atomic variables are read/written after seeing a certain value in an atomic flag. MSVC probably already avoids reordering or merging of atomic accesses, and in the seqlock the data being protected also has to be atomic.
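
A sketch of that kind of flag-publishing pattern (names made up), where blocking compile-time reordering around the atomic accesses is what keeps the non-atomic payload correct:

#include <atomic>

std::atomic<bool> ready{false};
int payload;                         // deliberately non-atomic

void producer() {
    payload = 42;                    // plain store
    ready = true;                    // seq_cst (release would do) store publishes it
}

int consumer() {
    while (!ready) { }               // seq_cst (acquire would do) load
    return payload;                  // the barrier in the atomic load stops the
                                     // compiler from hoisting this read above it
}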

Why don't compilers merge redundant std::atomic writes?

  • Thanks Peter. I'm not quite clear, I see that `_Load_seq_cst_4` is inline, but are you saying that `_ReadWriteBarrier()` is just an inline function (that does nothing?). Earlier today I read about that technique - https://blogs.oracle.com/d/compiler-memory-barriers - a function call prevents code re-ordering. Is that all that is happening here? If so, as the linked article shows, a function call is a mighty expensive way of preventing re-ordering...or have I misunderstood you? – Wad Apr 23 '18 at 20:38
  • @Wad: No, I'm saying that `_Load_seq_cst_4` itself is an inline function, so you have to worry about reordering of its operations with other stuff in its parent. Updated the answer with an example. – Peter Cordes Apr 23 '18 at 20:58
  • Right, thanks for coming back Peter. I've read and re-read your comment several times. Can you just clarify; if this code is inlined, that means the return value could be directly written to a local variable. Thus, without the barrier, we could end up with code that looks like `some_local_variable = _Value; _Value = *_Tgt;` **which is assigning the local variable from `_Value` before `_Value` has been updated, correct?** – Wad Apr 24 '18 at 15:28
  • @Wad: no, that compile-time reordering wouldn't be equivalent to the source as written, so the "as-if" rule doesn't allow it. http://preshing.com/20120625/memory-ordering-at-compile-time/ – Peter Cordes Apr 24 '18 at 17:30
  • OK. I am actually familiar with the preshing website and have referred to it on many occasions. Can you give an example of compiler-reordering in the context of `_Load_seq_cst_4` that could occur if the compiler barrier wasn't there, please? – Wad Apr 24 '18 at 17:35
  • @Wad: I already did in my answer. Edited to make it more explicit. – Peter Cordes Apr 24 '18 at 18:06
  • Thanks, I understand now. One more question - about `_Compiler_barrier()`; does this ensure that all other CPUs will see **the new values** of any values that might be shared (ie synchronize them) or is that a job for `_Memory_barrier()`? Do you know? – Wad Apr 25 '18 at 09:26
  • I tried to answer this myself, by stepping into `_Store_seq_cst_4()` to see if there was some sort of barrier there; there wasn't on my platform, just an `_InterlockedExchange()` call. Can you tell me if the `Interlocked()` family of functions synchronize access, that is, all other CPUs will see the updated value after it has been updated and not act on an old, stale value (ie their probe queues will be updated to say that the value they hold is now stale)? – Wad Apr 25 '18 at 09:29