
Using a simplified version of a basic seqlock, gcc reorders a non-atomic load up across an atomic load(memory_order_seq_cst) when compiling the code with -O3. This reordering isn't observed when compiling at other optimization levels or when compiling with clang (even at -O3). The reordering seems to violate a synchronizes-with relationship that should be established, and I'm curious why gcc reorders this particular load and whether this is even allowed by the standard.

Consider the following load function:

auto load()
{
    std::size_t copy;
    std::size_t seq0 = 0, seq1 = 0;
    do
    {
        seq0 = seq_.load();   // seq_ is a std::atomic<std::size_t>; memory_order_seq_cst by default
        copy = value;         // non-atomic read of the protected data
        seq1 = seq_.load();
    } while (seq0 & 1 || seq0 != seq1);   // retry while odd (writer active) or changed between the loads

    std::cout << "Observed: " << seq0 << '\n';
    return copy;
}

Following the seqlock procedure, this reader spins until it can load two instances of seq_ (defined as a std::atomic<std::size_t>) that are even (indicating that a writer is not currently writing) and equal (indicating that a writer has not written to value in between the two loads of seq_). Furthermore, because these loads are tagged with memory_order_seq_cst (as the default argument), I would expect the instruction copy = value; to be executed on each iteration, since it cannot be reordered up across the initial load, nor can it be reordered down below the latter one.

However, the generated assembly issues the load from value before the first load from seq_, and the load is even hoisted outside the loop. This could lead to improper synchronization or torn reads of value that never get resolved by the seqlock algorithm. Additionally, I've noticed that this only occurs when sizeof(value) is below 123 bytes. Making value a type of 123 bytes or more yields the correct assembly: it is loaded on each loop iteration, in between the two loads of seq_. Is there any reason why this seemingly arbitrary threshold dictates which assembly is generated?

This test harness exposes the behavior on my Xeon E3-1505M: "Observed: 2" is printed by the reader and the value 65535 is returned. This combination of the observed value of seq_ and the returned load of value seems to violate the synchronizes-with relationship that should be established between the writer thread publishing seq_.store(2) with memory_order_release and the reader thread reading seq_ with memory_order_seq_cst.
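For reference, the writer follows the usual seqlock store sequence. This is a minimal sketch rather than the exact harness linked above; the only detail taken from the description above is that the final even value is published with a release store:

void store(std::size_t desired)
{
    std::size_t seq0 = seq_.load(std::memory_order_relaxed);
    seq_.store(seq0 + 1, std::memory_order_relaxed);      // odd: a write is in progress
    std::atomic_thread_fence(std::memory_order_release);  // keeps the payload write below the odd store
    value = desired;                                       // non-atomic write of the payload
    seq_.store(seq0 + 2, std::memory_order_release);       // even again: publishes, e.g. seq_.store(2)
}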

Is it valid for gcc to reorder the load, and if so, why does it only do so when sizeof(value) is < 123? clang, no matter the optimization level or the sizeof(value), will not reorder the load. Clang's codegen is, I believe, the appropriate and correct approach.

Alejandro

2 Answers


Congratulations, I think you've hit a bug in gcc!

Now, I think you can make a reasonable argument, as the other answer does, that the original code you showed could perhaps have been correctly optimized that way by gcc, relying on a fairly obscure argument about the unconditional access to value: essentially, you can't have been relying on a synchronizes-with relationship between the load seq0 = seq_.load(); and the subsequent read of value, so reading it "somewhere else" shouldn't change the semantics of a race-free program. I'm not actually sure of that argument, but here's a "simpler" case I got by reducing your code:

#include <atomic>
#include <iostream>

std::atomic<std::size_t> seq_;
std::size_t value;

auto load()
{
    std::size_t copy;
    std::size_t seq0;
    do
    {
        seq0 = seq_.load();
        if (!seq0) continue;
        copy = value;
        seq0 = seq_.load();
    } while (!seq0);

    return copy;
}

This isn't a seqlock or anything - it just waits for seq_ to change from zero to non-zero, and then reads value. The second read of seq_ is superfluous, as is the while condition, but without them the bug goes away.

This is now the read side of the well-known idiom which does work and is race-free: one thread writes to value, then sets seq_ non-zero with a release store. The threads calling load see the non-zero value, synchronize with the store, and so can safely read value. Of course, you can't keep writing to value - it's a "one time" initialization - but it is a common pattern.
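For concreteness, the write side of that idiom would be something like the following sketch (my own illustration using the same globals as the reduced reader above, not code from the question):

// One-time publication of value (the write side of the idiom described above).
void publish(std::size_t v)
{
    value = v;                                  // plain write of the payload
    seq_.store(1, std::memory_order_release);   // non-zero release store publishes it
}

A thread that then observes seq_ != 0 through load()'s seq_cst (hence at-least-acquire) load synchronizes with that store, so its subsequent read of value is race-free.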

With the above code, gcc is still hoisting the read of value:

load():
        mov     rax, QWORD PTR value[rip]
.L2:
        mov     rdx, QWORD PTR seq_[rip]
        test    rdx, rdx
        je      .L2
        mov     rdx, QWORD PTR seq_[rip]
        test    rdx, rdx
        je      .L2
        rep ret

Oops!

This behavior occurs up to gcc 7.3, but not in 8.1. Your code also compiles as you wanted in 8.1:

    mov     rbx, QWORD PTR seq_[rip]
    mov     rbp, QWORD PTR value[rip]
    mov     rax, QWORD PTR seq_[rip]
BeeOnRope
  • @rnpl - I'm not following. By "second read" do you mean the second read shown in the source, directly above the line `} while (!seq0);`? If that read has occurred, copy has already been assigned in the line directly above. I don't see any way this loop can exit without setting copy: it can only exit by reaching the `while` at the bottom, which means that the previous two lines which have no control flow must also have executed, and the first of these sets `copy`. – BeeOnRope Jun 25 '22 at 16:34
  • Hmm, ok - seems I misread the code. Yeah, looks like gcc did a redundant load elimination that it wasn't allowed to do, which made it think it could hoist the value load when it couldn't do so consistently with seq_cst. – rnpl Jun 26 '22 at 20:09

Reordering such operations is not allowed in general, but it is allowed in this case: any concurrently executing code that could yield a different result must invoke undefined behavior, because it would create a data race by interleaving the non-atomic read with an (atomic or non-atomic) write in a different thread.

The C++11 standard says:

Two expression evaluations conflict if one of them modifies a memory location (1.7) and the other one accesses or modifies the same memory location.

And also that:

The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior.

This applies even to things that occur before the undefined behavior:

A conforming implementation executing a well-formed program shall produce the same observable behavior as one of the possible executions of the corresponding instance of the abstract machine with the same program and the same input. However, if any such execution contains an undefined operation, this International Standard places no requirement on the implementation executing that program with that input (not even with regard to operations preceding the first undefined operation).

Because a non-atomic read that is concurrent with the write is undefined behavior (even if you overwrite and ignore the value you read), GCC is allowed to assume it does not occur and can therefore hoist the read out of the seqlock loop. It can do so because no initial (acquired) state that would cause the loop to execute multiple times guards against a race on the non-atomic read: a subsequent atomic or non-atomic write to the variable, beyond the initially acquired state, establishes no guaranteed synchronizes-with relationship with the load that precedes the non-atomic read. That is to say, a write to the non-atomically read variable could occur in between the seq_cst load and the subsequent read, which would be a data race. The fact that this "could" occur points to the lack of a synchronizes-with relationship and hence to undefined behavior, so the compiler may assume it doesn't happen, which allows it to assume that no concurrent write whatsoever will happen to that variable during the loop.
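For completeness, a reader with no data race at all would make value atomic too and load it with acquire, along the lines Peter Cordes suggests in the comments below. This is only a sketch of that suggestion, not verified code:

std::atomic<std::size_t> seq_;
std::atomic<std::size_t> value;   // payload is atomic as well, so reading it can never be a data race

std::size_t load()
{
    std::size_t copy, seq0, seq1;
    do
    {
        seq0 = seq_.load(std::memory_order_acquire);   // the value load can't be hoisted above this
        copy = value.load(std::memory_order_acquire);  // acquire keeps the second seq_ load below it
        seq1 = seq_.load(std::memory_order_relaxed);   // relaxed is enough for the re-check
    } while (seq0 & 1 || seq0 != seq1);
    return copy;
}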

rnpl
  • `-O2` still does a lot of optimizations; do you have any evidence that it would make this UB safe? (Upvoted because you correctly point out that the `value` in a seqlock needs to be atomic as well.) But you need the `value.load()` to happen between the two `seq.load()`s, and not reorder with either of them. Acquire only blocks reordering in 1 direction (http://preshing.com/20120913/acquire-and-release-semantics/), so I think you need `value.load()` to be an acquire-load as well. The 2nd load from `seq` can be relaxed though, and still be guaranteed to happen after `value.load(mo_acquire)`. – Peter Cordes May 29 '18 at 01:17
  • Oh, in this case the OP says that it happens to work for them with `gcc -O2`. But there's no reason to assume it's safe in general for other targets (especially non-x86 where regular loads don't have acquire semantics for free). – Peter Cordes May 29 '18 at 01:24
  • @PeterCordes - I don't think the compiler is relying on some complex proof of UB here: see my answer where it seems that similar code which _should_ be safe also appears to be compiled unsafely. – BeeOnRope May 29 '18 at 02:24
  • You can't rely on optimizations staying at levels 1, 2 or 3 in a compiler. Different releases of the compiler can move them around. – Zan Lynx May 29 '18 at 02:27
  • @BeeOnRope You might be right that this is a bug, but as far as the question goes, it is allowed. I added some citations to clarify that it causes undefined behavior. :) – rnpl May 29 '18 at 21:36
  • @Exaeta - how do you know there is a data race though? The visible program doesn't even contain _any_ writes at all. You say "creating a race condition in the read by interleaving a non-atomic read and write in different threads" - but there is no such interleaving in the code, only in the prose describing how a seqlock works. I still think you might be right, but the argument is more subtle (i.e., the read can be moved if gcc can prove that in any scenario where the result could have differed there is also a race). It is not necessarily illegal to write and read from the same non-atomic variable. – BeeOnRope May 29 '18 at 22:19
  • I was talking about the full test case that he linked. – rnpl May 30 '18 at 22:25
  • @curiousguy - I'm not quite following you. The part you quoted is from the OP, who seems to believe that the codegen is wrong. We are discussing data races here because this answer says "the compilation is valid because you have a data race, therefore UB, therefore anything goes". I'm pointing out that there isn't actually any data race in the shown code, so that argument seems at least incomplete. I agree with the rest of your comment but I'm not sure what point you are trying to make. – BeeOnRope Dec 12 '19 at 13:27
  • @curi - most of my replies are written on mobile where that doesn't work. – BeeOnRope Dec 15 '19 at 10:09
  • @BeeOnRope I understand. My previous explanation was not great let me start over (I'm deleting the previous comments). – curiousguy Dec 16 '19 at 00:36
  • @BeeOnRope "_this answer says "the compilation is valid because you have a data race therefore UB, therefore anything goes"_" Obviously we are looking at code compiled in isolation (there is no main!). The compiler is thus like us: it doesn't see the other side of the synchronization primitive; when compiling an acquire, it assumes there is a release somewhere. When there is a modification of a shared ordinary object S (non atomic), **the compiler assumes that the code implements mutual exclusion on accesses to S.** – curiousguy Dec 16 '19 at 00:55
  • Let M be a mutex and A an `atomic` and "=>" mean "can be rewritten". Do you agree that `S=1; M.lock(); S=2; M.unlock();` => `S=2; M.lock(); M.unlock();` w/o looking at other code? Also `S=1; A=1; S=2;` => `A=1; S=2;` (or alt. => `S=2; A=1;`)? There is no data race allowed on S, so the compiler knows that if it's manipulated outside mutual exclusion, the mutual exclusion device doesn't apply to the variable and the code can be moved. The question here is what reordering is possible on the acquire side (slightly more complicated). I will post an answer explaining the argument. – curiousguy Dec 16 '19 at 01:00
  • @BeeOnRope Maybe the better phrasing is that any condition which would change the result is a data race? – rnpl Dec 16 '19 at 21:39
  • @BeeOnRope The original code he linked here: http://coliru.stacked-crooked.com/a/84f84fda040b8324 does contain a race condition. But I'll acknowledge that the answer should be clearer about why it is a correct (if unlikely) optimization, and also that the other link he posted does not include a main. – rnpl Dec 16 '19 at 22:02