// Members of a single-producer single-consumer ring buffer; `type` and `size`
// are class-template parameters, and `size` must be a power of two for the
// index mask below to work. (Requires <atomic>, <array>, <cstdint>.)
std::atomic_uint64_t writing_ {0};   // index of the next slot the producer will write
std::atomic_uint64_t reading_ {0};   // index of the next slot the consumer will read
std::array<type, size> storage_ {};

bool try_enqueue(type t) noexcept
{
    const std::uint64_t writing {
        writing_.load(std::memory_order_relaxed)};
    const auto last_read {reading_.load(std::memory_order_relaxed)};
    if (writing - last_read < size) {                 // queue not full
        storage_.at(writing & (size - 1)) = t;        // write the element first...
        writing_.store(writing + 1, std::memory_order_release); // ...then publish it

        return true;
    }
    else
        return false;
}

In the above code, as I understand it, if the condition evaluates to false, it is impossible for any thread to observe a write to the shared storage. Is it correct that an operation cannot be perceived as having occurred before a conditional it is sequenced after? Or am I completely misreading this, and could such a thing actually occur (perhaps via speculative execution)?

A little more specifically, could the processor speculatively execute the write (when the condition will eventually evaluate to false), another thread observe the write as having occurred, and then the first thread discard the speculative write?

(note: this is single-producer single-consumer)

D. Detweiler
  • I think you can safely suppose that all threads will run as if this write never happened, as long as those threads do not measure cache latency. – Oliv Sep 14 '19 at 17:41
  • If I'm reading the code correctly, you are assuming that there is at most one writer thread at any given time, right? Because otherwise, this *seems* trivially wrong (where you store `writing + 1` in `writing_`). – yzt Sep 14 '19 at 17:41
  • Can you add the declarations for `writing_` and `storage_`? Otherwise we'll have to guess. – 1201ProgramAlarm Sep 14 '19 at 17:42
  • To answer your "more specifically" part, I have to agree with @Oliv as well. The speculative execution path in the CPU will never produce direct results that are observable by any code that is not on the speculative execution path while that path is only "speculative". (By direct result I mean values in registers and memory, as opposed to indirect results like cache timing and whatnot.) – yzt Sep 14 '19 at 17:44
  • A race-free program cannot observe an operation `A` that is sequenced-before `B` to have an observable side effect that follows the observable side effect of `B`; or else the compiler is non-conforming (aka buggy). A program containing data races exhibits undefined behavior, and can observe pretty much anything; the C++ standard places no requirements on the execution of such a program. Looking at your example, I don't quite see how it can be executed by multiple threads concurrently and *not* contain a data race. – Igor Tandetnik Sep 14 '19 at 17:57
  • @yzt yes, this is single-producer single-consumer – D. Detweiler Sep 14 '19 at 22:04
  • @IgorTandetnik try_enqueue is only called on one (producing) thread – D. Detweiler Sep 14 '19 at 22:07
  • If the CPU chooses to perform speculative execution, then it must do so in a way where neither thread can observe it happening. And since the program can't tell the difference, the question is of mostly academic interest. – Igor Tandetnik Sep 14 '19 at 23:10
  • @IgorTandetnik Thank you, that's what I was looking for. So, just to be clear, there is no way the "other" thread could observe the store in question if the condition evaluates to false? – D. Detweiler Sep 15 '19 at 17:52
  • Correct. At least, no way that can be produced by a conforming C++ compiler compiling a program that exhibits well-defined behavior. – Igor Tandetnik Sep 16 '19 at 03:16

1 Answer


Is it correct that an operation cannot be perceived as having occurred before a conditional it is sequenced after?

C++ compilers definitely aren't allowed to invent writes to atomic (or volatile) objects.

Compilers aren't even allowed to invent writes to non-atomic objects (e.g. turn a conditional write into read + cmov + write), because C++11 introduced a memory model that makes it well-defined for two threads to run code like that at the same time, as long as at most one of them actually writes (and any read of the object is suitably ordered with respect to it). Two non-atomic RMWs could step on each other, so the result would not behave "as-if" the C++ abstract machine were running the source code, and emitting asm that does that is therefore not an option for the compiler.
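
To make that concrete, here is a small sketch of my own (not code from the question or the answer) of the kind of conditional non-atomic write a compiler must not turn into an unconditional one:

// `x` is a plain int shared between threads. One thread calls this with
// do_write == true while another thread only reads x; that is race-free
// as written, because only one thread ever stores to x.
void maybe_store(int& x, bool do_write, int value)
{
    if (do_write)
        x = value;
    // The compiler must NOT rewrite this as the branchless
    //     int old = x;
    //     x = do_write ? value : old;   // invented write!
    // because that stores to x even when do_write is false and would
    // race with the concurrent reader.
}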

But if the compiler knows an object is always written, it can pretty much do whatever it wants, because a legal program can't observe the difference: that would involve data-race UB.
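
By contrast, a hedged sketch (again my own, not from the answer) of the always-written case where an unconditional store is fine:

// x is stored to on every path, so a branchless rewrite changes nothing
// that a race-free program could observe.
void always_store(int& x, bool flag)
{
    if (flag)
        x = 1;
    else
        x = 2;
    // May legally be compiled as:  x = flag ? 1 : 2;
}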


A little more specifically, could the processor speculatively execute the write (when the condition will eventually evaluate to false), another thread observe the write as having occurred, and then the first thread discard the speculative write?

No, speculation doesn't escape the core doing the speculation. Otherwise when mis-speculation is detected, all cores would have to roll back their state!

This is one of the primary reasons store buffers exist: to decouple OoO speculative execution of stores from commit to L1d cache (which is the point at which a store becomes globally visible to other cores), and to decouple execution from cache-miss stores, which is useful even on in-order, non-speculative CPUs.

Stores don't commit to L1d until after the store instruction has retired from the out-of-order core (i.e. is known to be non-speculative). Retired stores that haven't committed yet are sometimes called "graduated" to distinguish them from other store-buffer entries that could potentially be discarded if the core needs to roll back to retirement state.

This allows hardware speculative execution without inventing writes.

(see also Can a speculatively executed CPU branch contain opcodes that access RAM? for more details. Fun fact: some CPUs, specifically PowerPC, can do store-forwarding of graduated stores between SMT threads on the same physical core, making stores visible to some cores before they become globally visible. But only for graduated stores, otherwise possible mis-speculation could leak out.)


In C++, a std::memory_order_release store forces the compiler to use sufficient barriers or release-store instructions (e.g. on x86 a normal mov is already a release store; on AArch64, stlr is a sequential-release store), or whatever other mechanism is needed to make the asm guarantee runtime ordering at least as strong as what the C++ abstract machine guarantees.

C++ defines its memory model in terms of sequenced-before and happens-before relations, not barriers, but on any given platform the implementation / ABI standardizes on a mapping from std::atomic operations to asm sequences (e.g. https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html).
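
For illustration, here is a hedged sketch of what the consumer side of the queue in the question could look like. The member names (writing_, reading_, storage_) and the power-of-two size come from the question, but this function and the exact memory orders it uses (the common acquire/release pairing) are my own assumption, not something given in the question or the answer.

// Hypothetical try_dequeue counterpart to the question's try_enqueue.
bool try_dequeue(type& out) noexcept
{
    const std::uint64_t reading {
        reading_.load(std::memory_order_relaxed)};    // only this thread modifies reading_
    const std::uint64_t written {
        writing_.load(std::memory_order_acquire)};    // pairs with the producer's release store
    if (written == reading)
        return false;                                 // queue empty
    // The acquire load above synchronizes with the producer's release store,
    // so the element written into storage_ happens-before this read.
    out = storage_.at(reading & (size - 1));
    reading_.store(reading + 1, std::memory_order_release);  // hand the slot back to the producer
    return true;
}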

Peter Cordes
  • `speculation doesn't escape the core` - up until 1-2 years ago everyone agreed that was true. Now, I would phrase that as "doesn't escape architecturally", but you can sometimes still observe it through side channels (although I know the question isn't about that) – Leeor Sep 17 '19 at 19:51
  • @Leeor: Yeah, things like Spectre and L1TF can cross *logical* core boundaries on the same physical core. But speculation still doesn't escape the *physical* core. Other than speculative MESI requests, which can also happen from HW prefetch instead of speculation. Anyway, yes, good footnote, thanks. – Peter Cordes Sep 17 '19 at 20:01