
I've written a basic graph scheduler that synchronises task execution in a wait-free manner. Since the graph topology is immutable, I figured I'd make all atomic operations relaxed. However, as I learned more about CPU hardware, I became concerned about the behaviour of my data structure on platforms with weaker memory models (I've only tested my code on x86). Here's the scenario that bothers me:

Thread 1 (T1) and Thread 2 (T2) concurrently update (non-atomically) memory locations X and Y respectively, and then proceed to execute other unrelated tasks.

Thread 3 (T3) picks up a dependent task after T1 and T2 are done, loads X and Y, and sums them up. There are no acquire/release synchronisations, thread joins, or locks being invoked, and T3's task is guaranteed to be scheduled after T1's and T2's are done.
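In code, the scenario is roughly this (a minimal sketch with made-up names; the actual scheduler and thread pool are omitted):

```cpp
int X = 0, Y = 0;                 // plain ints, written non-atomically

void task_T1() { X = 42; /* then other, unrelated work */ }
void task_T2() { Y = 7;  /* then other, unrelated work */ }

// Only scheduled after the (relaxed-atomic) scheduler sees T1's and T2's tasks complete.
int task_T3() { return X + Y; }   // is this guaranteed to see 42 and 7?
```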

Assuming that T1, T2, and T3 are scheduled (by the OS) on different CPU cores, my question is: In the absence of any memory fences or lock-like instructions, is T3 guaranteed to see the latest values of X and Y? Another way of asking this is: If you don't insert a fence, how long after a store can you perform a load, or are there no guarantees regarding that?

My concern is that there are no guarantees that the cores which executed T1 and T2 have flushed their store buffers by the time T3's core attempts to load that information. I tend to think of data races as data corruptions that happen due to a load and a store (or store and store) happening at the same time. However, I've come to realise that I'm not quite sure what at the same time really means given the distributed nature of CPUs at the micro scale. According to CppRef:

A program that has two conflicting evaluations has a data race unless:

  • both evaluations execute on the same thread or in the same signal handler, or
  • both conflicting evaluations are atomic operations (see std::atomic), or
  • one of the conflicting evaluations happens-before another (see std::memory_order)

This seems to imply that anyone using my graph scheduler would experience a data race (assuming they don't protect against it themselves) even if I can guarantee that T3 doesn't execute until T1 and T2 are done. I've yet to observe a data race in my tests but I'm not naive enough to think that tests alone suffice to prove this.

Alloth
  • I'm not intimately familiar with the C++ memory model, but I imagine that without a fence, T3 could very easily see stale data in its cache. If you're not doing any thread joins or otherwise synchronizing, then T3 being scheduled after T1 and T2 amounts to something of a gentleman's agreement. – Chuck Adams Feb 22 '20 at 17:46
  • *how long after a store can you perform a load,* ISO C++ makes zero guarantees about timing. It's almost always a bad idea to rely on timing / distance for correctness when all you need is acquire/release synchronization somewhere *in the scheduler itself*, e.g. T1 and T2 declaring themselves done using a release-store. Otherwise what does it even mean that T3 executes after T1 and T2? – Peter Cordes Feb 22 '20 at 19:04
  • (And yes, testing on x86 doesn't prove anything; the hardware memory model is basically acq_rel so compile-time reordering picks some legal order, and then that order runs with acq_rel.) – Peter Cordes Feb 22 '20 at 19:07
  • @PeterCordes cheers, Peter! I figured I'd ask before I add any additional synchronisation as I'm still rather new to non-SC atomics and take too much for granted. – Alloth Feb 22 '20 at 19:40

1 Answer


how long after a store can you perform a load

ISO C++ makes zero guarantees about timing. It's almost always a bad idea to rely on timing / distance for correctness.

In this case all you need is acquire/release synchronization somewhere in the scheduler itself, e.g. T1 and T2 declaring themselves done using a release-store, and the scheduler checking for that with an acquire load.

Otherwise, what does it even mean that T3 executes after T1 and T2? If the scheduler could see an "I'm done" store way early, it could start T3 before T1 or T2 has finished all of its stores.

If you make sure that everything in T3 happens after T1 and T2 (using acquire loads that "synchronize-with" a release store from each of T1 and T2), you don't even need to use atomics in T1 and T2, just in the scheduler machinery.
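For concreteness, a minimal sketch of that pattern (the flag and variable names are illustrative, not from the question):

```cpp
#include <atomic>

int x, y;                                     // plain data written by the tasks
std::atomic<bool> done1{false}, done2{false};

void t1() { x = 42; done1.store(true, std::memory_order_release); }
void t2() { y = 7;  done2.store(true, std::memory_order_release); }

// Scheduler / worker side: wait for both tasks, then run T3's body.
int t3() {
    while (!done1.load(std::memory_order_acquire)) { /* spin, yield, or sleep */ }
    while (!done2.load(std::memory_order_acquire)) { }
    // The acquire loads synchronize-with the release stores above, so the
    // plain reads of x and y here are not data races and see 42 and 7.
    return x + y;
}
```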

Acquire loads and release stores are relatively cheap compared to seq_cst. On real HW, a seq_cst store has to flush the store buffer before later loads can run; a release store doesn't. x86 does acq_rel for free.


(And yes, testing on x86 doesn't prove anything; the hardware memory model is basically acq_rel so compile-time reordering picks some legal order, and then that order runs with acq_rel.)


I'm not sure if starting a new thread guarantees that everything in that thread "happens after" that point in this thread. If so, then this is formally safe.

If not, then in theory IRIW reordering is something to worry about. (All threads using seq_cst loads have to agree about the global order of seq_cst stores, but not with weaker memory orders. In practice PowerPC is the hardware that can do this in real life, AFAIK, and only for short-ish windows; see Will two atomic writes to different locations in different threads always be seen in the same order by other threads?) Any std::thread constructor would involve a system call and be long enough in practice, as well as involving barriers anyway, whether ISO C++ formally guarantees that or not.

If you're not starting a new thread, but instead storing a flag for a worker to see, then acq/rel is again enough; happens-before is transitive so A -> B and B -> C implies A -> C.
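A rough sketch of that transitive chain, with illustrative names (A = finishing task, B = scheduler, C = worker):

```cpp
#include <atomic>

int payload;                            // written non-atomically by T1
std::atomic<bool> t1_done{false};
std::atomic<bool> t3_ready{false};

void t1() {                             // A
    payload = 42;
    t1_done.store(true, std::memory_order_release);
}

void scheduler() {                      // B
    while (!t1_done.load(std::memory_order_acquire)) { }
    t3_ready.store(true, std::memory_order_release);   // hand T3 to a worker
}

void worker_running_t3() {              // C
    while (!t3_ready.load(std::memory_order_acquire)) { }
    // A happens-before B and B happens-before C, so this plain read of
    // `payload` is guaranteed to see 42, with no atomics in the task bodies.
    int v = payload;
    (void)v;
}
```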

Peter Cordes
  • Thanks for the added detail, Peter! I'm scheduling my tasks on a thread pool that is completely unaware of the graph structure, so the graph itself will have to perform the acquire/release synchronisation. Every task has an atomic "pending" counter that schedules it when it hits 0. It's the perfect place to do a release fetch_sub, but I wasn't quite sure where to put the accompanying acquire without having the compiler optimise it away. I suppose I can do an acquire exchange on the thread that picks up the task, right before executing it. I'll need to think about this some more! (A sketch of this counter pattern follows these comments.) – Alloth Feb 22 '20 at 20:24
  • @AllothTian: oh, I see, so you don't have a "manager" thread per se. Yeah, you probably want atomic RMWs on a counter rather than some kind of array of booleans. RMWs aren't as cheap, but unless the work you're doing per-thread is way too tiny, that little extra overhead is fine. And an array of booleans would still need both T1 and T2 to contend for write access to the cache line, and that's most of the cost. – Peter Cordes Feb 22 '20 at 20:31
  • @AllothTian: And yes, to "claim" a task after checking that it's ready to run, you want some kind of atomic RMW of at least acquire strength, to both make sure only 1 worker claims it and to sync-with the release stores. – Peter Cordes Feb 22 '20 at 20:32
  • Yup, the tasks have a counter that sits in its own cache line to avoid false sharing, and any thread can enqueue other tasks onto the thread pool so there's no centralised manager. Paying for an extra RMW is unfortunate but I don't think I have any other options since I doubt atomic_thread_fences would do anything (and might be too heavy anyway). – Alloth Feb 22 '20 at 20:46
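For reference, a rough sketch of the pending-counter scheme discussed in these comments (all names here are illustrative, and the thread-pool queue is assumed to provide its own acquire/release handoff when tasks are enqueued and dequeued):

```cpp
#include <atomic>

struct Task {
    // Number of unfinished predecessor tasks; the task is runnable at 0.
    std::atomic<int> pending;
    // Set by the last predecessor; polling workers race to claim it.
    std::atomic<bool> ready{false};
    // ... task body, successor list, etc.
};

// Called by each predecessor after it has written its (non-atomic) results.
void on_predecessor_done(Task& t) {
    // acq_rel: the release half publishes this predecessor's writes, and the
    // acquire half lets the thread that reaches 0 also see every *other*
    // predecessor's writes.
    if (t.pending.fetch_sub(1, std::memory_order_acq_rel) == 1) {
        t.ready.store(true, std::memory_order_release);
        // ...or enqueue(&t) on the pool and rely on the queue's own synchronisation.
    }
}

// A polling worker claims the task with an RMW of at least acquire strength:
// it guarantees a single winner *and* synchronizes-with the release store
// above, so the worker sees all predecessors' writes before running the task.
bool try_claim(Task& t) {
    return t.ready.exchange(false, std::memory_order_acquire);
}
```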