I've written a basic graph scheduler that synchronises task execution in a wait-free manner. Since the graph topology is immutable, I figured I'll make all atomic operations relaxed. However, as I learned more about the CPU hardware, I started to become concerned about the behaviour of my data structure on platforms with weaker memory models (I've only tested my code on x86). Here's the scenario that bothers me:
Thread 1 (T1) and Thread 2 (T2) concurrently update (non-atomically) memory locations X and Y respectively, and then proceed to execute other unrelated tasks.
Thread 3 (T3) picks up a dependent task after T1 and T2 are done, loads X and Y, and sums them up. There are no acquire/release synchronisations, thread joins, or locks being invoked, and T3's task is guaranteed to be scheduled after T1's and T2's are done.
Assuming that T1, T2, and T3 are scheduled (by the OS) on different CPU cores, my question is: In the absence of any memory fences or lock-like instructions, is T3 guaranteed to see the latest values of X and Y? Another way of asking this is: If you don't insert a fence, how long after a store can you perform a load, or are there no guarantees regarding that?
My concern is that there are no guarantees that the cores which executed T1 and T2 have flushed their store buffers by the time T3's core attempts to load that information. I tend to think of data races as data corruptions that happen due to a load and a store (or store and store) happening at the same time. However, I've come to realise that I'm not quite sure what at the same time really means given the distributed nature of CPUs at the micro scale. According to CppRef:
A program that has two conflicting evaluations has a data race unless:
- both evaluations execute on the same thread or in the same signal handler, or
- both conflicting evaluations are atomic operations (see std::atomic), or
- one of the conflicting evaluations happens-before another (see std::memory_order)
This seems to imply that anyone using my graph scheduler would experience a data race (assuming they don't protect against it themselves) even if I can guarantee that T3 doesn't execute until T1 and T2 are done. I've yet to observe a data race in my tests but I'm not naive enough to think that tests alone suffice to prove this.