I am currently trying to understand the following code example from the book "C++ Concurrency in Action":
#include <stdio.h>
#include <atomic>
#include <thread>
#undef NDEBUG // for release builds
#include <assert.h>

std::atomic<bool> x, y;
std::atomic<int> z;

void write_x()
{
    x.store(true, std::memory_order_release);
}

void write_y()
{
    y.store(true, std::memory_order_release);
}

void read_x_then_y()
{
    while (!x.load(std::memory_order_acquire))
        ;
    if (y.load(std::memory_order_acquire))
        ++z;
}

void read_y_then_x()
{
    while (!y.load(std::memory_order_acquire))
        ;
    if (x.load(std::memory_order_acquire))
        ++z;
}

int main(int argc, char *argv[])
{
    for (;;)
    {
        x = false;
        y = false;
        z = 0;
        std::thread a(write_x);
        std::thread b(write_y);
        std::thread c(read_x_then_y);
        std::thread d(read_y_then_x);
        a.join();
        b.join();
        c.join();
        d.join();
        assert(z.load() != 0);
    }
    return 0;
}
I don't quite understand how the assertion can fire, as the book claims. How would the lines of the four threads have to be interleaved for z to end up equal to zero? In principle, that can only happen if lines can be moved past the spin loop, right? But as I understood it, an acquire guarantees that loads and stores after the acquire cannot be reordered before the acquire, and a load before an acquire cannot be reordered after the acquire.
Thanks in advance for clarification.
EDIT
I think the whole thing would be easier to understand if someone explained it from a technical perspective (caches, fences, reordering, flushes, invalidations, etc.) using a concrete case study that ends with z=0.
Here is an example interleaving to illustrate what I mean, but I need one that results in z=0 (at the technical level):
Thread 1    Thread 2    Thread 3         Thread 4
                        while (!x);      while (!y);
x=1                     while (!x);      while (!y);
                        if (y) // y=0    while (!y);
            y=1                          while (!y);
                                         while (!y);
                                         if (x) // x=1
                                         z++

Result: z=1
What I also do not quite understand is the following: why does the release constraint have to be specified for the stores at all? Why isn't relaxed enough here? There are no earlier operations (loads or stores) above the stores that would have to be flushed, or whose reordering would have to be prevented by a fence. Isn't write_x equivalent to this?
// Store / loads which may not cross fence boundary.
// No store / loads here, so why needs to be a fence here?
std::atomic_thread_fence(std::memory_order_release);
x.store(true, std::memory_order_relaxed);
I have run the above example on my x86-64 machine for a long time (endless for-loop) and have not seen the assert fire yet. Is this due to the strong x86 memory model?
EDIT 2
I revisited the topic after a long time and found the following Stack Overflow thread: Will two atomic writes to different locations in different threads always be seen in the same order by other threads? I think it gave me the explanation at the technical level: the result can come about as follows. Precondition: a weakly-ordered CPU with SMT (hyperthreading). There, it seems that one logical processor can read data from the store buffer it shares with another logical processor running on the same core (store forwarding). Schematically, it would look like this:
+--------------------------------------------------------+
| Core 0 |
+--------------------------------------------------------+
| Logical Core #0 |
+--------------------------------------------------------+
| x.store(true, std::memory_order_release); | <- Place x into StoreBuffer first before committing to L1D
+--------------------------------------------------------+
| Logical Core #1 |
+--------------------------------------------------------+
| while (!x.load(std::memory_order_acquire)) | <- Read new x from StoreBuffer which is not visible for other Cores yet (x = true)
| ; |
| if (y.load(std::memory_order_acquire)) | <- new y not visible yet, still in StoreBuffer of other Core (y = false)
| ++z; |
| |
+--------------------------------------------------------+
The same happens on the other core:
+--------------------------------------------------------+
| Core 1 |
+--------------------------------------------------------+
| Logical Core #0 |
+--------------------------------------------------------+
| y.store(true, std::memory_order_release); | <- Place y into StoreBuffer first before committing to L1D
+--------------------------------------------------------+
| Logical Core #1 |
+--------------------------------------------------------+
| while (!y.load(std::memory_order_acquire)) | <- Read new y from StoreBuffer which is not visible for other Cores yet (y = true)
| ; |
| if (x.load(std::memory_order_acquire)) | <- new x not visible yet, still in StoreBuffer of other Core (x = false)
| ++z; |
| |
+--------------------------------------------------------+
Is my assumption correct, and can anyone confirm it? Thank you in advance.