Trying to understand code example with atomics from book "C++ Concurrency in Action" (page 133)

Question

I am currently trying to understand the following code example from the book "C++ Concurrency in Action":

#include <stdio.h>

#include <atomic>
#include <thread>

#undef NDEBUG // for release builds
#include <assert.h>

std::atomic<bool> x, y;
std::atomic<int> z;

void write_x()
{
    x.store(true, std::memory_order_release);
}

void write_y()
{
    y.store(true, std::memory_order_release);
}

void read_x_then_y()
{
    while (!x.load(std::memory_order_acquire))
        ;
    if (y.load(std::memory_order_acquire))
        ++z;
}

void read_y_then_x()
{
    while (!y.load(std::memory_order_acquire))
        ;
    if (x.load(std::memory_order_acquire))
        ++z;
}

int main(int argc, char *argv[])
{
    for (;;)
    {
        x = false;
        y = false;
        z = 0;

        std::thread a(write_x);
        std::thread b(write_y);
        std::thread c(read_x_then_y);
        std::thread d(read_y_then_x);

        a.join();
        b.join();
        c.join();
        d.join();

        assert(z.load() != 0);
    }

    return 0;
}

I don't quite understand how the assertion can fire, as is claimed in the book. How do the lines of code of the four threads have to be arranged side by side so that z is equal zero. In principle, this can only happen if the lines can be rearranged after the spin-lock, right?. But as I understood it, an Acquire guarantees that Loads and Stores after the Acquire cannot be reordered before the Acquire. And a Load before an Acquire cannot be reordered after an Acquire.

Thanks in advance for clarification.

EDIT

I think the whole thing would be more understandable if someone explained it from a technical perspective (cache, fences, reordering, flush, invalidate, etc.) and using a case study with z=0.

Example to explain it, but I need it for z=0 (technically):

Thread 1  Thread 2    Thread 3        Thread 4

                      while (!x);     while (!y);
x=1                   while (!x);     while (!y);
                      if (y) // y=0   while (!y);
          y=1                         while (!y);
                                      while (!y);
                                      if (x) // x=1
                                      z++

Result: z=1

What I also do not quite understand is the following: Why does the release constraint have to be specified for the stores at all? Why is relaxed not enough here? In principle, there are no further operations (loads, stores) specified above the stores that have to be flushed or whose reordering has to be prevented via a fence. Isn't write_x equivalent to this?

// Store / loads which may not cross fence boundary.
// No store / loads here, so why needs to be a fence here?
std::atomic_thread_fence(std::memory_order_release);
x.store(true, std::memory_order_relaxed);

I have run the above example on my x86-64 for a long time (endless for-loop) and have not encountered an assert yet. Is this due to the strong memory model of the Intel?

EDIT 2

I revisited the topic after a long time and found the following stackoverflow thread: Will two atomic writes to different locations in different threads always be seen in the same order by other threads? . I think I found the explanation on a technical level through this. Namely, this result can come about as follows. Precondition: Weakly-ordered CPU with SMT (Hyperthreading) functionality. There it seems that one logical processor can pull data from the shared store buffer of another logical processor running on the same core (= Store Forwarding). Schematically, it would be like this:


+--------------------------------------------------------+
| Core 0                                                 |
+--------------------------------------------------------+
| Logical Core #0                                        |
+--------------------------------------------------------+
| x.store(true, std::memory_order_release);              | <- Place x into StoreBuffer first before committing to L1D
+--------------------------------------------------------+
| Logical Core #1                                        |
+--------------------------------------------------------+
| while (!x.load(std::memory_order_acquire))             | <- Read new x from StoreBuffer which is not visible for other Cores yet (x = true)
|   ;                                                    |
| if (y.load(std::memory_order_acquire))                 | <- new y not visible yet, still in StoreBuffer of other Core (y = false)
|   ++z;                                                 |
|                                                        |
+--------------------------------------------------------+

Same behavior for other Core:

+--------------------------------------------------------+
| Core 1                                                 |
+--------------------------------------------------------+
| Logical Core #0                                        |
+--------------------------------------------------------+
| y.store(true, std::memory_order_release);              | <- Place y into StoreBuffer first before committing to L1D
+--------------------------------------------------------+
| Logical Core #1                                        |
+--------------------------------------------------------+
| while (!y.load(std::memory_order_acquire))             | <- Read new y from StoreBuffer which is not visible for other Cores yet (y = true)
|   ;                                                    |
| if (x.load(std::memory_order_acquire))                 | <- new x not visible yet, still in StoreBuffer of other Core (x = false)
|   ++z;                                                 |
|                                                        |
+--------------------------------------------------------+

Is my assumption correct and can anyone confirm this? Thank you in advance.

`atomic x,y,z;` makes *no* sense. [std::atomic](https://en.cppreference.com/w/cpp/atomic/atomic) needs a type template parameter. In any case, please always post a [mcve]. Otherwise noone else can test your code. — Jesper Juhl, Feb 01 '20 at 14:21
"// Simplified example in pseudo-C++" - No. Don't do that. Post a [mcve] instead. — Jesper Juhl, Feb 01 '20 at 14:24
@JesperJuhl OP does mention that it is simplified pseudo-c++. I think it is clear what they mean. — walnut, Feb 01 '20 at 14:24
`void main() {` - That has never been valid C++. `main` returns `int` *always*. — Jesper Juhl, Feb 01 '20 at 14:25
@OP: The next page of the book explains what happens in a nice diagram and with a description. What exactly is not clear to you? — walnut, Feb 01 '20 at 14:25
And please don't post questions refering to books that everyone may not have. *All* relevant information needs to be *in* the question! — Jesper Juhl, Feb 01 '20 at 14:27
I have adapted the code example so that it is comprehensible and executable without having to have the book. I also added a few more questions. I would like to have this technically explained for the case z=0. The answers below did not help much. I also read the blog entry of Preshing again to understand this better (Link: https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/). Atomics are not easy. Many thanks in advance. — mnemonix, Feb 02 '20 at 15:15
@mnemonix This kind of diagram is not useful here, because there is no total sequential order in which stores happen. Different threads can see stores happen in different order. You are correct that release/acquire here is pointless and that it is basically just identical to relaxed operations. The store on `x` could propagate to the `read_x_then_y` thread before the store on `y` does, but at the same time the store on `y` could propagate to the `read_y_then_x` thread before the `x` store does. — walnut, Feb 02 '20 at 16:02
@walnut Thanks, walnut, for your answer. Hm, but it seems to me there is something missing for me to fully understand it. I don't want to understand it on a formal level, but rather on a technical level (case study z=0 using any CPU). We basically only have load/stores via atomics here. Aren't atomics normally written and read directly into/from main memory? Or why not all threads see the same value of x and y at the same time? Because then z can never become 0, right? Maybe I am too stupid to understand it. In my practical example (endless loop) no assert was fired yet... — mnemonix, Feb 02 '20 at 17:36
Maybe the probability is also very small that this happens and you have to run the program for a week. Any reference, article, blog post about the issue to better understand this example? — mnemonix, Feb 02 '20 at 17:37

score 1 · Answer 1 · answered Feb 01 '20 at 14:37

The problem is that the two stores of x and y are not ordered with each other, therefore they cannot be used to synchronize visibility:

atomic x,y,z;
void write_x() {
  x.store(true, release);
}
void write_y() {
  y.store(true, release);
}

void read_x_then_y() {
  while (!x.load(acquire))
    ;
  // has seen x==true, but there is no guarantee that y.load() will return true
  if (y.load(acquire))
    ++z;
}

void read_y_then_x() {
  while (!y.load(acquire))
    ;
  // has seen y==true, but there is no guarantee that x.load() will return true
  if (x.load(acquire))
    ++z;
}

void main() {
  x=false;
  y=false;
  z=0;
  thread a(write_x);
  thread b(write_y);
  thread c(read_x_then_y);
  thread d(read_y_then_x);
  join(a,b,c,d);
  assert(z.load()!=0);
}

So it can happen that both threads read false after the while loop, and therefore z is never incremented.

The point is that the store operations use memory_order_release, so there is no single total order, and that is also why arranging the lines side-by-side does not help to explain this behavior.

score 0 · Answer 2 · answered Feb 01 '20 at 14:38

Acquire/release only enforces an ordering with stores before the acquire operation in the thread performing the acquire operation.

There is no store before the acquire operations and therefore effectively no additional ordering is imposed relative to the relaxed case.

The only ordering imposed also in the relaxed case is that the atomic variables individually have a total modification order that all threads agree on. But two threads can disagree on which of the two stores on different atomics in your example happened first.

Trying to understand code example with atomics from book "C++ Concurrency in Action" (page 133)

2 Answers2