Still can't get a true root cause explanation on IRIW scenario

Question

Below is the classic example in which neither func_3 nor func_4 may print (assuming each function runs on its own thread). However, I have yet to find an explanation that satisfies my curiosity for details. Assume the following event order. If func_1 runs first (x.store), then func_3's x.load synchronizes with thread 1 on x (because they form a release-acquire pair). Thread 3 may then not print, because y.store hasn't happened by the time it examines y. Next, assume y.store happens in thread 2. It synchronizes with thread 4, which gets past the while condition. The next load of x should synchronize with thread 1 and satisfy the print condition, right? The only way thread 4 doesn't print is if the compiler can reorder the y and x loads in thread 4 so that y.load happens after x.load. Is that the reason neither message may be printed?

#include <atomic>
#include <iostream>

std::atomic<bool> x{false}, y{false};

// Writers: each thread stores to exactly one flag.
void func_1() {
  x.store(true, std::memory_order_release);
}
void func_2() {
  y.store(true, std::memory_order_release);
}
// Readers: each spins on one flag, then checks the other.
void func_3() {
  while(!x.load(std::memory_order_acquire));
  if(y.load(std::memory_order_acquire)) {
    std::cout << "x == true then also y == true \n";
  }
}
void func_4() {
  while(!y.load(std::memory_order_acquire));
  if(x.load(std::memory_order_acquire)) {
    std::cout << "y == true then also x == true \n";
  }
}

Never try to reason about the execution order of threads; you'll always be wrong. In this case, while it's not technically UB, the behavior is non-deterministic; in practice I'd expect both to print, in random order. — Mgetz, Aug 31 '21 at 18:29
Possible Duplicate [Acquire/Release versus Sequentially Consistent memory order](https://stackoverflow.com/questions/14861822/acquire-release-versus-sequentially-consistent-memory-order) — apple apple, Aug 31 '21 at 18:32
The release-stores to `x` and `y` are pointless and don't synchronize anything, because there's no previous intra-thread operation before them *to* synchronize. — EOF, Aug 31 '21 at 18:32
This is the IRIW litmus test. Very few real implementations can do such reordering, the notable exception being POWER CPUs that can store-forward between SMT threads before stores become globally visible. See my answer on [Will two atomic writes to different locations in different threads always be seen in the same order by other threads?](https://stackoverflow.com/a/50679223) — Peter Cordes, Aug 31 '21 at 18:45
@PeterCordes I remember reading about [that fun on the XBox360](https://randomascii.wordpress.com/2018/01/07/finding-a-cpu-design-bug-in-the-xbox-360/)... that was insane, and didn't always work right IIRC. — Mgetz, Aug 31 '21 at 18:50
I did look at related threads before posting this question; none of the answers gets to the bottom of why. Most of them just say that you can't assume threads 3 and 4 see the same order of changes to x and y (made by threads 1 and 2). I understand that this is the behavior, but not why. I have seen one or two comments that say it is because y.load and x.load in thread 4 may be reordered (which I pointed out in my hypothesis). — Kenneth, Aug 31 '21 at 19:29
Did you read my answer, linked in my previous comment? It explains a specific hardware / CPU-architecture mechanism that produces this reordering on POWER CPUs, while still maintaining other correctness properties. That's why I added it as a duplicate of this question. No, it's not because of local load-reordering in the readers, the use of acquire loads does successfully prevent that. — Peter Cordes, Aug 31 '21 at 22:00
If you mean in C++ language-lawyer terms, it's because ISO C++ is intentionally that weak for non-seq-cst operations so PowerPC doesn't need so many expensive barriers. The way that's done in the standard's formalism is that a total order of *all* atomic ops across different objects is only required to exist for seq_cst operations. (And IIRC, only if you don't also use weaker operations on the same objects or something like that, although on most real implementations it's not that fragile.) — Peter Cordes, Aug 31 '21 at 22:01
@PeterCordes Interpreting the C++ standard's wording on acquire/release, I couldn't arrive at the conclusion that it is possible to observe no printouts. As you have stated, one should be able to reason about this and reach the same conclusion without needing to understand the POWER or ARM architecture. I haven't seen anyone list all the permutations of instruction interleaving, together with the standard's rules, that produce the result. — Kenneth, Sep 01 '21 at 00:05
