
In the following code:

```c++
int a = A.load(std::memory_order_acquire);

T b = load_non_atomic(data);

// ---- barrier ----

int c = A.load(std::memory_order_acquire);
```

What kind of barrier should I use to prevent `load_non_atomic()` from being reordered after the load of `c`, even on weakly ordered architectures (e.g. ARM)?

Intuitively I would reach for `std::atomic_thread_fence(std::memory_order_release)` to disallow read/write operations from being reordered after it, but is it even allowed to use release ordering for loads?
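For context, here is a minimal, self-contained sketch of the setup the snippet assumes. The concrete types and the body of `load_non_atomic` are illustrative guesses, not the asker's real code; `A` plays the role of a version counter and `data` is the non-atomic payload:

```cpp
#include <atomic>
#include <cassert>
#include <cstring>

struct T { int payload[4]; };       // hypothetical payload type

std::atomic<int> A{0};              // version counter, written by the writer thread
T data{};                           // non-atomic shared payload

// Plain (non-atomic) read of the shared payload.
T load_non_atomic(const T& src) {
    T out;
    std::memcpy(&out, &src, sizeof(T));
    return out;
}

bool read_was_consistent() {
    int a = A.load(std::memory_order_acquire);
    T b = load_non_atomic(data);
    // ---- barrier? (the subject of this question) ----
    int c = A.load(std::memory_order_acquire);
    (void)b;
    return a == c;  // payload is trusted only if the counter did not change
}
```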

  • You can always use `std::memory_order_seq_cst` which should force a full barrier in both directions but is rather heavy. – Mgetz Oct 14 '22 at 13:11
  • @Mgetz yeah, but I'm looking for the most efficient, but also correct way to do that. – Alekseev Artem Oct 14 '22 at 13:14
  • `std::memory_order_seq_cst` is definitely _correct_, hence the recommendation. `std::atomic::load` can't use `std::memory_order_release`; that's UB. Looking at cppreference, `std::atomic_thread_fence(std::memory_order_release)` looks like it won't do it either: it only prevents moving past _stores_, not loads. – Mgetz Oct 14 '22 at 13:23
  • @Mgetz yeah, but on ARM both `std::memory_order_seq_cst` and `std::memory_order_release` generate a `DMB ISH` instruction, which is a full memory barrier. `std::memory_order_acquire` generates a `DMB ISHLD` barrier, which looks like what I need (but I'm not sure yet). From the C++ side, though, I'm not sure how an acquire barrier could do the job here. – Alekseev Artem Oct 14 '22 at 13:32
  • Well, remember that just because it generates the same instruction doesn't mean it's the same from the standard's perspective. Using the wrong one will cause breakage later when the underlying machine changes or compilers change what they emit for it. Unless you _know_ there is a major perf issue here, it's better to be conservative. – Mgetz Oct 14 '22 at 13:36
  • Yeah, thanks for your answer @Mgetz, but I'm still curious, if we could solve that problem without seq_cst, as we really don't want a full memory barrier to appear. – Alekseev Artem Oct 14 '22 at 13:52
  • @Mgetz: Where do you want to put `seq_cst`? Putting it on the second `A.load` doesn't suffice. A `seq_cst` load is only acquire for ordering with respect to non-atomic operations. The extra sequential consistency guarantees only apply to ordering with respect to other atomic `seq_cst` operations. – Nate Eldredge Oct 14 '22 at 13:57
  • What you really want is for the non-atomic `b` load to be atomic and acquire. In principle you can get the same ordering effect by putting an acquire fence after your non-atomic load. – Nate Eldredge Oct 14 '22 at 13:59
  • @NateEldredge they'd have to use `atomic_thread_fence` I just don't see any other options than what you're saying to do. – Mgetz Oct 14 '22 at 14:01
  • But the bigger question is, what is on the other side of this code? If some other thread intends to write to your non-atomic `data`, you have to establish synchronization or else you have a data race and UB. And I don't see how keeping it before another load can achieve that. In some sense you want the other thread to wait until the `c` load is complete before writing to `data`, but that other thread cannot directly detect whether a load has been done. [...] – Nate Eldredge Oct 14 '22 at 14:02
  • The only way for this thread to share that information is with a store, and so normally you would get the required ordering by making *that* store be release. In other words, there is a good reason why this functionality doesn't really exist, because it's unclear how you could safely make use of it. If you think you really do have a use case, then would you like to explain what the complete code looks like, and what it is supposed to accomplish? – Nate Eldredge Oct 14 '22 at 14:02
  • @NateEldredge, basically, `a` and `c` here are markers that no write operation happened between those loads. Later in the code I would compare `a == c` to check whether the counter changed during the operation (which would mean a store occurred in between). – Alekseev Artem Oct 14 '22 at 14:09
  • @NateEldredge, yeah, giving `load_non_atomic` the semantics of an atomic acquire load is what I need, but I'm a bit confused by the following scenario:

    ```c++
    // was there --- T b = load_non_atomic(data);
    std::atomic_thread_fence(std::memory_order_acquire);
    int c = A.load(std::memory_order_acquire);
    // --- moved here: T b = load_non_atomic(data);
    ```

    – Alekseev Artem Oct 14 '22 at 14:13
  • Does that mean there's a symmetric store/release associated with something that affects `load_non_atomic`? – Useless Oct 14 '22 at 14:14
  • Well, I couldn't format the code in comments :( I hope you can understand what I mean. – Alekseev Artem Oct 14 '22 at 14:15
  • @Useless the writer thread first increments the `A` atomic with release and then modifies the `data` variable. – Alekseev Artem Oct 14 '22 at 14:16
  • Well, the overall idea is pretty similar to how a seqlock works; you can find an example here: https://github.com/rigtorp/Seqlock The author also mentions this problem in the README and proposes writing inline asm for ARM, which I'm not a big fan of, so I really hope to find a standard way to do this. – Alekseev Artem Oct 14 '22 at 14:21
  • But see, this is just what I mean. If another thread can store to the non-atomic `data` "in between" loads `a` and `c`, you have a data race, and the entire behavior of your program has become undefined. It does you no good to test `a == c` afterwards to see if it happened. The damage is done and the [nasal demons](http://catb.org/jargon/html/N/nasal-demons.html) are already in flight. In particular, the consequences of this race are *not* limited to just getting an erroneous value in `data`; they can be arbitrarily worse than that. – Nate Eldredge Oct 14 '22 at 15:39
  • I'm afraid the author of the SeqLock example you linked is quite misguided about the C++ memory model, or at the very least, is relying on implementation details of some specific compilers. (Even on x86, where the hardware memory model would be okay with this, a compiler can still give you the wrong code, because the memory model it presents to you need not match the hardware.) If you're going to create a SeqLock in conforming C++, then the `data` has to be an atomic object as well. And at that point, you have the option of imposing acquire ordering on the load of `data`. – Nate Eldredge Oct 14 '22 at 15:44
  • Yeah, @NateEldredge, thanks, I see what you mean. A seqlock is a kind of algorithm with an intentional data race, and the C++ memory model doesn't need to be well defined for such algorithms, as they are UB from its point of view. I also found this question to be pretty interesting: https://stackoverflow.com/questions/56419723/which-of-these-implementations-of-seqlock-are-correct – Alekseev Artem Oct 14 '22 at 18:13
  • There is no barrier that would prevent that reordering, because compilers are always allowed to reorder your code when it wouldn't change the observable behaviour of the code. Whether `data` is loaded before or after `A` makes no difference as to the observable behaviour of the code. You can force the relative ordering of the loads to be considered observable behaviour by making both `data` and `A` volatile, but not by inserting a barrier. If you think I'm wrong, please explain why. – Brian Bi Oct 14 '22 at 21:56
  • @BrianBi: The relative order of two *atomic* (but not volatile) loads is also observable without UB, as in the standard LoadLoad litmus test. And appropriate barriers can force them to remain in a particular order. – Nate Eldredge Oct 15 '22 at 14:15
  • @NateEldredge Sure, if the first load has acquire semantics, then it can provide guarantees regarding the result of the second load. The compiler can't reorder the loads since that would fail to provide the same guarantee. I see no similar logic that is applicable to the OP's code. – Brian Bi Oct 15 '22 at 16:06
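Putting the comment thread's suggestions together, a sketch of what a conforming seqlock-style reader/writer pair could look like: `data` is made atomic (as Nate Eldredge suggests) so the retry loop has no data race, and the reader uses an acquire fence after the payload load, as discussed above. The odd/even counter protocol (odd = write in progress) is an assumption that goes beyond the original snippet, borrowed from the usual seqlock design:

```cpp
#include <atomic>
#include <cassert>

std::atomic<int> A{0};     // sequence counter: odd while a write is in progress
std::atomic<int> data{0};  // payload made atomic to avoid the data race discussed above

void writer(int v) {
    int s = A.load(std::memory_order_relaxed);
    A.store(s + 1, std::memory_order_relaxed);            // odd: write in progress
    std::atomic_thread_fence(std::memory_order_release);  // counter bump ordered before the payload store
    data.store(v, std::memory_order_relaxed);
    A.store(s + 2, std::memory_order_release);            // even: write complete, publishes the payload
}

int reader() {
    int a, b, c;
    do {
        a = A.load(std::memory_order_acquire);
        b = data.load(std::memory_order_relaxed);             // payload read
        std::atomic_thread_fence(std::memory_order_acquire);  // keeps the payload read before the re-check
        c = A.load(std::memory_order_relaxed);
    } while (a != c || (a & 1));  // retry if a write happened or was in progress
    return b;
}
```

The acquire fence is doing the work the question asks about: it prevents the loads before it (the payload read) from being reordered with the loads after it (the re-check of `A`), which is exactly the LoadLoad ordering a bare `DMB ISHLD` would provide on ARM, but expressed in standard C++.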

0 Answers