
For a weakly ordered memory model such as ARM's, how does the following code work?

Note: I'm aware that memory ordering only matters between threads running on different cores, and that the code below runs on a single core; I still have the questions below.

#include <atomic>
#include <cassert>
#include <csignal>

std::atomic_bool a {false};
std::atomic_bool b {false};

void signal_handler(int)
{
  if (b.load(std::memory_order_relaxed))
  {
    // Acquire signal fence: pairs with the release signal fence in main().
    std::atomic_signal_fence(std::memory_order_acquire);
    assert(a.load(std::memory_order_relaxed));
  }
}

int main()
{
  std::signal(SIGINT, &signal_handler);

  a.store(true, std::memory_order_relaxed);
  // Release signal fence: stops the compiler from reordering the two stores.
  std::atomic_signal_fence(std::memory_order_release);
  b.store(true, std::memory_order_relaxed);
}

std::atomic_signal_fence only guarantees ordering at the compiler level; it emits no hardware barrier instruction.
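For illustration, here is a minimal sketch contrasting the two fence types. The comments about emitted instructions describe typical GCC/Clang code generation for AArch64 and x86-64; they are an assumption about particular compilers and targets, not something the standard mandates.

#include <atomic>

std::atomic_bool x {false};

void with_signal_fence()
{
  x.store(true, std::memory_order_relaxed);
  // Compiler barrier only: no instruction is emitted for the fence itself;
  // it merely stops the compiler from moving the stores across it.
  std::atomic_signal_fence(std::memory_order_release);
  x.store(true, std::memory_order_relaxed);
}

void with_thread_fence()
{
  x.store(true, std::memory_order_relaxed);
  // Compiler + hardware barrier: on AArch64 this typically emits "dmb ish";
  // on x86-64 a release fence needs no instruction because of TSO.
  std::atomic_thread_fence(std::memory_order_release);
  x.store(true, std::memory_order_relaxed);
}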

Question 1: This is what I understand: out-of-order execution can reorder a.store() and b.store(), i.e. the store buffer could already hold b == true while the store to a has not yet executed (so a is still false). If the signal handler runs at that point, the assert would fail (the handler would read the value from the store buffer via store-to-load forwarding). I know this cannot happen on x86, since stores are not reordered with other stores, but I'm not sure how it works on ARM. What does the CPU do to make sure the signal handler sees the effects of the code in program order?
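To make the single-core scenario concrete, here is a runnable variant of the code above; the std::raise(SIGINT) call is my addition (it is not in the original code) and delivers the signal synchronously in the same thread, which is the case the signal fences are meant to cover.

#include <atomic>
#include <cassert>
#include <csignal>

std::atomic_bool a {false};
std::atomic_bool b {false};

void handler(int)
{
  if (b.load(std::memory_order_relaxed))
  {
    // If the handler observed b == true, it must also observe a == true.
    std::atomic_signal_fence(std::memory_order_acquire);
    assert(a.load(std::memory_order_relaxed));
  }
}

int main()
{
  std::signal(SIGINT, &handler);

  a.store(true, std::memory_order_relaxed);
  std::atomic_signal_fence(std::memory_order_release);
  b.store(true, std::memory_order_relaxed);
  std::raise(SIGINT);  // run the handler in this thread, after both stores
}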

Question 2: In multi-threaded code, if no memory barriers are used, can the store buffer be flushed to the cache out of order? My assumption is that it can, otherwise no memory barriers would be needed. However, the definition of out-of-order execution I read says: "OoOE processors fill these 'slots' in time with other instructions that are ready, then re-order the results at the end to make it appear that the instructions were processed as normal." I'm not sure whether that description only applies to x86, which follows a strong memory model.
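As a point of comparison for the multi-threaded case, here is a minimal message-passing sketch of my own (not from the original code): with relaxed operations on the flag, a weakly ordered CPU such as ARM is allowed to make the flag store visible before the data store, so the assert could fire; with release/acquire, as written below, that outcome is forbidden.

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int>  data {0};
std::atomic<bool> flag {false};

void producer()
{
  data.store(42, std::memory_order_relaxed);
  // Release: the data store may not become visible after the flag store.
  flag.store(true, std::memory_order_release);
}

void consumer()
{
  // Acquire: spin until the flag is observed.
  while (!flag.load(std::memory_order_acquire))
    ;
  // With release/acquire this cannot fail; if both flag accesses were
  // relaxed, a weakly ordered CPU could legitimately let it fail.
  assert(data.load(std::memory_order_relaxed) == 42);
}

int main()
{
  std::thread t1(producer), t2(consumer);
  t1.join();
  t2.join();
}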

AdvSphere
  • @C.M We are on the same page; your comment matches what I'm saying. :) – AdvSphere Apr 15 '19 at 21:17
  • @AdvSphere ...and you need compiler-level fences only in the case of asynchronous signal delivery (i.e. *not* the `raise()` case) – C.M. Apr 15 '19 at 21:17
  • @AdvSphere Yep, I think we are on the same page :) – C.M. Apr 15 '19 at 21:18
  • @C.M For a signal handler run by the same thread, std::atomic_signal_fence is enough, regardless of the architecture (ARM, x86). For a signal handler in another thread on another core, use C++ acquire, release, or sequentially consistent operations. At the instruction level, x86 only issues an MFENCE for sequentially consistent; on ARM, memory barrier instructions must be issued for acquire, release and sequentially consistent alike. Obviously that's done by the compiler. – AdvSphere Apr 15 '19 at 21:20
  • @old_timer The question is not about the C++ standard but about how weak memory ordering works on ARM. How to use the standard is clear; this is about the implementation level. – AdvSphere Apr 15 '19 at 21:24
  • @C.M Thank you for your support and info! – AdvSphere Apr 15 '19 at 21:24
  • @old_timer When you mention that you don't see the disassembly showing out-of-order instructions, that's not the relevant part. The relevant part is how consistency is kept in a signal handler: even though the code looks ordered as emitted by the compiler, it could be executed out of order by the CPU's out-of-order execution unit. Please read the comments above; it seems that delivering the signal drains the CPU pipeline, at which point all of the out-of-order instructions must be retired, which is what lets a signal handler running on the same core as the thread see the stores in order. – AdvSphere Apr 15 '19 at 21:29
  • @AdvSphere No probs, I've learned something too. Just a note: in the multi-threaded case on x86 you don't really need MFENCE; if the compiler doesn't reorder the accesses to `a` and `b` (which is why you still need a compiler-level fence), you'll be alright because of x86's strong consistency (and the fact that you use std::atomic_bool -- this should cause a LOCK prefix to be emitted even for relaxed access). – C.M. Apr 15 '19 at 21:30
  • @C.M Yes, that's true. When using C++ acquire and release, if you look at the generated assembly, the x86 instructions don't use any memory barriers; that's because of TSO. However, when std::memory_order_seq_cst is used, an MFENCE is emitted on x86, since the store buffer must be flushed (see the codegen sketch after this comment thread). Check out this link: https://stackoverflow.com/questions/55231677/x86-mfence-and-c-memory-barrier – AdvSphere Apr 15 '19 at 21:33
  • @AdvSphere Huh... you got me confused with `mfence` for a second, but then I realized that you use GCC, which (unlike MSVC) chose `mov+mfence` instead of `xchg` (which always implies `LOCK`) to implement the related parts of the memory model (see Peter's answer in your link for details). Anyway, [here](https://godbolt.org/z/jdlagG) is how you could write your code in a platform-independent way; I believe it is more readable :) – C.M. Apr 15 '19 at 21:50
  • @C.M godbolt.org is the site I used in the link I provided above (I'm the OP of that question too); it's an excellent resource! Thank you for sharing your thoughts again! – AdvSphere Apr 16 '19 at 14:18
  • @AdvSphere No probs, have fun! :) – C.M. Apr 16 '19 at 18:17
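
Following up on the MFENCE discussion in the comments above, here is a small sketch that could be fed to a compiler explorer. The comments describe typical GCC output for x86-64 and AArch64; this is an observation about particular compilers and targets, not a guarantee of the standard.

#include <atomic>

std::atomic_bool flag {false};

void store_release()
{
  // x86-64: a plain mov is enough (TSO already provides release ordering).
  // AArch64: typically an stlr (store-release) instruction.
  flag.store(true, std::memory_order_release);
}

void store_seq_cst()
{
  // x86-64: GCC has historically emitted mov + mfence here; other compilers
  // (and newer GCC) use xchg, which implies LOCK.
  // AArch64: typically stlr as well, with seq_cst loads compiled to ldar.
  flag.store(true, std::memory_order_seq_cst);
}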

0 Answers