std::memory_order_acquire fence necessary on x86?

Question

Given x86 has a strong memory model, is std::memory_order_acquire fence (not operation) necessary?

For example, if I have this code:

uint32_t read_shm(const uint64_t offset) {
   // m_head_memory_location is char* pointing to the beginning of a mmap-ed named shared memory segment
   // a different process on different core will write to it.
   return *(uint32_t*)(m_head_memory_location + offset);
}
....
int main() {
     uint32_t val = 0;
     while (0 != (val = shm.read(some location)));
     .... // use val
}

Do I really need std::atomic_thread_fence(std::memory_order_acquire) before the return statement?

I feel it is not necessary because the goal of the code above is trying to read the first 4 bytes from m_head_memory_location + offset, and so any memory operations after fences being reordered doesn't affect the outcome.

Or there is some side effect making the acquire fence necessary?

Is there any case that an acquire fence (not operation) is necessary on x86?

Any input is welcome.

Remember, you are programming against the abstract machine, not the actual hardware. If you need a fence on the abstract machine, you should have on in your code. The compiler can then decide that it doesn't actually need to do it since it know which platform it is compiling the code for. — NathanOliver, Dec 18 '19 at 15:19
@NathanOliver-ReinstateMonica I agree. Just trying to understand the effect/use case of `std::memory_order_acquire` on x86 in particular and in my particular example above. Thanks for your reminder tho :) — HCSF, Dec 18 '19 at 15:31
Can you rewrite your code so it's clear that `read` isn't `::read`? — curiousguy, Dec 20 '19 at 04:18
Do you use C++11 as a shorthand for a specific, old C++ version, or for any C++ std not older than C++11? — curiousguy, Jan 16 '20 at 23:10

Peter Cordes · Answer 1 · 2019-12-18T16:27:36.673

6

return *(uint32_t*)(m_head_memory_location + offset);

You cast to non-atomic non-volatile uint32_t* and dereference!!!

The compiler is allowed to assume that this uint32_t object isn't written by anything else (i.e. assume no data-race UB), so it can and will hoist the load out of the loop, effectively transforming it into something like if((val=load) == 0) infinite_loop();.

A GCC memory barrier will force a reload, but this is an implementation detail for std::atomic_thread_fence(std::memory_order_acquire). For x86, that barrier only needs to block compile-time reordering, so a typical implementation for GCC might be asm("" ::: "memory").

It's not the acquire ordering that's doing anything, it's the memory clobber that stops GCC from assuming another read will read the same thing. That's not something ISO C++ std::atomic_thread_fence(std::memory_order_acquire) implies for non-atomic variables. (And it's always implied for atomic and volatile). So like I said, this would work in GCC but only as an implementation detail.

It's also strict-aliasing UB if this memory is ever accessed with other types than this an char*, or if the underlying memory was declared as a char[] array. If you got a char* from mmap or something then you're fine.

It's also possible misalignment UB unless offset is known to be a multiple of 4. (Although unless GCC chooses to auto-vectorize, this won't bite you in practice on x86.)

You can solve these two for GNU C with typedef uint32_t unaligned_u32 __attribute((may_alias, aligned(1))); but you still need volatile or atomic<T> for reading in a loop to work.

In general

Use std::atomic_thread_fence(std::memory_order_acquire); as required by the C++ memory model; that's what governs reordering at compile time.

When compiling for x86, it won't turn into any asm instructions; in asm it's a no-op. But if you don't tell the compiler it can't reorder something, your code might break depending on compiler optimization level.

You might get lucky and have the compiler do a non-atomic load after an atomic mo_relaxed load, or it might do the non-atomic load earlier if you don't tell it not to.

edited Dec 18 '19 at 16:27

answered Dec 18 '19 at 15:59

Peter Cordes

328,167
45
605
847

Thanks for the detailed answer. So you are saying `std::atomic_thread_fence(std::memory_order_acquire)` doesn't force a reload? Regarding using `volatile` keyword for shared memory, Ian is [against it](https://gcc.gnu.org/ml/gcc-help/2012-03/msg00242.html). And as the original poster pointed out that using volatile caused [other issues](https://gcc.gnu.org/ml/gcc-help/2012-03/msg00239.html). Shouldn't `asm("" ::: "memory")` w/o `volatile` be enough as it forces a reload? – HCSF Dec 18 '19 at 16:45
The main issue is that C++ standard doesn't really mention how C++ memory model works with shared memory. Hence, the solution for shared memory is likely relying on some implementation specific stuffs... – HCSF Dec 18 '19 at 16:47
1

@HCSF: Extended `asm` with not outputs is implicitly `volatile`. The C++ standard says that lock-free atomics *should* work on shared memory. Casting to `atomic*` would work because it's always-lock-free on x86. Make sure it's aligned. (The standard says that they should be address-free so they work even if two processes map the same memory to different virtual addresses. This is trivially true for normal CPUs where hardware handles coherence based on physical address.) Of course you shouldn't use `volatile`; my answer even says so. But it would work in practice. – Peter Cordes Dec 18 '19 at 16:52
I am sure you are right. Do you have section # in the standard I can read more about "C++ standard says that lock-free atomics should work on shared memory" and "Casting to atomic* would work"? As I just searched for "shared memory" in n4713 draft and I only found 2 occurrences 6.8.2.1/22 and 6.8.2.1/23. – HCSF Dec 18 '19 at 17:09
@HCSF: [Are lock-free atomics address-free in practice?](//stackoverflow.com/q/51463312) quotes the relevant part of the standard. It doesn't use the phrase "shared memory". Also [Is C++11 atomic usable with mmap?](//stackoverflow.com/q/18321244) Oh lol, you quoted the relevant part yourself in [Two Different Processes With 2 std::atomic Variables on Same Address?](//stackoverflow.com/q/51229208) – Peter Cordes Dec 18 '19 at 17:17
Lol. Yes I did long time ago. Reading my post again refreshed memory that it seems I [can't just cast](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4013.html). Thought? – HCSF Dec 18 '19 at 17:31
@HCSF: that's about doing some atomic operations on data that's *also* accessed as a non-atomic object by the same program. You can just cast if you *always* access it as an `atomic`. – Peter Cordes Dec 18 '19 at 17:35
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/204516/discussion-between-hcsf-and-peter-cordes). – HCSF Dec 19 '19 at 02:47

std::memory_order_acquire fence necessary on x86?

1 Answers1

In general