
C++11 specifies six memory orderings:

typedef enum memory_order {
    memory_order_relaxed,
    memory_order_consume,
    memory_order_acquire,
    memory_order_release,
    memory_order_acq_rel,
    memory_order_seq_cst
} memory_order;

https://en.cppreference.com/w/cpp/atomic/memory_order

where the default is seq_cst.

Performance gains can be had by relaxing the memory ordering of operations. However, this depends on what ordering guarantees the architecture provides. For example, Intel x86 has a strong memory model and guarantees that various load/store combinations will not be reordered.

As such, relaxed, acquire, and release seem to be the only orderings required when seeking additional performance on x86.

Is this correct? If not, is there ever a need to use consume, acq_rel and seq_cst on x86?

user997112

1 Answer


If you care about portable performance, you should ideally write your C++ source with the minimum necessary ordering for each operation. The only thing that really costs "extra" on x86 is mo_seq_cst for a pure store, so make a point of avoiding that even for x86.
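For example (a minimal sketch with hypothetical `data`/`ready` variables, not from the question): on x86 the release store compiles to a plain `mov`, while the default seq_cst store needs `xchg` or `mov` + `mfence`:

```cpp
#include <atomic>

std::atomic<int> data{0};
std::atomic<bool> ready{false};

void publish_release() {
    data.store(42, std::memory_order_relaxed);
    ready.store(true, std::memory_order_release);  // plain mov on x86
}

void publish_seq_cst() {
    data.store(42, std::memory_order_relaxed);
    ready.store(true);  // default seq_cst: xchg (or mov + mfence) on x86
}
```

Both versions are correct for a reader that does `ready.load(acquire)`; only the seq_cst one pays for draining the store buffer.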

(relaxed ops can also allow more compile-time optimization of the surrounding non-atomic operations, e.g. CSE and dead store elimination, because relaxed ops avoid a compiler barrier. If you don't need any order wrt. surrounding code, tell the compiler that fact so it can optimize.)
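A sketch of what that permits in principle (hypothetical `flag`/`plain` variables; real compilers vary in how aggressively they exploit this):

```cpp
#include <atomic>

std::atomic<int> flag{0};
int plain = 0;  // ordinary non-atomic variable

int read_flag_relaxed() {
    plain = 1;  // dead store: a relaxed load below is no compiler barrier,
                // so the compiler may eliminate this store entirely
    int f = flag.load(std::memory_order_relaxed);
    plain = 2;  // only this store needs to survive
    return f;
}
```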

Keep in mind that you can't fully test weaker orders if you only have x86 hardware, especially atomic RMWs with only acquire or release, so in practice it's safer to leave your RMWs as seq_cst if you're doing anything that's already complicated and hard to reason about correctness.

x86 asm naturally has acquire loads, release stores, and seq_cst RMW operations. Compile-time reordering is possible with weaker orders in the source, but after the compiler makes its choices, those are "nailed down" into x86 asm. (And stronger store orders require an mfence after mov, or using xchg. seq_cst loads don't actually have any extra cost, but it's more accurate to describe them as acquire because earlier stores can reorder past them, and all being acquire means they can't reorder with each other.)


There are very few use-cases where seq_cst is required (draining the store buffer before later loads can happen). Almost always a weaker order like acquire or release would also be safe.

There are artificial cases like https://preshing.com/20120515/memory-reordering-caught-in-the-act/, but even implementing locking generally only requires acquire and release ordering. (Of course taking a lock does require an atomic RMW, so on x86 that might as well be seq_cst.) One practical use-case I came up with was to have multiple threads set bits in an array. Avoid atomic RMWs and detect when one thread stepped on another by re-checking values that were recently stored. You have to wait until your stores are globally visible before you can safely reload them to check.
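The Preshing example above can be sketched like this (hypothetical `X`/`Y`/`r1`/`r2` names, following his store-then-load litmus test):

```cpp
#include <atomic>
#include <thread>

std::atomic<int> X{0}, Y{0};
int r1, r2;

void run_once() {
    std::thread a([] {
        X.store(1, std::memory_order_seq_cst);   // store buffer drained...
        r1 = Y.load(std::memory_order_seq_cst);  // ...before this load
    });
    std::thread b([] {
        Y.store(1, std::memory_order_seq_cst);
        r2 = X.load(std::memory_order_seq_cst);
    });
    a.join();
    b.join();
    // seq_cst forbids r1 == 0 && r2 == 0; acquire/release alone would not,
    // because each thread's load could pass its own buffered store.
}
```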

As such, relaxed, acquire and release seem to be the only orderings required on x86.

From one POV, in C++ source you don't require any ordering weaker than seq_cst (except for performance); that's why it's the default for all std::atomic functions. Remember you're writing C++, not x86 asm.

Or if you mean to describe the full range of what x86 asm can do, then it's acq for loads, rel for pure stores, and seq_cst for atomic RMWs. (The lock prefix is a full barrier; fetch_add(1, relaxed) compiles to the same asm as seq_cst.) x86 asm can't do a relaxed load or store^1.

The only benefit to using relaxed in C++ (when compiling for x86) is to allow more optimization of surrounding non-atomic operations by reordering at compile time, e.g. to allow optimizations like store coalescing and dead-store elimination. Always remember that you're not writing x86 asm; the C++ memory model applies for compile-time ordering / optimization decisions.

acq_rel and seq_cst are nearly identical for atomic RMW operations in ISO C++, I think no difference when compiling for ISAs like x86 and ARMv8 that are multi-copy-atomic. (No IRIW reordering like e.g. POWER can do by store-forwarding between SMT threads before a store commits to L1d). How do memory_order_seq_cst and memory_order_acq_rel differ?
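To illustrate (hypothetical `ctr` counter): both of these compile to the same `lock xadd` on x86:

```cpp
#include <atomic>

std::atomic<int> ctr{0};

// acq_rel RMW: lock xadd on x86
int bump_acq_rel() { return ctr.fetch_add(1, std::memory_order_acq_rel); }

// default seq_cst RMW: identical lock xadd, since lock is a full barrier
int bump_seq_cst() { return ctr.fetch_add(1); }
```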

For barriers, atomic_thread_fence(mo_acq_rel) compiles to zero instructions on x86, while fence(seq_cst) compiles to mfence or a faster equivalent (e.g. a dummy locked instruction on some stack memory). When is a memory_order_seq_cst fence useful?
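A sketch of the two fences (hypothetical `a`/`b` variables; the functional behavior is identical in a single thread, only the emitted asm differs):

```cpp
#include <atomic>

std::atomic<int> a{0}, b{0};

int fence_then_load() {
    a.store(1, std::memory_order_relaxed);
    // seq_cst fence: mfence or a dummy locked op on x86, blocks StoreLoad reordering
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return b.load(std::memory_order_relaxed);
}

int no_op_fence_then_load() {
    a.store(1, std::memory_order_relaxed);
    // acq_rel fence: zero instructions on x86; only restricts the compiler
    std::atomic_thread_fence(std::memory_order_acq_rel);
    return b.load(std::memory_order_relaxed);
}
```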

You could say acq_rel and consume are truly useless if you're only compiling for x86. consume was intended to expose the dependency ordering that most weakly-ordered ISAs do (notably not DEC Alpha). But unfortunately it was designed in a way that compilers couldn't implement safely so they currently just give up and promote it to acquire, which costs a barrier on some weakly-ordered ISAs. But on x86, acquire is "free" so it's fine.

If you actually do need efficient consume, e.g. for RCU, your only real option is to use relaxed and don't give the compiler enough information to optimize away the data dependency from the asm it makes. C++11: the difference between memory_order_relaxed and memory_order_consume.
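A sketch of that relaxed-instead-of-consume pattern (hypothetical `Node`/`head` names; note this relies on the compiler not optimizing away the data dependency, which ISO C++ does not guarantee for relaxed):

```cpp
#include <atomic>

struct Node { int payload; };
std::atomic<Node*> head{nullptr};

void publish(Node* n) {
    head.store(n, std::memory_order_release);
}

int reader() {
    // relaxed load as a stand-in for consume: the dereference below is
    // data-dependent on p, which most weakly-ordered ISAs order for free
    Node* p = head.load(std::memory_order_relaxed);
    return p ? p->payload : -1;
}
```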


Footnote 1: I'm not counting movnt as a relaxed atomic store because the usual C++ -> asm mapping for release operations uses just a mov store, not sfence, and thus would not order an NT store. i.e. std::atomic leaves it up to you to use _mm_sfence() if you'd been messing around with _mm_stream_ps() stores.

PS: this entire answer is assuming normal WB (write-back) cacheable memory regions. If you just use C++ normally under a mainstream OS, all your memory allocations will be WB, not weakly-ordered WC or strongly-ordered uncacheable UC or anything else. In fact even if you wanted a WC mapping of a page, most OSes don't have an API for that. And std::atomic release stores would be broken on WC memory, weakly-ordered like NT stores.

Peter Cordes
  • Picking a nit: `seq_cst` is not "*the default for everything*", it's only the default for operations on *atomic* objects. For everything else, the default is "good luck", except where something *synchronizes* with something else, to give a (brief) moment of clarity. – Chris Hall May 11 '20 at 18:13
  • @ChrisHall: fair point, edited to avoid that possible misinterpretation. What I meant is that it's the default for everything that takes a `std::memory_order` parameter and has a default. – Peter Cordes May 11 '20 at 18:24
  • @PeterCordes I assume RMW includes compare_exchange_weak, so never use relax/acquire/release with this on x86, just leave it as the default seq_cst? – user997112 May 12 '20 at 15:45
  • Yes, the question is meant from the perspective of performance. – user997112 May 12 '20 at 15:55
  • @user997112: You wanted to know which order you should actually use in your own C++ source? I wish you'd said so in the first place, that's way easier to answer. Added a section at the top about that. Anyway yes, `lock cmpxchg` is an atomic RMW and unconditionally dirties the cache line. There's usually no reason to actually use `seq_cst` RMWs though; it doesn't make your code faster on x86, just *not slower*. – Peter Cordes May 12 '20 at 16:02
  • @PeterCordes Apologies, writing it at the time I obviously knew what I meant, but perhaps didn't communicate it well. I was asking if there's ever a need on x86 to use the other 3 memory orderings (when wanting to gain performance). – user997112 May 12 '20 at 23:38
  • @user997112: you can edit your question to clarify. I thought you were just asking if there was a meaningful difference between them for any operations. Obviously (to me) `seq_cst` isn't going to make anything faster, and can do things that acq and rel can't do, so I assumed you must be wondering something else. – Peter Cordes May 12 '20 at 23:44
  • @PeterCordes done. Is that clearer? So i'm interested in increasing performance. I only code for x86, therefore do I only need to understand seq_cst, relaxed, release and acquire? – user997112 May 14 '20 at 02:08
  • @user997112: Yeah, good edit. And yes, AFAIK `acq_rel` will compile the same as `seq_cst` on x86. I think the main difference (on a weakly-ordered ISA like POWER) is that `acq_rel` doesn't block IRIW reordering, but on x86 all cores always agree on a total order for all stores. And `consume` is also only relevant to weakly-ordered ISAs (dependency-ordering without full acquire), but compilers promote it to `acquire` anyway because the ISO C++ definition of it is too hard to implement safely. x86 will never benefit from `consume` even once compilers learn to do it on ARM / POWER / etc. – Peter Cordes May 14 '20 at 02:13
  • For rare cases; you may be working with "memory" with very different properties (e.g. for low level graphics/video driver work the entire frame buffer may be write-combining and have a much weaker memory ordering). For 80x86 it's also possible (using page table flags/PAT) to modify the rules for any normal RAM; which can (in theory, given that most operating systems don't support it) be beneficial in some specialized cases (e.g. to avoid cache pollution). – Brendan May 14 '20 at 04:09
  • @Brendan: I decided not to mention that because if you do that, you can't just use `std::atomic` anymore. But thinking more about it, for actual device memory, `mov`+`mfence` could be a lot different from `xchg` in performance or even side effects for SC stores. So you're at the mercy of implementation details if you unwisely use std::atomic as a way to get write barriers emitted. Plus you'll break the std::atomic C++ ordering guarantees if you use release stores on weakly-ordered WC memory, so like I said unusable. Maybe there's room to mention this stuff in a section or footnote? IDK. – Peter Cordes May 14 '20 at 04:17
  • @Brendan: This Q&A is really for the benefit of people using `std::atomic` on x86, which currently means user-space; at least Linux rolls its own atomics, and I assume the other major kernels do, too. Although I did mention NT stores, so I can expand that footnote. thanks for the suggestion. – Peter Cordes May 14 '20 at 04:19