Since Intel provides a strong hardware memory model, is there any advantage at all to using "memory_order_relaxed" in a C++11 program? Or should I just leave everything at the default ("sequentially consistent"), since it makes no difference?
-
To paraphrase Stroustrup, the non-default orderings are meant for experts. – DanielKO Jan 20 '15 at 16:29
-
The downside to using things like memory_order_relaxed on an x86 is that you have no idea whether the code would still be correct when ported to another architecture, because you cannot test it until you port it. – Nevin Jan 21 '15 at 22:46
-
It's also worth mentioning that memory orders other than relaxed act as compiler barriers too, preventing reordering of memory accesses during optimization. – Pezo Jun 15 '19 at 16:54
-
Does this answer your question? [Are memory orderings: consume, acq\_rel and seq\_cst ever needed on Intel x86?](https://stackoverflow.com/questions/61719680/are-memory-orderings-consume-acq-rel-and-seq-cst-ever-needed-on-intel-x86) – user Jan 23 '21 at 11:45
-
See also [this answer](https://stackoverflow.com/a/53805377/3075942). – user Feb 01 '21 at 14:31
2 Answers
Like most answers in computer science, the answer to this is "that depends."
First of all, the idea that sequentially consistent ordering never carries any penalty is incorrect. Depending on your code (and possibly compiler), it can and will carry a penalty.
Second, to make intelligent decisions about the memory ordering constraints, you need to think about (and understand) how you're using the data involved.
memory_order_relaxed is useful for something like a standalone counter that needs to be atomic, but isn't directly related to anything else, so it doesn't need to be consistent with any "something else". The typical example would be a reference count, such as in shared_ptr or some older implementations of std::string. In this case, we just need to assure that the counter is incremented and decremented atomically, and that modifications to it are visible to all threads. But, particularly, there's no related data with which it needs to remain consistent, so we don't care much about its ordering with respect to anything else.
Sequentially consistent ordering is pretty much at the opposite extreme. It's probably the easiest to apply: you write the code just about as if it were single-threaded, and the implementation assures that it works correctly. That's not to say you don't have to take threading into account at all, but sequentially consistent ordering generally requires the least thought about it. It is also generally the slowest model.
Acquire/release consistency is normally used when you have two or more related pieces of information, and you need to assure that one only becomes visible before/after the other. For one example that I dealt with recently, let's assume you're building something roughly like an in-memory database. You have some data, and you have some metadata (and you're storing each more or less separately).
The metadata is used (among other things) for searching the database. We want to assure that if somebody finds some particular data that the data they found will actually be present in the database.
To assure this, we want to assure that the data is always present before the metadata and continues to exist at least as long as the metadata. The database would be inconsistent if somebody could search the database using the metadata, and find some data it wants to use, when that data isn't actually present.
So in this case, when we're adding a record, we need to assure that we add the data first, then add the metadata--and the compiler must not rearrange the two. Likewise, when we're deleting a record, we need to delete the metadata (so nobody will find the data), then delete the data itself. In the case of the data itself, chances are we have a reference count to keep track of how many clients are currently accessing that data, to assure that we don't delete it while somebody is trying to use it.
So in this case, we can use acquire/release semantics for the metadata and data, and relaxed ordering for the reference count. Or, if we want to keep our code as simple as possible, we could use sequential consistency throughout--even though it might (and probably will) carry at least some penalty.

-
`mo_seq_cst` still has extra cost on x86 for pure-store operations, like `var = 1`! It's a full memory barrier (e.g. using `xchg` for a seq-cst store, or usually worse, `mov` + `mfence`, instead of just a plain `mov`). On x86, `mo_release` / `mo_acquire` is "free" in asm vs. `mo_relaxed`. The only cost (with current compilers) is possibly blocking compile-time reordering with non-atomic variables. – Peter Cordes Jun 15 '19 at 17:17
-
@PeterCordes: Yeah--what I originally wrote was based solely on the consequences that would result if the OP's assumption was true. Unfortunately it wasn't, so most of the answer was invalid. I've rewritten to (I think) make it a bit more accurate. – Jerry Coffin Jun 16 '19 at 17:17
Always use the minimum guarantees you need to make your code correct.
No more, and no less.
That way, you can avoid any unnecessary dependencies on the implementation, thus reducing porting costs, and you will still get the fastest program possible.
Of course, if you are sure you won't ever care about porting any of your code, taking stronger guarantees where you know it won't matter on your platforms may make proving it correct easier.
Being harder to misuse, easier to reason about, or shorter are perfectly acceptable reasons for using less performant constructs, too.

-
If you don't have a weakly-ordered ISA to actually test on, it's arguably dangerous to use weaker orderings that you can't test. `mo_release` / `mo_acquire` / `mo_acq_rel` is the weakest you can test on x86. (But it is cheaper for pure stores than `mo_seq_cst`.) – Peter Cordes Jun 15 '19 at 17:19