
Consider the following code:

std::atomic<int> counter;

/* otherStuff 1 */
counter.fetch_add(1, std::memory_order_relaxed);
/* otherStuff 2 */

Is there an instruction in x86-64 (say, on architectures less than 5 years old) that would allow otherStuff 1 and 2 to be re-ordered across the fetch_add, or is it always going to be serializing?

EDIT:

It looks like this is summarized by "is lock add a memory barrier on x86?" and it seems it is not, though I am not sure where to find a reference for that.

curiousguy
J.N.
  • It doesn't make much sense to talk about the C++ atomic primitives and specific architectures. You've specified a `std::memory_order_relaxed` constraint and that's all the compiler will necessarily give you. In particular the compiler itself may re-order otherStuff 1/2 with the `fetch_add`. `std::atomic::fetch_add` isn't a serializing operation, regardless of architecture. If you need it to prevent re-ordering, use `memory_order_cst`; that's what it's there for. – davmac Jun 28 '18 at 15:26
  • (Correction: `memory_order_seq_cst`, not `memory_order_cst`). – davmac Jun 28 '18 at 15:37
  • What makes no sense? I want to know whether C++ offers constructs that the hardware cannot satisfy. Will I get sequential consistency even though I did not ask for it? I am told that x86 never reorders writes. Note: I do *not* want it. I want cheap atomic increments. – J.N. Jun 29 '18 at 00:36
  • "I want to know whether c++ offers constructs that the hardware cannot satisfy" - but a `memory_order_relaxed` constraint is satisfied whether or not the surrounding loads/stores are re-ordered w.r.t. the atomic operation, so your question is not asking that. "I want cheap atomic increments" - the compiler will generally generate as cheap as possible an instruction for the constraints you give it. You are asking if you can use a looser constraint and expect semantics of a tighter constraint, but _it doesn't work like that_ - that's what makes no sense. – davmac Jun 29 '18 at 07:15
  • I.e. if you don't want re-ordering, you can use `memory_order_acq_release` in this case to prevent it. If you want to allow re-ordering, use `memory_order_relaxed`. On x86-64 they will most likely generate the same instruction, but in the latter case the compiler has more leeway to move things around. There's no good reason to choose a looser constraint when you require the semantics of a tighter constraint. – davmac Jun 29 '18 at 07:30
  • @J.N. "_What makes no sense ?_" In general, mixing high level and low level specification, they don't mix well. You can't reason about a high level C/C++ program from a low level POV. (A high level program is one that has any object access which isn't a volatile access.) – curiousguy Dec 12 '18 at 00:07

3 Answers


First let's look at what the compiler is allowed to do when using std::memory_order_relaxed.
If there are no dependencies between otherStuff 1/2 and the atomic operation, it can certainly reorder the statements. For example:

g = 3;
a.fetch_add(1, memory_order_relaxed);
g += 12;

clang++ generates the following assembly:

lock   addl $0x1,0x2009f5(%rip)        # 0x601040 <a>
movl   $0xf,0x2009e7(%rip)             # 0x60103c <g>

Here clang took the liberty to reorder g = 3 with the atomic fetch_add operation, which is a legitimate transformation.

When using std::memory_order_seq_cst, the compiler output becomes:

movl   $0x3,0x2009f2(%rip)        # 0x60103c <g>
lock   addl $0x1,0x2009eb(%rip)   # 0x601040 <a>
addl   $0xc,0x2009e0(%rip)        # 0x60103c <g>

Reordering of statements does not take place because the compiler is not allowed to do that. A sequentially consistent read-modify-write (RMW) operation is both a release and an acquire operation, and as such no (visible) reordering of statements is allowed at either the compiler or the CPU level.

Your question is whether, on X86-64, std::atomic::fetch_add, using relaxed ordering, is a serializing operation.
The answer is: yes, if you do not take compiler reordering into account.

On the X86 architecture, an RMW operation always flushes the store buffer and therefore is effectively a serializing and sequentially consistent operation.

You can say that, on an X86 CPU, each RMW operation:

  • is a release operation for memory operations that precede it and is an acquire operation for memory operations that follow it.
  • becomes visible in a single total order observed by all threads.
LWimsey
  • In x86 terminology, "serializing" has a specific meaning which is stronger than just serializing memory. A "serializing instruction" like `cpuid` serializes the whole pipeline, including blocking out-of-order execution of non-memory instructions as well as being a full memory barrier ([How many memory barriers instructions does an x86 CPU have?](https://stackoverflow.com/a/50332912)). Anyway yes, `lock`ed operations are always full memory barriers, and there's no other way to do an atomic-RMW. +1 for compile-time reordering: the C++ memory model applies even when compiling for x86. – Peter Cordes Jun 29 '18 at 06:07
  • Your final two bullet points are weird points because *all* x86 loads and stores do those things. (Unless you use NT stores, or WC memory regions, but that doesn't happen unless you do it on purpose.) All stores (not just the write half of an RMW) become visible in a single total order observed by all threads. And that total order is some interleaving of program order, because StoreStore reordering isn't allowed. – Peter Cordes Jun 29 '18 at 18:18
  • @PeterCordes About your last comment, the 2 bullet points are related; the point is to show that, **on CPU level**, RMW's are seq/cst operations and therefore, a) have implied acquire/release semantics, b) respect a single, total order - that is, all threads observe those operations in the same order. That is _not_ the case with plain loads and stores, since #StoreLoad reordering is a real thing on X86 – LWimsey Jun 29 '18 at 23:09
  • The *only* difference between normal x86 loads and stores (acq/rel) and what you said about RMW is that the loads are also part of the global total order. That's a very subtle distinction, and if you want to make it you could do so more clearly, IMO. (In [Globally Invisible load instructions](https://stackoverflow.com/a/50617986), I said some interesting stuff about visibility of regular loads). The first point is basically just saying that the load and store parts of RMW operations are at least as strong as regular loads/stores, and you could say that more clearly. – Peter Cordes Jun 29 '18 at 23:17
  • @PeterCordes I agree with your point that the bullet points do not directly pertain to the actual question. I'll make an edit later (first things first, going out for a bike ride now) – LWimsey Jun 29 '18 at 23:44
  • "_as such, no (visible) reordering of statements is allowed on both compiler and CPU level_" Correct for those statements that are globally visible, incorrect for those that are local to the thread. While that special case would not affect correctness of MT code, it would affect its timing. – curiousguy Dec 12 '19 at 03:15
  • @LWimsey "_RMW's are seq/cst operations_" No, consistency is a property of a program, not of an instruction. – curiousguy Dec 12 '19 at 03:19
  • "_Reordering of statements does not take place because the compiler is not allowed to do that._" Why not? – curiousguy Dec 12 '19 at 08:04
  • @curiousguy _"RMW's are seq/cst operations"_ I was referring to how this behaves on `X86` – LWimsey Dec 18 '19 at 16:17
  • @curiousguy _"Why not?"_ - Because a seq_cst RMW is both an acquire and a release operation. The compiler will not move operations in either direction. – LWimsey Dec 18 '19 at 16:19
  • @LWimsey "_how this behaves on X86_" The behavior of which set of instructions on x86? – curiousguy Dec 18 '19 at 19:16
  • "_as such, no (visible) reordering of statements is allowed on both compiler and CPU level._" How is that statement not true in the relaxed case? In each and every cases? When is a reordering "visible" *and* allowed? – curiousguy Dec 19 '19 at 04:11

The target architecture

On the X86 architecture, an RMW operation always flushes the store buffer and therefore is effectively a serializing and sequentially consistent operation.

I wish people would stop saying that.

That statement doesn't even make sense, as there is no such thing as a "sequentially consistent operation"; "sequential consistency" isn't a property of any single operation. A sequentially consistent execution is one whose end result could have been produced by some interleaving of the operations.

What can be said about these RMW operations:

  • all operations before the RMW have to be globally visible before either the R or the W of the RMW is visible,
  • and no operation after the RMW is visible before the RMW is visible.

That is, the part before the RMW, the RMW itself, and the part after run sequentially. In other words, there is a full fence before and after the RMW.

Whether that results in a sequential execution for the complete program depends on the nature of all globally visible operations of the program.

Visibility vs. execution ordering

That's in terms of visibility. I have no idea whether these processors try to speculatively execute code after the RMW, subject to the correctness requirement that operations are rolled back if they conflict with a side effect of a parallel execution (these details tend to differ between vendors and generations within a given family, unless clearly specified).

The answer to your question could differ depending on whether you need to:

  • guarantee correctness of the set of side effects (as in a sequential consistency requirement),
  • or guarantee that benchmarks are reliable,
  • or guarantee that comparative timings are CPU-version independent: that is, guarantee something about comparisons of the timing of different executions (on a given CPU).

High level languages vs. CPU features

The question title is "is std::atomic::fetch_add a serializing operation on x86-64?", which is of the general form:

"does OP provide guarantees P on ARCH"

where

  • OP is a high level operation in a high level language
  • P is the desired property
  • ARCH is a specific CPU or compiler target

As a rule, the canonical answer is: the question doesn't make sense, OP being high level and target independent. There is a low level/high level mismatch here.

The compiler is bound by the language standard (or rather its most reasonable interpretation), by documented extension, by history... not by the standard for the target architecture, unless the feature is a low level, transparent feature of the high level language.

The canonical way to get low level semantics in C/C++ is to use volatile objects and volatile operations.

Here you must use a volatile std::atomic<int> to even be able to ask a meaningful question about architectural guarantees.

Present code generation

The meaningful variant of your question would use this code:

volatile std::atomic<int> counter;

/* otherStuff 1 */
counter.fetch_add(1, std::memory_order_relaxed);

That statement will generate an atomic RMW operation which, in that case, "is a serializing operation" on the CPU: all operations performed before it, in assembly code, are complete before the RMW starts; all operations following the RMW wait until the RMW completes before starting (in terms of visibility).

And then you would need to learn about the unpleasantness of the volatile semantics: volatile applies only to those volatile operations, so you would still not get general guarantees about other operations.

There is no guarantee that high level C++ operations before the volatile RMW operations are sequenced before in the assembly code. You would need a "compiler barrier" to do that. These barriers are not portable. (And not needed here as it's a silly approach anyway.)

But then if you want that guarantee, you can just use:

  • a release operation: to ensure that previous globally visible operations are complete,
  • an acquire operation: to ensure that following globally visible operations do not start before it,
  • an RMW operation on an object that is visible to multiple threads.

So why not make your RMW operation acq_rel? Then it wouldn't even need to be volatile.

Possible RMW variants in a processor family

Is there an instruction in x86-64 (say less than 5 years old architectures) that would [...]

Potential variants of the instruction set are another sub-question. Vendors can introduce new instructions, and ways to test for their availability at runtime; and compilers can even generate code to detect their availability.

Any RMW feature that would follow the existing tradition (1) of strong ordering of usual memory operations in that family would have to respect these traditions:

  • Total Store Order: all store operations are ordered, implicitly fenced; in other words, there is a store buffer strictly for non speculative store operations in each core, that is not reordered and not shared between cores;
  • each store is a release operation (for previous normal memory operations);
  • loads that are speculatively started are completed in order, and at completion are validated: any early load for a location that was then clobbered in the cache is cancelled and the computation is restarted with the recent value;
  • loads are acquire operations.

Then any new (but traditional) RMW operation must be both an acquire operation and a release operation.

(Examples for potential imaginary RMW operation to be added in the future are xmult and xdiv.)

But that's futurology, and adding less-ordered instructions in the future wouldn't violate any security invariants, except potentially against timing-based, side-channel, Spectre-like attacks, which we don't know how to model and reason about in general anyway.

The problem with these questions, even about the present, is that a proof of absence would be required, and for that we would need to know about each variant for a CPU family. That is not always doable, and also, unnecessary if you use the proper ordering in the high level code, and useless if you don't.

(1) Traditions for guarantees of memory operations are guidelines in the CPU design, not guarantees about any future operation: by definition, operations that don't yet exist have no guarantee about their semantics, besides the guarantees of memory integrity, that is, the guarantee that no future operation will break the privileges and security guarantees previously established (no unprivileged instruction created in the future can access an unmapped memory address...).

curiousguy

When using std::memory_order_relaxed the only guarantee is that the operation is atomic. Anything around the operation can be reordered at will by either the compiler or the CPU.

From https://en.cppreference.com/w/cpp/atomic/memory_order:

Relaxed operation: there are no synchronization or ordering constraints imposed on other reads or writes, only this operation's atomicity is guaranteed (see Relaxed ordering below)

Xaqq
  • But if you used sequential consistency as the memory order, then it would be serialised - right? – Jesper Juhl Jun 28 '18 at 15:20
  • @JesperJuhl I don't believe so. My understanding is that it would be serialized wrt operations performing load with `acquire` semantic or write with `release` semantic, or wrt operations carrying a dependency on the `atomic`. An unrelated call to `printf("Hello World\n");` could be reordered. However `int x = my_atomic; printf("%d\n", x);` wouldn't be reordered because `x` now depends on `my_atomic`. – Xaqq Jun 28 '18 at 15:27
  • that makes sense. – Jesper Juhl Jun 28 '18 at 15:31
  • @Xaqq since `memory_order_seq_cst` implies acquire for loads and release for stores, it would give both in this case (fetch_add is both). That in turn means that loads and stores (even non-atomic) cannot be moved past it in either direction, if I understand correctly. – davmac Jun 28 '18 at 15:36
  • @Davmac I think you're right. Acquire / Release prevent reordering in the same thread. – Xaqq Jun 28 '18 at 16:11
  • @JesperJuhl I think I was wrong and Davmac is correct. – Xaqq Jun 28 '18 at 16:14
  • So what is the x86 instruction that does a fetch_add without serializing? I understand that the compiler will take all the freedom in the world. I want to be sure the CPU will as well. – J.N. Jun 29 '18 at 00:38
  • @J.N.: This answer is not useful; it doesn't answer the question about what x86-64 can do. x86-64 asm only has seq-cst for RMW, so weaker ordering only helps for stores (all x86 stores except NT stores are release stores, but not sequential-release). And of course it helps for compile-time reordering, when compiling for x86-64. – Peter Cordes Jun 29 '18 at 06:12