The target architecture
On the X86 architecture, an RMW operation always flushes the store
buffer and therefore is effectively a serializing and sequentially
consistent operation.
I wish people would stop saying that.
That statement doesn't even make sense: there is no such thing as a "sequentially consistent operation", because sequential consistency isn't a property of any single operation. A sequentially consistent execution is one whose end result could have been produced by some interleaving of the operations of all threads.
What can be said about these RMW operations:
- all operations before the RMW have to be globally visible before either the R or the W of the RMW is visible;
- no operation after the RMW is visible before the RMW is visible.
That is, the part before, the RMW itself, and the part after run sequentially. In other words, there is a full fence before and after the RMW.
Whether that results in a sequential execution for the complete program depends on the nature of all globally visible operations of the program.
Visibility vs. execution ordering
That's in terms of visibility. I have no idea whether these processors try to speculatively execute code after the RMW, subject to the correctness requirement that operations are rolled back if there is a conflict with a side effect of a parallel execution (these details tend to differ between vendors and generations within a given family, unless they're clearly specified).
The answer to your question could be different depending on whether you need
- to guarantee correctness of the set of side effects (as in a sequential-consistency requirement),
- to guarantee that benchmarks are reliable,
- or to guarantee something about comparisons of the timing of different executions (for a given CPU), independently of the CPU version.
High level languages vs. CPU features
The question title is "is std::atomic::fetch_add a serializing operation on x86-64?", of the general form:
"does OP provide guarantees P on ARCH"
where
- OP is a high level operation in a high level language
- P is the desired property
- ARCH is a specific CPU or compiler target
As a rule, the canonical answer is: the question doesn't make sense, because OP is high level and target independent. There is a low-level/high-level mismatch here.
The compiler is bound by the language standard (or rather its most reasonable interpretation), by documented extensions, by history... not by the standard for the target architecture, unless the feature is a low-level, transparent feature of the high level language.
The canonical way to get low-level semantics in C/C++ is to use volatile objects and volatile operations. Here you must use a volatile std::atomic<int> to even be able to ask a meaningful question about architectural guarantees.
Present code generation
The meaningful variant of your question would use this code:
volatile std::atomic<int> counter;
/* otherStuff 1 */
counter.fetch_add(1, std::memory_order_relaxed);
That statement will generate an atomic RMW operation, which in that case "is a serializing operation" on the CPU: all operations performed before it, in the assembly code, are complete before the RMW starts; all operations following the RMW wait until the RMW completes before starting (in terms of visibility).
And then you would need to learn about the unpleasantness of volatile semantics: volatile applies only to those volatile operations, so you would still not get general guarantees about other operations.
There is no guarantee that high-level C++ operations before the volatile RMW operation are sequenced before it in the assembly code. You would need a "compiler barrier" for that. Such barriers are not portable. (And they are not needed here, as it's a silly approach anyway.)
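For completeness, a compiler-only barrier can be spelled in standard C++ as std::atomic_signal_fence (it constrains compiler reordering only, not CPU reordering); the traditional GCC/Clang asm volatile("" ::: "memory") idiom is the non-portable equivalent. A minimal sketch, with illustrative names:

```cpp
#include <atomic>

int data = 0;
volatile std::atomic<int> counter{0};

void publish() {
    data = 42; // ordinary, non-volatile store
    // Compiler barrier: forbids the compiler from sinking the store to
    // `data` below this point. It orders nothing at the CPU level.
    std::atomic_signal_fence(std::memory_order_seq_cst);
    counter.fetch_add(1, std::memory_order_relaxed);
}
```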
But then if you want that guarantee, you can just use:
- a release operation, to ensure that previous globally visible operations are complete;
- an acquire operation, to ensure that following globally visible operations do not start earlier;
- an RMW operation on an object that is visible to multiple threads.
So why not make your RMW operation acq_rel? Then it wouldn't even need to be volatile.
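A minimal sketch of that approach (names are illustrative): the acq_rel RMW both publishes earlier writes and orders later operations, with no volatile needed.

```cpp
#include <atomic>
#include <cassert>

std::atomic<int> counter{0};
int payload = 0;

void writer() {
    payload = 7;                                     // ordinary write
    counter.fetch_add(1, std::memory_order_acq_rel); // release side publishes it
}

void reader() {
    // An acquire load synchronizes with the release side of the fetch_add,
    // so the write to `payload` is guaranteed to be visible.
    if (counter.load(std::memory_order_acquire) == 1)
        assert(payload == 7);
}
```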
Possible RMW variants in a processor family
Is there an instruction in x86-64 (say less than 5 years old architectures) that would
Potential variants of the instruction set are another sub-question. Vendors can introduce new instructions, along with ways to test for their availability at runtime; compilers can even generate code to detect their availability.
Any RMW feature that followed the existing tradition (1) of strong ordering of usual memory operations in that family would have to respect these traditions:
- Total Store Order: all store operations are ordered, implicitly fenced; in other words, each core has a store buffer strictly for non-speculative store operations, which is not reordered and not shared between cores;
- each store is a release operation (for previous normal memory operations);
- loads that are speculatively started are completed in order, and at completion are validated: any early load from a location that was since clobbered in the cache is cancelled and the computation is restarted with the recent value;
- loads are acquire operations.
Then any new (but traditional) RMW operation must be both an acquire operation and a release operation.
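As an illustration of how these traditions surface in code generation: under TSO, a release store and an acquire load need no extra fencing, and mainstream compilers typically emit plain mov instructions for both on x86-64 (an assumption worth checking against your own compiler's assembly output; names below are illustrative).

```cpp
#include <atomic>

std::atomic<int> flag{0};
int data = 0;

void produce() {
    data = 1;
    // Every x86 store is already a release; this typically compiles
    // to a plain mov on x86-64, with no fence instruction.
    flag.store(1, std::memory_order_release);
}

int consume() {
    // Every x86 load is already an acquire; also typically a plain mov.
    if (flag.load(std::memory_order_acquire) == 1)
        return data; // guaranteed visible after the acquire load
    return -1;
}
```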
(Examples of potential imaginary RMW operations to be added in the future are xmult and xdiv.)
But that's futurology, and adding less-ordered instructions in the future wouldn't violate any security invariants, except potentially against timing-based, side-channel Spectre-like attacks, which we don't know how to model and reason about in general anyway.
The problem with these questions, even about the present, is that a proof of absence would be required, and for that we would need to know about each variant of a CPU family. That is not always doable; it's also unnecessary if you use the proper ordering in the high-level code, and useless if you don't.
(1) Traditions for guarantees of memory operations are guidelines in CPU design, not guarantees about any future operation: by definition, operations that don't yet exist have no guaranteed semantics, beside the guarantees of memory integrity, that is, the guarantee that no future operation will break the privilege and security guarantees previously established (no unprivileged instruction created in the future can access an unmapped memory address...).