6

In one of the docs for atomic variables in C++0x, when describing memory order, it mentions:

Release-Acquire Ordering

On strongly-ordered systems (x86, SPARC, IBM mainframe), release-acquire ordering is automatic. No additional CPU instructions are issued for this synchronization mode, only certain compiler optimizations are affected...

First, is it true that x86 follows strict memory ordering? It seems very inefficient to always impose this. Does it mean every write and read has a fence?

Also, if I have an aligned int, on an x86 system, do the atomic variables serve any purpose at all?

excalibur

4 Answers

12

Yes, it's true that x86 has strict memory ordering, see Volume 3A, Chapter 8.2 of the Intel manuals. Older x86 processors such as the 386 provided truly strict ordering (called strong ordering) semantics, while more modern x86 processors have slightly relaxed conditions in a few cases, but nothing you need to worry about. For example, the Pentium and 486 allow read cache misses to go ahead of buffered writes when the writes are cache hits (and are therefore to different addresses from the reads).

Yes, it can be inefficient. Sometimes high-performance software is written only for other architectures with looser memory ordering requirements because of this.

Yes, atomic variables still serve a purpose on x86. They have special semantics with the compiler such that a typical read-modify-write operation happens atomically. If you have two threads incrementing an atomic variable (by which I mean a variable of type std::atomic<T> in C++11) simultaneously, you can be assured that the value will be 2 larger; without std::atomic, you might end up with the wrong value because one thread cached the current value in a register while performing the increment, even though the store to memory is atomic on x86.
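
As an illustration of that point, here is a minimal sketch (the variable names and iteration counts are arbitrary) contrasting a plain int with std::atomic<int> under concurrent increments:

#include <atomic>
#include <iostream>
#include <thread>

int plain_counter = 0;                // concurrent ++ here is a data race; updates can be lost
std::atomic<int> atomic_counter{0};   // concurrent ++ here is a single atomic read-modify-write

void work()
{
    for (int i = 0; i < 100000; ++i) {
        ++plain_counter;    // undefined behavior with two threads; often ends up short
        ++atomic_counter;   // well-defined; the final value is exactly 200000
    }
}

int main()
{
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    std::cout << "plain:  " << plain_counter << '\n'    // frequently less than 200000
              << "atomic: " << atomic_counter << '\n';  // always 200000
}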

Adam Rosenfield
  • Thanks. What you describe in that last para can be achieved by "volatile". I'm just wondering why they went to such lengths to add it to the standard, when it seems the majority of architectures already have strong ordering. – excalibur Aug 06 '12 at 21:48
  • @excalibur - Using `volatile` only works with *some* compilers (as an extension). That's why `atomic<>` was added to the standard. – Bo Persson Aug 06 '12 at 22:21
  • @excalibur: Nope, `volatile` doesn't necessarily make read-modify-write cycles atomic. If you increment a `volatile` variable `x` with `x++`, the compiler can read `x` into a register, increment the register, then store `x` back to memory. – Adam Rosenfield Aug 07 '12 at 04:15
  • Even stronger: the observable behavior of `volatile` _must_ be a separate read and write. – MSalters Aug 07 '12 at 06:56
  • @Adam: read-modify-write... are you talking about atomic_compare_exchange? If I simply did x = x + 1 with 2 threads and x is atomic, wouldn't I still have the same behavior as a non-atomic? – excalibur Aug 07 '12 at 13:37
  • @excalibur: No, I'm talking about `operator++`. You can't write `x=x+1` with `std::atomic` because those objects are not copy-assignable, for precisely the reason that doing so is non-atomic. – Adam Rosenfield Aug 07 '12 at 15:10
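
To make the distinction in these comments concrete, here is a small sketch (the function names are just for illustration) contrasting one atomic read-modify-write with a separate load and store on the same std::atomic<int>:

#include <atomic>

std::atomic<int> x{0};

void single_rmw_increment()
{
    ++x;                                         // one atomic read-modify-write
    x.fetch_add(1);                              // equivalent spelling
    x.fetch_add(1, std::memory_order_relaxed);   // same operation with an explicit ordering
}

void load_then_store_increment()
{
    // Two independent atomic operations: another thread can update x between
    // the load and the store, so increments can be lost.
    int tmp = x.load();
    x.store(tmp + 1);
}
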
5

It is true that on x86 all stores have release and all loads have acquire semantics.

That doesn't and shouldn't affect the way you write C++: to write concurrent, race-free code you have to use either std::atomic constructs or locks.

What the architectural details mean is that on x86 there will be very little or no extra code generated for operations on atomic word-sized types as long as you ask for at most acquire/release ordering. (Sequential consistency will emit mfence instructions, though.) However, you still must use the C++ atomic types and cannot just omit them in order to have a correct, well-formed program. One important feature of atomic variables is that they prevent compiler reordering, which is essential to the correctness of your program.

(Pre-C++11, you would have had to use compiler-provided extensions such as GCC's __sync_* suite of functions, which would make the compiler behave correctly. If you really wanted to use naked variables, you would at least have to insert compiler barriers yourself.)
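
For example, a GCC-specific, pre-C++11 sketch of what that looked like (illustrative only, not portable, and superseded by std::atomic):

// Pre-C++11, GCC-specific: no standard memory model, so rely on the __sync_*
// builtins (which act as full barriers) and, for "naked" variables, explicit
// compiler barriers.
static int payload = 0;
static int ready = 0;

void producer()
{
    payload = 42;
    __sync_synchronize();        // full hardware + compiler barrier
    ready = 1;
}

void consumer()
{
    while (__sync_fetch_and_add(&ready, 0) == 0)   // atomic read expressed as a no-op RMW
        ;
    asm volatile("" ::: "memory");                 // compiler-only barrier (no instruction emitted)
    // payload is observed as 42 here (on x86, given the barriers above)
}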

Kerrek SB
  • "sequential consistency will emit mfence" - if its strongly ordered, dosent it mean fences are already present? how would acquire/release semantic work otherwise? – excalibur Aug 06 '12 at 21:50
  • @excalibur: acquire/release are only semi-permeable fences. `mfence` is a full memory barrier, in both directions. It's very expensive, and that's why you would always want to relax to a/r whenever possible (which is free on x86). – Kerrek SB Aug 06 '12 at 22:18
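
To make that comment concrete, C++11 also lets you request such fences explicitly (a tiny sketch):

#include <atomic>

void fences()
{
    // Full barrier: orders all earlier loads and stores before all later ones,
    // in both directions. On x86 this is where an mfence (or equivalent) shows up.
    std::atomic_thread_fence(std::memory_order_seq_cst);

    // Release fence: semi-permeable, constraining only one direction. On x86 it
    // emits no instruction but still restricts compiler reordering.
    std::atomic_thread_fence(std::memory_order_release);
}
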
1

There's a nice table of the different re-ordering operations which can occur, and it shows that (for example) x86 does very few of them. Other architectures (notoriously Alpha) do almost anything.

For the memory models defined by the standard, x86 et al. are inherently compliant.

Your question about atomic variables has a slightly different answer. Any modification to a shared variable involves a potential race condition: when multiple threads update the same variable, an update can be lost. Atomic variables are defined so that they are the correct type for atomic operations, which eliminate this race condition. So one of their purposes is something other than ordering.
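
As a small sketch of the kind of read-modify-write that atomic types make safe, here is an atomic "update the maximum" built from a compare-and-swap loop (the names are illustrative):

#include <atomic>

std::atomic<int> maximum{0};

// Raise `maximum` to at least `value` without losing any concurrent update.
// If another thread changes `maximum` between our load and our store attempt,
// compare_exchange_weak fails, refreshes `current`, and we retry.
void update_max(int value)
{
    int current = maximum.load(std::memory_order_relaxed);
    while (current < value &&
           !maximum.compare_exchange_weak(current, value))
    {
        // loop again with the refreshed `current`
    }
}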

1

Note that release/acquire semantics do not necessarily imply an mfence after each instruction. On x86 the only permitted visible reordering is that reads may be reordered with older writes to different locations, as can be seen in the manual referenced by @Adam Rosenfield or with a quick look at Wikipedia's memory-ordering table. Nevertheless, x86 has release semantics for stores and acquire semantics for loads.
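
To see that permitted reordering in action, here is a Dekker-style litmus sketch (r1/r2 are just result slots; run the two functions on two threads):

#include <atomic>

std::atomic<int> x{0}, y{0};
int r1, r2;

void thread1()
{
    x.store(1, std::memory_order_release);
    r1 = y.load(std::memory_order_acquire);
}

void thread2()
{
    y.store(1, std::memory_order_release);
    r2 = x.load(std::memory_order_acquire);
}

// With release stores and acquire loads, the outcome r1 == 0 && r2 == 0 is allowed:
// each CPU may let its load run ahead of its own earlier store (store buffering).
// Making all four accesses memory_order_seq_cst forbids that outcome, which is
// exactly what the mfence pays for.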

From Kerrek SB's Answer:

What the architectural details mean is that on x86 there will be very little or no extra code generated for operations on atomic word-sized types as long as you ask for at most acquire/release ordering. (Sequential consistency will emit mfence instructions, though.)

Note that sequential consistency is the default! (See for example cppreference).

This means that...

#include <atomic>
#include <cassert>
#include <string>

std::atomic<std::string*> ptr;

void producer()
{
    std::string* p  = new std::string("Hello");
    ptr = p;  // sequentially consistent store (the default) -- this is where the mfence comes from
}

void consumer()
{
    std::string* p2;
    while (!(p2 = ptr))
        ;
    assert(*p2 == "Hello"); // never fails
}

(g++ -std=c++11 -S -O3 on x86)

... will actually result in an mfence being emitted in the producer function to account for the aforementioned relaxation on x86.

Whereas for...

#include <atomic>
#include <cassert>
#include <string>

std::atomic<std::string*> ptr;

void producer()
{
    std::string* p  = new std::string("Hello");
    ptr.store(p, std::memory_order_release);
}

void consumer()
{
    std::string* p2;
    while (!(p2 = ptr.load(std::memory_order_acquire)))
        ;
    assert(*p2 == "Hello"); // never fails
}

(g++ -std=c++11 -S -O3 on x86)

...no mfence will be inserted because x86 has release semantics for stores and acquire semantics for loads.

ben