
As we know from the C11 memory_order documentation: http://en.cppreference.com/w/c/atomic/memory_order

And the same from C++11's std::memory_order: http://en.cppreference.com/w/cpp/atomic/memory_order

On strongly-ordered systems (x86, SPARC, IBM mainframe), release-acquire ordering is automatic. No additional CPU instructions are issued for this synchronization mode; only certain compiler optimizations are affected (e.g. the compiler is prohibited from moving non-atomic stores past the atomic store-release, or from performing non-atomic loads earlier than the atomic load-acquire).

But is this also true for x86 SSE instructions (except the non-temporal [NT] ones, where we must always use LFENCE/SFENCE/MFENCE)?

Here it is said that "sse instructions ... is no requirement on backwards compatibility and memory order is undefined". It is believed that strict ordering was kept for compatibility with older x86 processors, back when it was needed, but that the newer instructions, namely SSE (except the [NT] ones), are automatically deprived of release-acquire ordering. Is that right?

Alex
  • I didn't mean that all sse instructions break memory ordering, but that some might do it. And gcc can't know if an external function contains problematic instructions. See the recommendations in section 8.2.5 of the referred document. "The Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium 4, Intel Xeon, and P6 family processors do not implement a strong memory-ordering model, except when using the UC memory type." – smossen Sep 30 '13 at 14:13
  • @smossen But if we treat "release-acquire ordering is automatic" as the "strong memory-ordering model", and if you mean that "release-acquire ordering is automatic" doesn't hold for some x86 instructions and `MFENCE` is needed, then `std::memory_order_acq_req` must also use `MOV+MFENCE` for those same x86 instructions, is that right? – Alex Sep 30 '13 at 14:30
  • I'm not sure if I understand you correctly. Do you have an example where std::memory_order_acq_req is used together with "new" instructions? – smossen Sep 30 '13 at 14:46
  • @smossen Why do you ask about "new"? `strcpy` doesn't use "new" in your example: http://stackoverflow.com/a/19088403/1558037 But you can see "new" in the line `std::string* p = new std::string("Hello");` **in the example for Release-Acquire ordering**; or, if you mean "new SSE instructions", `std::string` can contain those too, by the link from my question: http://en.cppreference.com/w/cpp/atomic/memory_order – Alex Sep 30 '13 at 14:51
  • Sorry for my confusion. I meant a new x86 instruction, i.e. an instruction introduced in some of the sse extensions potentially breaking strong ordering. – smossen Sep 30 '13 at 14:57
  • @smossen In short, I want to say: if sequential consistency needs `MFENCE` for "new SSE instructions", then acquire-release also needs `MFENCE` for the same "new SSE instructions". And if acquire-release doesn't need `MFENCE`, then sequential consistency doesn't need it either (it only needs `SFENCE` after the `STORE`). – Alex Sep 30 '13 at 15:11

2 Answers


Here is an excerpt from Intel's Software Developer's Manual, Volume 3, Section 8.2.2 (edition 325384-052US of September 2014):

  • Reads are not reordered with other reads.
  • Writes are not reordered with older reads.
  • Writes to memory are not reordered with other writes, with the following exceptions:
    • writes executed with the CLFLUSH instruction;
    • streaming stores (writes) executed with the non-temporal move instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD); and
    • string operations (see Section 8.2.4.1).
  • Reads may be reordered with older writes to different locations but not with older writes to the same location.
  • Reads or writes cannot be reordered with I/O instructions, locked instructions, or serializing instructions.
  • Reads cannot pass earlier LFENCE and MFENCE instructions.
  • Writes cannot pass earlier LFENCE, SFENCE, and MFENCE instructions.
  • LFENCE instructions cannot pass earlier reads.
  • SFENCE instructions cannot pass earlier writes.
  • MFENCE instructions cannot pass earlier reads or writes.

The first three bullets describe the release-acquire ordering, and the exceptions are listed there explicitly. As you can see, only cacheability-control instructions (MOVNT*) are in the exception list, while the rest of SSE/SSE2 and other vector instructions obey the general memory ordering rules and do not require use of [LSM]FENCE.

Alexey Kukanov
  • Thank you! But `MOVNTI`, `MOVNTDQ` and `MOVNTPD` are **SSE2** instructions from "emmintrin.h": https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/index.htm#GUID-7855D7EB-948F-4C4F-9442-CC821CAF83B6.htm And `MOVNTQ`, `MOVNTPS` are **SSE** instructions from "xmmintrin.h": https://software.intel.com/en-us/node/514313 Or does it mean that only these 5 instructions (in fact a few more: all **NT** (non-temporal) SSE/AVX **cacheability-control** instructions) can be reordered with other writes, but not all of SSE/AVX? – Alex Dec 05 '14 at 11:03
  • Exactly, only these non-temporal move (\*MOVNT\*) instructions can be reordered, but not the rest of SSE. I will clarify it in the answer. – Alexey Kukanov Dec 05 '14 at 12:13
  • Does "read" and "write" subsume instructions like `add` with memory operand? – Kerrek SB Jul 15 '15 at 13:07
  • @KerrekSB yes, this would have a load component that would be allocated in the load buffer even if the instruction is microfused in the ROB – Lewis Kelsey Mar 03 '19 at 05:48

It is true that normal1 SSE load and store instructions, as well as the implied load when using a memory source operand, have the same acquire and release ordering behavior as normal loads and stores of GP registers.

They are not, however, generally useful for directly implementing std::memory_order_acquire or std::memory_order_release operations on std::atomic objects larger than 8 bytes, because there is no guarantee of atomicity for SSE or AVX loads and stores larger than 8 bytes. The missing guarantee isn't just theoretical: several implementations (including brand-new ones like AMD's Ryzen) split large loads or stores into two smaller ones.


1 I.e., those not listed in the exception list in the accepted answer: NT stores, clflush and string operations.

BeeOnRope
  • 8-byte SSE2 `movq` is atomic. `gcc -m32 -msse2` uses it to implement `std::atomic` load/store. See also `atomic` stuff in this Q&A: https://stackoverflow.com/questions/45055402/atomic-double-floating-point-or-sse-avx-vector-load-store-on-x86-64. Also note that atomic access to L1D doesn't guarantee that the cache-coherency protocol keeps those stores atomic. The answer you linked shows that K10 has atomic 16B ops within a single socket, but transferring lines between sockets over HyperTransport can cause tearing at 8B boundaries. – Peter Cordes Sep 01 '17 at 17:49
  • We know that tearing within cache lines isn't a problem on Intel, though, because they guarantee (for P6 and later) that any 8B or smaller load/store fully contained within a single cache line is atomic. But AMD only guarantees that any cached access that doesn't cross an 8B boundary is atomic: https://stackoverflow.com/questions/36624881/why-is-integer-assignment-on-a-naturally-aligned-variable-atomic – Peter Cordes Sep 01 '17 at 17:50
  • 1
    @peter you can read the above as only talking about operations larger than 8 bytes. And, yes, I linked that answer in part because it shows you shouldn't even try to reason about the hardware if the manufacturer themselves doesn't guarantee things - and the hypertransport hole was a great example of that (throwing off everyone who reasoned based on bus width and cache behavior). – BeeOnRope Sep 01 '17 at 17:51
  • BTW, I *think* aligned vector loads/store can't cause tearing within a single 4B/8B element. So you could use SIMD for `mo_relaxed` operations on `atomic shared_array[]`. The Intel manuals definitely don't guarantee that, just saying that wider vectors may be handled with multiple accesses. – Peter Cordes Sep 01 '17 at 18:00
  • Good question! I hadn't thought of that. The Operation section of the manual (http://felixcloutier.com/x86/VPGATHERDD:VPGATHERQD.html) describes it in terms of `FETCH_32BITS(DATA_ADDR);` which implies that it's separate 32 or 64-bit accesses. But IDK if stuff like masked loads might get described the same way. Hmm, yeah [`vmaskmovps`](http://felixcloutier.com/x86/VMASKMOV.html) has a similar description with stuff like `Load_32(mem + 4) ELSE 0`. – Peter Cordes Sep 01 '17 at 18:52
  • We know (at least for the case with no masked pagefaults) that `vmaskmovps` is 1 load-port uop and 2 port5 uops on Haswell. Which tells us basically nothing, because we already know Haswell can atomically read 32B from an L1D cache line, which has more atomicity than doing separate 4B loads for each element. It doesn't tell us whether a CPU is required to be atomic within each element, although I think that's implied. The store case is a single store-addr/store-data uop pair, plus two ALU uops. – Peter Cordes Sep 01 '17 at 18:59
  • Posted as a question: [Per-element atomicity of vector load/store and gather/scatter?](https://stackoverflow.com/questions/46012574/per-element-atomicity-of-vector-load-store-and-gather-scatter) – Peter Cordes Sep 02 '17 at 09:56