Unlike ARM and many other RISCs, x86 doesn't have load-linked / store-conditional; architecturally it has stuff like `lock add byte [rdi], 1` or `lock cmpxchg [rdi], ecx` for atomic RMW. See Can num++ be atomic for 'int num'? for some details of the semantics and CPU architecture.
See also x86 equivalent for LWARX and STWCX - arbitrary atomic RMW operations can be synthesized with a CAS (`lock cmpxchg`) retry loop. Unlike LL/SC, it is susceptible to ABA problems, but CAS is the other major way of providing a building block for atomic stuff.
Internally on modern x86 CPUs, this probably works by running a load uop that also "locks" that cache line. (Instead of arming a monitor so a later SC will fail, the "cache lock" prevents MESI responses until a store-unlock, preventing the things that would have made an SC fail on an LL/SC machine.) Taking a cache lock on just that line in MESI Modified state (instead of the traditional bus lock) depends on the memory being cacheable, and on the access being aligned, or at least not splitting across a cache-line boundary.
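To make the `lock cmpxchg` retry loop mentioned above concrete, here's a minimal sketch of synthesizing an RMW operation x86 doesn't provide directly (fetch-and-multiply is a made-up example; function name and x86-64 System V calling convention are assumptions, NASM syntax):

```asm
; uint32_t atomic_fetch_mul(uint32_t *p, uint32_t m)   ; p in RDI, m in ESI
atomic_fetch_mul:
    mov     eax, [rdi]          ; EAX = current value (the "expected" old value)
.retry:
    mov     edx, eax
    imul    edx, esi            ; EDX = old * m  (the desired new value)
    lock cmpxchg [rdi], edx     ; if [rdi] == EAX: store EDX and set ZF
                                ; else: load the fresh [rdi] into EAX and clear ZF
    jne     .retry              ; another core changed it; retry with the updated old value
    ret                         ; old value returned in EAX
```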
x86's `cmov` instruction only has one form, with a register destination, not memory: `cmovcc reg, reg/mem`. Even with a memory source, it's an unconditional load to feed an ALU select operation, so it will segfault on a bad address even if the condition is false. (Unlike ARM predicated instructions, where the whole instruction is NOPed out on a false condition.)
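For example (a hypothetical helper in NASM syntax, x86-64 System V convention assumed), this is the kind of code where the memory-source form can fault even though you might hope it wouldn't:

```asm
; int pick(int cond /*EDI*/, const int *p /*RSI*/, int fallback /*EDX*/)
; Intended: return cond ? *p : fallback
pick:
    mov     eax, edx            ; start with the fallback value
    test    edi, edi            ; FLAGS from cond
    cmovnz  eax, [rsi]          ; the load of [RSI] happens unconditionally,
                                ; so this faults on a bad pointer even when cond == 0
    ret
```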
I guess you could say `lock cmpxchg [mem], reg` is a conditional store, but the only condition possible is whether the old contents of memory match AL/AX/EAX/RAX. https://www.felixcloutier.com/x86/cmpxchg
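A sketch of using it as a one-shot conditional store, e.g. a simple try-lock (the function name and the 0 = unlocked / 1 = locked convention are just assumptions for illustration; NASM syntax):

```asm
; int try_lock(uint32_t *lk /*RDI*/)   ; returns 1 if we took the lock, 0 if not
try_lock:
    xor     eax, eax            ; expected old value: 0 = unlocked
    mov     edx, 1              ; desired new value:  1 = locked
    lock cmpxchg [rdi], edx     ; store 1 only if [rdi] was 0; ZF set on success
    setz    al                  ; AL = 1 if we stored, else 0
    movzx   eax, al             ; return 0 or 1 in EAX
    ret
```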
`rep stosb/w/d/q` is also a conditional store, if you arrange for RCX to be 0 or 1 (e.g. `xor ecx,ecx` / set FLAGS / `setcc cl`); microcode branching isn't branch-predicted, so it's a bit different from normal branching.
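A minimal sketch of that trick (function name and argument choice are just for illustration; NASM syntax, and DF=0 on entry as the x86-64 System V ABI guarantees):

```asm
; void maybe_store_byte(char *dst /*RDI*/, int cond /*ESI*/, char val /*DL*/)
maybe_store_byte:
    mov     al, dl              ; rep stosb stores AL
    xor     ecx, ecx
    test    esi, esi            ; FLAGS from the condition
    setnz   cl                  ; RCX = 1 if cond != 0, else 0
    rep stosb                   ; store AL to [RDI] RCX times, i.e. once or not at all
    ret
```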
AVX `vmaskmovps` or AVX-512 masked stores are truly conditional stores, based on a mask condition. My answer on another Q&A about `cmov` discusses the conditional-load equivalents of these, along with the fact that `cmov` is not a conditional load: it's an ALU select that needs all 3 inputs (FLAGS and 2 integers).
Conditional stores are rare in most ISAs, other than the SC part of an LL/SC pair. 32-bit ARM is an exception to the rule; see Why are conditionally executed instructions not present in later ARM instruction sets? for why AArch64 dropped predication.
AVX and AVX-512 masked stores do not stall the pipeline. See https://agner.org/optimize/ and https://uops.info/ for some performance numbers, plus Intel's optimization manual. They suppress faults on masked elements. Store-forwarding from them (if you reload before they commit to L1d) might stall that load, but not the whole pipeline.
Intel APX (Advanced Performance Extensions) adds REX2 and EVEX prefixes for legacy integer instructions like `sub`, and some new encodings of `cmov` that actually do suppress faults on a load with a false condition, plus a conditional-store version. They use the mnemonic `CFCMOVcc`, Conditionally Faulting CMOV. Intel finally decided to make an extension that requires 64-bit mode, using some of the coding space freed up by removing BCD and other opcodes.

Presumably the hardware handles the conditional load/store similarly to AVX-512 masking.
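A hedged sketch of what using those new encodings might look like (operand forms and the `nz` condition suffix are assumed by analogy with `cmovcc`; needs an APX-aware assembler and CPU):

```asm
    test     esi, esi
    cfcmovnz eax, [rdi]         ; conditional load: unlike cmovnz, assumed not to fault when ESI == 0
    cfcmovnz [rdi], eax         ; conditional-store form: writes [RDI] only if the condition is true
```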