Unlike ARM and many other RISCs, x86 doesn't have load-linked / store-conditional; architecturally it has stuff like `lock add byte [rdi], 1` or `lock cmpxchg [rdi], ecx` for atomic RMW. See Can num++ be atomic for 'int num'? for some details of the semantics and CPU architecture.
See also x86 equivalent for LWARX and STWCX - arbitrary atomic RMW operations can be synthesized with a CAS (`lock cmpxchg`) retry loop. Unlike LL/SC, it is susceptible to ABA problems, but CAS is the other major way of providing a building block for atomic stuff.
Internally on modern x86 CPUs, this probably works by running a load uop that also "locks" that cache line. (Instead of arming a monitor so a later SC will fail, the "cache lock" prevents MESI responses until a store-unlock, preventing the things that would have made an SC fail on an LL/SC machine.) Taking a cache lock on just that line in MESI Modified state (instead of the traditional bus lock) depends on the memory being cacheable, and on the access being aligned, or at least not splitting across a cache-line boundary.
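To make the `lock cmpxchg` retry loop mentioned above concrete, here's a minimal sketch of synthesizing an RMW operation x86 doesn't provide directly (fetch-and-multiply is a made-up example; function name and x86-64 System V calling convention are assumptions, NASM syntax):

```asm
; uint32_t atomic_fetch_mul(uint32_t *p, uint32_t m)   ; p in RDI, m in ESI
atomic_fetch_mul:
    mov     eax, [rdi]          ; EAX = current value (the "expected" old value)
.retry:
    mov     edx, eax
    imul    edx, esi            ; EDX = old * m  (the desired new value)
    lock cmpxchg [rdi], edx     ; if [rdi] == EAX: store EDX and set ZF
                                ; else: load the fresh [rdi] into EAX and clear ZF
    jne     .retry              ; another core changed it; retry with the updated old value
    ret                         ; old value returned in EAX
```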
x86's `cmov` instruction only has one form, with a register destination, not memory: `cmovcc reg, reg/mem`. Even with a memory source, it's an unconditional load to feed an ALU select operation, so it will segfault on a bad address even if the condition is false. (Unlike ARM predicated instructions, where the whole instruction is NOPed out on a false condition.)
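For example (a hypothetical helper in NASM syntax, x86-64 System V convention assumed), this is the kind of code where the memory-source form can fault even though you might hope it wouldn't:

```asm
; int pick(int cond /*EDI*/, const int *p /*RSI*/, int fallback /*EDX*/)
; Intended: return cond ? *p : fallback
pick:
    mov     eax, edx            ; start with the fallback value
    test    edi, edi            ; FLAGS from cond
    cmovnz  eax, [rsi]          ; the load of [RSI] happens unconditionally,
                                ; so this faults on a bad pointer even when cond == 0
    ret
```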
I guess you could say `lock cmpxchg [mem], reg` is a conditional store, but the only condition possible is whether the old contents of memory match AL/AX/EAX/RAX. https://www.felixcloutier.com/x86/cmpxchg
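A sketch of using it as a one-shot conditional store, e.g. a simple try-lock (the function name and the 0 = unlocked / 1 = locked convention are just assumptions for illustration; NASM syntax):

```asm
; int try_lock(uint32_t *lk /*RDI*/)   ; returns 1 if we took the lock, 0 if not
try_lock:
    xor     eax, eax            ; expected old value: 0 = unlocked
    mov     edx, 1              ; desired new value:  1 = locked
    lock cmpxchg [rdi], edx     ; store 1 only if [rdi] was 0; ZF set on success
    setz    al                  ; AL = 1 if we stored, else 0
    movzx   eax, al             ; return 0 or 1 in EAX
    ret
```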
`rep stosb/w/d/q` is also a conditional store, if you arrange for RCX to be 0 or 1 (e.g. `xor ecx,ecx` / set FLAGS / `setcc cl`); microcode branching isn't branch-predicted, so it's a bit different from normal branching.
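A minimal sketch of that trick (function name and argument choice are just for illustration; NASM syntax, and DF=0 on entry as the x86-64 System V ABI guarantees):

```asm
; void maybe_store_byte(char *dst /*RDI*/, int cond /*ESI*/, char val /*DL*/)
maybe_store_byte:
    mov     al, dl              ; rep stosb stores AL
    xor     ecx, ecx
    test    esi, esi            ; FLAGS from the condition
    setnz   cl                  ; RCX = 1 if cond != 0, else 0
    rep stosb                   ; store AL to [RDI] RCX times, i.e. once or not at all
    ret
```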
AVX `vmaskmovps` or AVX-512 masked stores are truly conditional stores, based on a mask condition. My answer on another Q&A about `cmov` discusses the conditional-load equivalents of these, along with the fact that `cmov` is not a conditional load: it's an ALU select that needs all 3 inputs (FLAGS and 2 integers).
Conditional stores are rare in most ISAs, other than the SC part of an LL/SC pair. 32-bit ARM is an exception to the rule; see Why are conditionally executed instructions not present in later ARM instruction sets? for why AArch64 dropped predication.
AVX and AVX-512 masked stores do not stall the pipeline. See https://agner.org/optimize/ and https://uops.info/ for some performance numbers, plus Intel's optimization manual. They suppress faults on masked elements. Store-forwarding from them (if you reload before they commit to L1d) might stall that load, but not the whole pipeline.
Intel APX (Advanced Performance Extensions) adds REX2 and EVEX prefixes for legacy integer instructions like `sub`, and some new encodings of `cmov` that actually do suppress faults on a load with a false condition, plus a conditional-store version. They use the mnemonic `CFCMOVcc`, Conditionally Faulting CMOV. Intel finally decided to make an extension that requires 64-bit mode, using some of the coding space freed up by removing BCD and other opcodes.

Presumably the hardware handles the conditional load/store similarly to AVX-512 masking.
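A hedged sketch of what using those new encodings might look like (operand forms and the `nz` condition suffix are assumed by analogy with `cmovcc`; needs an APX-aware assembler and CPU):

```asm
    test     esi, esi
    cfcmovnz eax, [rdi]         ; conditional load: unlike cmovnz, assumed not to fault when ESI == 0
    cfcmovnz [rdi], eax         ; conditional-store form: writes [RDI] only if the condition is true
```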