ISO C++
In ISO C++, no: `release` is the minimum for the writer side of doing some (possibly non-atomic) stores and then storing a `data_ready` flag. Or for locking / mutual exclusion, to keep loads before a release store and stores after an acquire load (no LoadStore reordering). Or anything else happens-before gives you. (C++'s model works in terms of guarantees on what a load can or must see, not in terms of local reordering of loads and stores from a coherent cache. I'm talking about how those guarantees are mapped into asm for normal ISAs.) `acq_rel` RMWs, or `seq_cst` stores or RMWs, also work, but are stronger than `release`.
Asm with weaker guarantees that might be sufficient for some cases
In asm for some platform there might be something weaker you could do, but it wouldn't give full happens-before. I don't think `release` has any requirements which are superfluous to happens-before and normal acq/rel synchronization. (https://preshing.com/20120913/acquire-and-release-semantics/).
Some common use cases for acq/rel sync only need StoreStore ordering on the writer side and LoadLoad on the reader side (e.g. producer / consumer with one-way communication: non-atomic stores and a `data_ready` flag). Without the LoadStore ordering requirement, I could imagine either the writer or reader being cheaper on some platforms.
Perhaps PowerPC or RISC-V? I checked what compilers do on Godbolt for `a.load(acquire)` and `a.store(1, release)`.
```asm
# clang (trunk) for RISC-V, -O3
load(std::atomic<int>&):      # acquire
        lw      a0, 0(a0)     # apparently RISC-V just has barriers, not acquire *operations*
        fence   r, rw         # but the barriers do let you block only what is necessary
        ret
store(std::atomic<int>&):     # release
        fence   rw, w
        li      a1, 1
        sw      a1, 0(a0)
        ret
```
If `fence r` and/or `fence w` exist and are ever cheaper than `fence r, rw` or `fence rw, w`, then yes, RISC-V can do something slightly cheaper than acq/rel. Unless I'm missing something, that would still be strong enough if you just want loads after an acquire load to see stores from before a release store, but don't care about the LoadStore half: other loads staying before a release store, and other stores staying after an acquire load.
CPUs naturally want to load early and store late to hide latencies, so it's usually not much of a burden to block LoadStore reordering on top of blocking LoadLoad or StoreStore. At least that's true as long as the ISA makes it possible to get the ordering you need without a much stronger barrier. (The painful case is when the only option that meets the minimum requirement is far beyond it, like 32-bit ARMv7 where you'd need a `dmb ish` full barrier that also blocks StoreLoad.)
`release` is free on x86; other ISAs are more interesting.
`memory_order_release` is basically free on x86, only needing to block compile-time reordering. (See C++ - How is release-and-acquire achieved on x86 only using MOV? - the x86 memory model is program order plus a store buffer with store-forwarding.)
x86 is a silly choice to ask about; something like PowerPC where there are multiple different choices of light-weight barrier would be more interesting. Turns out it only needs one barrier each for acquire and release, but seq_cst needs multiple different barriers before and after.
PowerPC asm looks like this for `load(acquire)` and `store(1, release)`:
```asm
load(std::atomic<int>&):
        lwz    %r3,0(%r3)
        cmpw   %cr0,%r3,%r3   # I think for a data dependency on the load
        bne-   %cr0,$+4       # never-taken, if I'm reading this right?
        isync                 # instruction sync, blocking the front-end until older instructions retire?
        blr
store(std::atomic<int>&):
        li     %r9,1
        lwsync                # light-weight sync = LoadLoad + StoreStore + LoadStore (but not blocking StoreLoad)
        stw    %r9,0(%r3)
        blr
```
I don't know if `isync` is always cheaper than `lwsync`, which I'd think would also work there; I'd have thought stalling the front-end might be worse than imposing some ordering on loads and stores.
I suspect the reason for the compare-and-branch instead of just `isync` (documentation) is that a load can retire from the back-end ("complete") once it's known to be non-faulting, before the data actually arrives.
(x86 doesn't do this, but weakly-ordered ISAs do; it's how you get LoadStore reordering on CPUs like ARM, with in-order or out-of-order exec. Retirement goes in program order, but stores can't commit to L1d cache until after they retire. x86 requiring loads to produce a value before they can retire is one way to guarantee LoadStore ordering. How is load->store reordering possible with in-order commit?)
So on PowerPC, the compare into condition-register 0 (`%cr0`) has a data dependency on the load, and can't execute until the data arrives. Thus it can't complete. I don't know why there's also an always-false branch on it. I think the `$+4` branch destination is the `isync` instruction, in case that matters. I wonder if the branch could be omitted if you only need LoadLoad, not LoadStore? Unlikely.
IDK if ARMv7 can maybe block just LoadLoad or StoreStore. If so, that would be a big win over `dmb ish`, which compilers use because they also need to block LoadStore.
Loads cheaper than acquire: `memory_order_consume`
This is the useful hardware feature that ISO C++ doesn't currently expose: `std::memory_order_consume` is defined in a way that's too hard for compilers to implement correctly in every corner case without introducing more barriers, so it's deprecated and compilers handle it the same as `acquire`.
Dependency ordering (on all CPUs except DEC Alpha) makes it safe to load a pointer and deref it without any barriers or special load instructions, and still see the pointed-to data if the writer used a release store.
If you want to do something cheaper than ISO C++ `acquire`/`release`, the load side is where the savings are on ISAs like POWER and ARMv7. (Not x86; full acquire is free there.) To a much lesser extent on ARMv8, I think, as `ldapr` should be cheapish.
See C++11: the difference between memory_order_relaxed and memory_order_consume for more, including a talk from Paul McKenney about how Linux uses plain loads (effectively `relaxed`) to make the read side of RCU very, very cheap, with no barriers, as long as they're careful not to write code where the compiler can optimize the data dependency away into just a control dependency, or into nothing.
Also related: