To start with, consider release semantics. Suppose a data set is protected by a spinlock (mutex, etc.; the exact implementation does not matter here; for now, assume 0 means it is free and 1 means busy). After changing the data set, a thread stores 0 to the spinlock address. To force visibility of all previous actions before that store, the store is executed with release semantics, which means all previous reads and writes shall be made visible to other threads before it. Whether this is done with a full barrier or with a release mark on the single store operation is an implementation detail. That is (I hope) clear without any doubt.
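To make this concrete, a minimal unlock might look like this in C++ (my own sketch under the assumptions above: the spinlock is a std::atomic<int>, 0 is free, 1 is busy):

#include <atomic>

void unlock(std::atomic<int>* p) {
    // Release store: every read and write of the protected data that
    // precedes this store is made visible before the lock reads as free.
    p->store(0, std::memory_order_release);
}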
Then consider the moment when spinlock ownership is taken. To protect against races, this is some kind of compare-and-set operation. With a single-instruction CAS implementation (x86, SPARC...), this is a combined read and write; the same goes for x86's atomic XCHG. With LL/SC (most RISCs), it decomposes into the following steps (a C++ sketch follows the list):
- Read (LL) the spinlock location until it shows the free state. (This can be optimized with a kind of CPU stall.)
- Write (SC) the value "occupied" (1 in our case). The CPU exposes whether the operation succeeded (condition flag, output register, etc.).
- Check the SC result and, if it failed, go back to step 1.
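On LL/SC targets this retry loop is roughly what a weak CAS compiles down to. Here is a hedged sketch of a spinlock acquire (spin_lock is my own name, not a library function), including the spin-on-plain-read optimization from step 1:

#include <atomic>

void spin_lock(std::atomic<int>* p) {
    for (;;) {
        int e = 0;  // expected value: free
        // A weak CAS may fail spuriously, exactly like a failed SC.
        if (p->compare_exchange_weak(e, 1,
                                     std::memory_order_acquire,
                                     std::memory_order_relaxed))
            return;  // we own the lock now
        // Step 1: spin on a plain read until the lock shows free again.
        while (p->load(std::memory_order_relaxed) != 0)
            ;
    }
}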
In all cases, the operation that shall become visible to other threads to show that the spinlock is occupied is the write of 1 to its location, and a barrier shall be committed between this write and the following manipulations of the data set the spinlock protects. The read of the spinlock adds nothing to the protection scheme, except that it permits the CAS or LL/SC operation.
But all actually implemented schemes put the acquire semantics on reads (or on the CAS as a whole), not on writes. As a result, the LL/SC scheme would seem to require an additional final read-with-acquire on the spinlock to commit the required barrier. But there is no such instruction in typical compiler output. For example, compiling this for ARM:
for (;;) {
    int e{0};  // expected: lock is free
    int d{1};  // desired: lock is busy
    if (std::atomic_compare_exchange_weak_explicit(p, &e, d,
                                                   std::memory_order_acquire,
                                                   std::memory_order_relaxed)) {
        return;
    }
}
the output contains first LDAXR (== LL with acquire), then STXR (== SC, with no barrier in it; so is there no guarantee that other threads will see it?). This is likely not an artifact of my build: the same pattern is generated e.g. in glibc, where pthread_spin_trylock calls __atomic_compare_exchange_weak_acquire (and no further barriers), which falls down to the GCC builtin __atomic_compare_exchange_n, with acquire on the mutex read and no release on the mutex write.
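For reference, the AArch64 sequence I am describing looks approximately like this (written out by hand to illustrate, not verbatim compiler output; register choices are arbitrary):

    ldaxr   w1, [x0]      ; LL with acquire: read the lock word
    cmp     w1, #0        ; does it hold the expected (free) value?
    b.ne    .Lexit        ; mismatch: the CAS fails
    stxr    w2, w3, [x0]  ; SC of 1 (held in w3); no barrier, no release mark
.Lexit:                   ; w2 == 0 on SC success; a weak CAS reports an SC
                          ; failure simply as CAS failure, and the retry
                          ; loop stays in the C++ code above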
It seems I have missed some fundamental detail in this reasoning. Would anybody correct it?
This could also be split into two sub-questions:
SQ1: In an instruction sequence like:
(1) load_linked+acquire mutex_address ; found it is free
(2) store_conditional mutex_address ; succeeded
(3) read or write of mutex-protected area
what prevents the CPU from reordering (2) and (3), with the result that other threads won't see that the mutex is locked?
SQ2: Is there a design factor that suggests having acquire semantics only on loads?
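Notably, the C++ level mirrors the same asymmetry; a small illustration of my own (not from any particular source):

#include <atomic>

std::atomic<int> m{0};

void demo() {
    (void)m.load(std::memory_order_acquire);  // fine: acquire is defined for loads
    // m.store(1, std::memory_order_acquire); // not allowed: the standard forbids
    //                                        // acquire/acq_rel/consume as the
    //                                        // ordering of a plain store
}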
I have also seen examples of lock-free code such as:
thread 1:
    var = value;
    flag.store(true, std::memory_order_release);  // publish var

thread 2:
    if (flag.load(std::memory_order_acquire)) {   // the flag is set, so var is visible
        // We can already access it!
        value = var;
        ... do something with value ...
    }
but this pattern, presumably, could only have been made to work after the mutex-protected style was already working stably.