Unlike what was said there, I don't think there is any reason for a try_lock function to fail due to OS-related reasons: the operation is non-blocking, so signals cannot really interrupt it. Most likely it has everything to do with how this function is implemented at the CPU level. After all, the uncontested case is usually the most interesting one for a mutex.
Mutex locking usually requires some form of atomic compare-exchange operation. C++11 and C11 introduce atomic_compare_exchange_strong and atomic_compare_exchange_weak; the latter is allowed to fail spuriously.
By allowing try_lock to fail spuriously, implementations are free to use atomic_compare_exchange_weak to maximize performance and minimize code size.
For example, on ARM64 atomic operations are usually implemented using exclusive-load (LDXR) and exclusive-store (STXR) instructions. LDXR fires up "monitor" hardware which starts tracking all accesses to a memory region. STXR only performs the store if no accesses to that region were made between the LDXR and STXR instructions. Thus the whole sequence can fail spuriously if another thread accesses that memory region, or if an IRQ occurs between the two instructions.
In practice, the code generated for a try_lock implemented with the weak guarantee is not very different from the one implemented with the strong guarantee.
#include <stdatomic.h>
#include <stdbool.h>

bool mutex_trylock_weak(atomic_int *mtx)
{
    int old = 0;
    /* may fail spuriously even when the mutex is free */
    return atomic_compare_exchange_weak(mtx, &old, 1);
}

bool mutex_trylock_strong(atomic_int *mtx)
{
    int old = 0;
    /* fails only if the mutex was actually observed as locked */
    return atomic_compare_exchange_strong(mtx, &old, 1);
}
Take a look at the generated assembly for ARM64:
mutex_trylock_weak:
sub sp, sp, #16
mov w1, 0
str wzr, [sp, 12]
ldaxr w3, [x0] ; exclusive load (acquire)
cmp w3, w1
bne .L3
mov w2, 1
stlxr w4, w2, [x0] ; exclusive store (release)
cmp w4, 0 ; the only difference is here
.L3:
cset w0, eq
add sp, sp, 16
ret
mutex_trylock_strong:
sub sp, sp, #16
mov w1, 0
mov w2, 1
str wzr, [sp, 12]
.L8:
ldaxr w3, [x0] ; exclusive load (acquire)
cmp w3, w1
bne .L9
stlxr w4, w2, [x0] ; exclusive store (release)
cbnz w4, .L8 ; the only difference is here
.L9:
cset w0, eq
add sp, sp, 16
ret
The only difference is that the "weak" version eliminates the conditional backward branch cbnz w4, .L8 and replaces it with cmp w4, 0. In the absence of branch-prediction information, backward conditional branches are predicted by the CPU as "will be taken", since they are assumed to be part of a loop. That assumption is wrong in this case, because most of the time the lock will be acquired (low contention is assumed to be the most common case).
In my opinion this is the only performance difference between those functions. The "strong" version can suffer from a nearly 100% branch-misprediction ratio on a single instruction under some workloads.
By the way, ARMv8.1 introduces single-instruction atomics (such as CAS), so there is no difference between the two, just like on x86_64. Code generated with the -march=armv8.1-a flag:
sub sp, sp, #16
mov w1, 0
mov w2, 1
mov w3, w1
str wzr, [sp, 12]
casal w3, w2, [x0]
cmp w3, w1
cset w0, eq
add sp, sp, 16
ret
Some try_lock functions can fail even when atomic_compare_exchange_strong is used. For example, try_lock_shared of shared_mutex might need to increment a reader counter and can fail if another reader entered the lock concurrently. A "strong" variant of such a function would still need to generate a loop and thus can suffer from a similar branch misprediction.
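A minimal sketch of such a reader-counting try_lock_shared (the encoding with a negative value marking a writer is an assumption for illustration, not how any particular shared_mutex is implemented):
#include <stdatomic.h>
#include <stdbool.h>

/* Sketch: state >= 0 is the number of readers, a writer is encoded as -1. */
bool rwlock_try_lock_shared(atomic_int *state)
{
    int old = atomic_load(state);
    do {
        if (old < 0)
            return false;   /* a writer holds the lock */
        /* the CAS below fails (non-spuriously) whenever another reader bumps
         * the counter first, so a retry loop is unavoidable even with the
         * "strong" variant */
    } while (!atomic_compare_exchange_strong(state, &old, old + 1));
    return true;
}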
Another minor detail: if the mutex is written in C, some compilers (like Clang) might align the loop to a 16-byte boundary to improve its performance, bloating the function body with padding. This is unnecessary if the loop almost always runs a single iteration.
Another reason for spurious failure is a failure to acquire an internal lock (if the mutex is implemented using a spinlock plus some kernel primitive). In theory the same principle could be applied in the kernel implementation of try_lock, although this does not seem reasonable.
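A sketch of that idea (the layout and names are hypothetical, not taken from any real implementation): a try_lock that simply reports failure when the internal spinlock protecting the mutex state is busy, instead of spinning on it:
#include <stdatomic.h>
#include <stdbool.h>

struct mutex {
    atomic_flag internal; /* internal spinlock protecting the state below */
    bool        locked;   /* is the mutex itself held? */
};

bool mutex_trylock(struct mutex *m)
{
    /* give up immediately if the internal spinlock is busy: a spurious
     * failure, since the mutex itself may well be free */
    if (atomic_flag_test_and_set(&m->internal))
        return false;

    bool acquired = !m->locked;
    if (acquired)
        m->locked = true;

    atomic_flag_clear(&m->internal);
    return acquired;
}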