I have read a bunch of spinlock implementations for the x86/amd64 architecture, including glibc's `pthread_spin_lock`/`pthread_spin_unlock`. Roughly speaking, they use the `cmpxchg` instruction to acquire the lock, and just a regular `MOV` instruction to release it. How come there is no need to flush the store buffer before the lock is released?
Consider the following two threads running on different cores, where statement s100 runs immediately after s3.
thread 1:
```
s1:   pthread_spin_lock(&mylock);
s2:   x = 100;
s3:   pthread_spin_unlock(&mylock); // a function which contains "MOV mylock, 1"
```
thread 2:
```
s100: pthread_spin_lock(&mylock);
s200: assert(x == 100);
s300: pthread_spin_unlock(&mylock);
```
Is s200 guaranteed to be true? Is it possible that, by the time s100 acquires the lock, x's new value has still not been flushed from the store buffer to the cache?
I'm wondering:
- Is the call overhead (of `pthread_spin_unlock()`) sufficient to cover the time it takes to flush the store buffer to cache?
- Does `cmpxchg`, or any instruction with an implicit or explicit `LOCK` prefix, magically flush the store buffers on other cores?
If s200 is not guaranteed to be true, what is the least expensive way to fix it?
- insert an `mfence` instruction before the `MOV` instruction,
- replace the `MOV` instruction with an atomic `fetch-and-and`/`fetch-and-or` instruction,
- or something else?
Profuse thanks in advance!