I have read a bunch of spinlock implementations for the x86/amd64 architecture, including glibc's `pthread_spin_lock`/`pthread_spin_unlock`. Roughly speaking, they use the `cmpxchg` instruction to acquire the lock, and just a regular `MOV` instruction to release it. How come there is no need to flush the store buffer before the lock is released?
Consider the following two threads running on different cores, where statement s100 runs immediately after s3.
thread 1:
```
s1:   pthread_spin_lock(&mylock);
s2:   x = 100;
s3:   pthread_spin_unlock(&mylock); // a function which contains "MOV mylock, 1"
```
thread 2:
```
s100: pthread_spin_lock(&mylock);
s200: assert(x == 100);
s300: pthread_spin_unlock(&mylock);
```
Is s200 guaranteed to be true? Is it possible that, by the time s100 acquires the lock, x's new value has still not been flushed from the store buffer to the cache?
I'm wondering:
- Is the call overhead (of `pthread_spin_unlock()`) sufficient to cover the time it takes to flush the store buffer to cache?
- Does `cmpxchg`, or any instruction with an implicit or explicit `LOCK` prefix, magically flush the store buffers on other cores?
If s200 is not guaranteed to be true, what is the least expensive way to fix it?
- insert an `mfence` instruction before the `MOV` instruction,
- replace the `MOV` instruction with an atomic `fetch-and-and`/`fetch-and-or` instruction,
- or something else?
Profuse thanks in advance!