I'm comparing test-and-set (TAS) and test-and-test-and-set (TTAS) spinlocks. Here is the code (x86-64 assembly, AT&T syntax):
TAS:
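.globl _tas_lock_acquire
_tas_lock_acquire:
repeat:
    lock btsw $0, (%rdi)    # atomically set bit 0; CF receives the old bit
    jc repeat               # old bit was 1: lock is held, retry
    ret
.globl _tas_lock_release
_tas_lock_release:
    lock btrw $0, (%rdi)    # atomically clear bit 0
    ret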
TTAS:
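.globl _ttas_lock_acquire
_ttas_lock_acquire:
try_lock:
    lock btsw $0, (%rdi)    # atomically set bit 0; CF receives the old bit
    jc spinwait             # old bit was 1: lock is held, go spin
    ret
spinwait:
    btsw $0, (%rdi)         # test-and-set bit 0 again, without the lock prefix
    jc spinwait             # old bit was 1: keep spinning
    jmp try_lock            # old bit was 0: retry the locked acquire
.globl _ttas_lock_release
_ttas_lock_release:
    btrw $0, (%rdi)         # clear bit 0, without the lock prefix
    ret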
The TAS lock performs about the same as a C++11 std::atomic_flag lock (no measurable difference), but the TTAS lock is significantly slower, by about three orders of magnitude. I'm testing on an "Intel(R) Core(TM) i3 CPU 540 @ 3.07GHz".
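For reference, a std::atomic_flag spinlock of the kind used as the baseline could look roughly like the sketch below. It is an illustration only, not the actual benchmark: the names AtomicFlagLock and run_contended_increments are hypothetical, and the acquire/release memory orders and the shared-counter workload are assumptions.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical baseline: a test-and-set spinlock built on std::atomic_flag.
struct AtomicFlagLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;

    void acquire() {
        // test_and_set returns the previous value; spin while it was already set.
        while (flag.test_and_set(std::memory_order_acquire)) {
        }
    }

    void release() {
        flag.clear(std::memory_order_release);
    }
};

// Hypothetical harness: each thread increments a shared counter under the lock
// and the final counter value is returned.
long run_contended_increments(int threads, int iters) {
    assert(threads > 0 && iters >= 0);
    AtomicFlagLock lock;
    long counter = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&lock, &counter, iters] {
            for (int i = 0; i < iters; ++i) {
                lock.acquire();
                ++counter;  // protected by the lock
                lock.release();
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return counter;
}
```

Timing run_contended_increments against an equivalent loop that calls the assembly routines is one way to make the comparison concrete; if the lock is correct, the returned count equals threads * iters.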
What mistake in my TTAS code causes this slowdown?