
I'm comparing TAS vs TTAS locking. Here is the code:

TAS:

.globl _tas_lock_acquire
_tas_lock_acquire:
    repeat:
        lock btsw $0, (%rdi)
    jc repeat
    ret

.globl _tas_lock_release
_tas_lock_release:
    lock btrw $0, (%rdi)
    ret

TTAS:

.globl _ttas_lock_acquire
_ttas_lock_acquire:
    try_lock:
        lock btsw $0, (%rdi)
        jc spinwait
        ret
    spinwait:
        btsw $0, (%rdi)
        jc spinwait
        jmp try_lock

.globl _ttas_lock_release
_ttas_lock_release:
    btrw $0, (%rdi)
    ret

The TAS lock performs about the same as C++11 `atomic_flag` (no measurable difference), but the TTAS lock is significantly slower (by about three orders of magnitude). I'm testing on an "Intel(R) Core(TM) i3 CPU 540 @ 3.07GHz".

What mistake in my code causes the slowdown?

Peter Cordes
  • The question arises from chapter 7 of "The Art of Multiprocessor Programming", which compares TAS and TTAS. –  May 03 '15 at 22:54
  • Unlock should be just a pure store. You do own the lock at that point so it's not a bug to do a non-atomic RMW on the whole 16-bit word containing the lock, but it's slower for no benefit. (And `lock btr` is even slower). If other threads could be doing anything with the other bits in that word, `btr` without `lock` could step on them. (Or actually the 32-bit dword that contains it because you used a bitstring instruction. 16-bit operand-size is normally the worst choice in 64-bit mode.) Anyway, just use `mov` to unlock. – Peter Cordes Oct 04 '21 at 15:46
  • Also, put a `pause` inside your spin loops, especially the one that (is supposed to) spin read-only. And if you aren't using other bits in the low byte for anything else, you don't need bit instructions, just `xchg`, like in [Locks around memory manipulation via inline assembly](https://stackoverflow.com/a/37246263) – Peter Cordes Oct 04 '21 at 15:47

1 Answer

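My mistake is in the `spinwait` loop of `_ttas_lock_acquire`: it should use the `bt` instruction, not `bts`. `bts` is a read-modify-write that stores to the lock word on every iteration (and, without `lock`, can set the lock bit itself), whereas `bt` only reads it, which is the whole point of the "test" phase in TTAS.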
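For comparison with the C++11 `atomic_flag` baseline, the same idea can be sketched in C++ with `std::atomic<bool>`. This is only an illustrative sketch, not the code that was benchmarked: the names `TtasLock` and `run_counter_demo` are hypothetical, the memory orders are my own choice, and `std::this_thread::yield()` stands in as a portable substitute for the `pause` instruction suggested in the comments.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical TTAS spinlock (name and memory orders are assumptions).
class TtasLock {
    std::atomic<bool> locked_{false};

public:
    void lock() {
        for (;;) {
            // Test-and-set: a single atomic read-modify-write,
            // playing the role of `lock bts` in the assembly version.
            if (!locked_.exchange(true, std::memory_order_acquire))
                return;
            // Test: spin with plain loads only, playing the role of `bt`.
            // Nothing is written to the lock while waiting.
            while (locked_.load(std::memory_order_relaxed))
                std::this_thread::yield();  // stand-in for `pause`
        }
    }

    void unlock() {
        // Release is a plain store; no atomic read-modify-write is needed.
        locked_.store(false, std::memory_order_release);
    }
};

// Hypothetical helper: several threads increment a shared counter
// under the lock; returns the final counter value.
long run_counter_demo(int threads, int iters) {
    TtasLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

Calling `run_counter_demo(4, 10000)` should return 40000 if the lock provides mutual exclusion. The key property is that the inner wait loop performs loads only, so waiting threads do not write to the lock word until it appears free.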