2

Hello,
Currently I am learning how to write multithreaded programs and I tried to implement spinlock in x86-64 assembly. It works by checking L0C_A1 local variable, if it is equal to 1 then thread can enter critical section, if it is equal to 0 then thread will wait in a loop. In critical section thread reads x variable, then squares it and then stores new value back. Execution is delayed by nanosleep system call. I face the following problem: only 1st thread executes critical section. However if I set breakpoint in gdb at line "movb $1,.L0C_A1(%rip) # leave critical section" and step through it and then continue execution then next thread will enter critical section. Why does it work this way? Is it related to cache memory?

Source code:

C code:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_THREADS 10

extern void* func(void*);

unsigned long int x = 3;

static pthread_t thr[MAX_THREADS];

int main(int argc, char *argv[])
{
    unsigned int i;
    for(i = 0; i < MAX_THREADS; ++i)
    {
        pthread_create(&thr[i], NULL, func, NULL);
    }
    for(i = 0; i < MAX_THREADS; ++i)
    {
        pthread_join(thr[i], NULL);
    }
    printf("%lu\n", x);
    exit(0);
}

Assembly code:

.globl func

.extern x

.section .text,"ax",@progbits

func:
    shrb    $1,.L0C_A1(%rip)  # check if critical section can be entered
    jnc     func
    movq    x(%rip),%rax      # load x

    xorl    %esi,%esi
    pushq   $10000000
    pushq   %rsi
    movq    %rsp,%rdi
    pushq   %rax
    leal    35(%rsi),%eax
    syscall                   # sleep 10 ms
    pop     %rax
    addq    $16,%rsp

    xorl    %edx,%edx
    mulq    %rax              # x = x * x
    movq    %rax,x(%rip)      # store new x
    movb    $1,.L0C_A1(%rip)  # leave critical section
    xorl    %eax,%eax
    ret

.section .data,"aw",@progbits

.L0C_A1: .byte 1
Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
pajacol
  • 41
  • 3
  • You need a `lock` prefix for this to be safe on an SMP machine ([Can num++ be atomic for 'int num'?](https://stackoverflow.com/q/39393850)), and `lock shr` doesn't exist. (https://www.felixcloutier.com/x86/lock). – Peter Cordes Nov 19 '21 at 16:25
  • It is but I think that this is syntax for gnu assembler rip relative addressing – pajacol Nov 19 '21 at 16:26
  • Shift seems over-complicated as a locking scheme; the normal way is to `xchg` in a `1` and see if the old value was `0`. [Locks around memory manipulation via inline assembly](https://stackoverflow.com/a/37246263). The test code seems over-complicated, too: sleeping in the critical section reduces test rate. Squaring will overflow and wrap quickly, increment is good because you can see if it loses counts. Also, zeroing RDX before mul is useless, it's a write-only output. (which you aren't reading so you should just `imul %rax, %rax`) – Peter Cordes Nov 19 '21 at 16:28
  • Yes, RIP-relative addressing for static storage is the efficient way, you're doing that correctly. – Peter Cordes Nov 19 '21 at 16:29
  • Yeah, I think this locking scheme would work if you had an atomic SHR (0 >> 1 stays zero = locked, 1>>1 = 0 takes the lock.) The normal way is that 0 is unlocked, 1 is locked, but reversing it is fine. (Except then you can't put your lock in `.bss`, has to be non-zero initialized or it will start out locked.) So it's a duplicate of the canonical about the `lock` prefix and RMW atomicity. – Peter Cordes Nov 19 '21 at 16:31
  • 2
    @PeterCordes Changing `shrb $1,.L0C_A1(%rip)` to `lock btrl $0, .L0C_A1(%rip)` worked, thank you for solution – pajacol Nov 19 '21 at 16:40
  • @pajacol: In which case you should make sure to enlarge your `.L0C_A1` variable to 4 bytes, since `btrl` is a 32-bit instruction (note `l` suffix). And then you want to have your unlock do `movl $1, ...` instead. Also align `.L0C_A1` to 4 bytes by preceding it with `.balign 4`; unaligned `mov` is not atomic, and unaligned `lock xxx` is atomic but slow. It would be nice if you could keep your lock as a byte and use `lock btrb`, but unfortunately `btr` has no 8-bit version. – Nate Eldredge Nov 19 '21 at 19:07
  • 1
    (Also you might want to call your lock variable something more meaningful than `.L0C_A1`. Compilers choose arbitrary names for local symbols, but humans prefer human-readable names.) – Nate Eldredge Nov 19 '21 at 19:07
  • @NateEldredge: `lock btrl` atomically modifies any memory it touches. And the unlock is done with `movb` which is always atomic. The only concern is performance, whether `btrl` will address the aligned dword that contains the indexed bit, or whether it might address a dword starting at the address given in the addressing-mode. In which case yes, it could potentially split across a cache-line boundary. (Although I think it won't here, because each `.o` file's `.data` section should separately have at least 4-byte alignment by default, as set by `as`; `ld` will pad 0-3 bytes if necessary.) – Peter Cordes Nov 20 '21 at 04:00
  • @NateEldredge: Of course, `lock btrl` has few advantages over `lock cmpxchgb` or `xchgb`, other than maybe saving a tiny bit of code-size (mov-immediate to register) before it. But anyway, just aligning the lock byte itself to the start a dword should be fine to avoid any possibility of split locks; if the protected data is in the next 1 to 4 bytes that's totally fine. (Hmm, I should test with perf counters whether a btrl-immediate on a bit in a byte near the end of a cache line gives counts for `sq_misc.split_lock`. If anyone gets around to it, they could post a self-answered Q&A.) – Peter Cordes Nov 20 '21 at 04:08
  • Yeah, if it stays a byte but is aligned, I guess it is okay. If it's not aligned then my bigger concern is that the addressed dword could cross a *page* boundary and perhaps fault. – Nate Eldredge Nov 20 '21 at 04:14

0 Answers0