5

I am using cmpxchg (compare-and-exchange) on the i686 architecture for a 32-bit compare and swap, as follows.

(Editor's note: the original 32-bit example was buggy, but the question isn't about it. I believe this version is safe, and as a bonus compiles correctly for x86-64 as well. Also note that inline asm isn't needed or recommended for this; __atomic_compare_exchange_n or the older __sync_bool_compare_and_swap work for int32_t or int64_t on i486 and x86-64. But this question is about doing it with inline asm, in case you still want to.)
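A builtin-based version might look roughly like this (a sketch, not part of the original question; the CAS_builtin name and the seq_cst orderings are arbitrary choices):

#include <stdint.h>

// Sketch using the GCC/Clang builtin instead of inline asm.
// Returns 1 on success; on failure, *expected is updated with the value found in memory.
static int CAS_builtin(int64_t *ptr, int64_t *expected, int64_t newVal)
{
    return __atomic_compare_exchange_n(ptr, expected, newVal,
                                       0,                  /* strong, not weak */
                                       __ATOMIC_SEQ_CST,   /* ordering on success */
                                       __ATOMIC_SEQ_CST);  /* ordering on failure */
}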

// note that this function doesn't return the updated oldVal
static int CAS(int *ptr, int oldVal, int newVal)
{
    unsigned char ret;
    __asm__ __volatile__ (
            "  lock\n"
            "  cmpxchgl %[newval], %[mem]\n"
            "  sete %0\n"
            : "=q" (ret), [mem] "+m" (*ptr), "+a" (oldVal)
            : [newval]"r" (newVal)
            : "memory");    // barrier for compiler reordering around this

    return ret;   // ZF result, 1 on success else 0
}

What is the equivalent for the x86_64 architecture for a 64-bit compare and swap?

static int CAS(long *ptr, long oldVal, long newVal)
{
    unsigned char ret;
    // ?
    return ret;
}
Peter Cordes
  • This has a bug: should be `"+a"(oldval)` because `cmpxchg` updates EAX if the compare fails and the store is not done. (I think we can skip an early-clobber `"+&a"` because the only thing to be written later is `ret`. We don't need to *read* the updated `oldVal` from EAX inside the asm, so if the compiler doesn't need the updated `oldVal`, it's fine if it allocates `ret` in `al`.) And in fact your function doesn't take `oldval` by reference. And BTW, yes this can break after inlining, even though a stand-alone version is safe because of the calling convention. – Peter Cordes Apr 15 '18 at 04:29
  • Also, it falls off the end of a non-void function if `ret==1` (CAS succeeded). Just `return ret;` like a normal person. – Peter Cordes Apr 15 '18 at 04:31
  • Someone should make the obligatory reference to gcc's [built in atomic functions](https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html). There's no need to write this yourself, and plenty of reasons you [shouldn't](https://gcc.gnu.org/wiki/DontUseInlineAsm). – David Wohlferd Apr 15 '18 at 05:38
  • @DavidWohlferd: yup, was about to edit that in the question, but decided that would be too intrusive. [`__sync_bool_compare_and_swap`](https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html#g_t_005f_005fsync-Builtins), or a newer `__atomic_compare_exchange_n`, would solve the whole problem for 32 or 64-bit integers on i386 or x86-64, or whatever architecture you like! With the added benefit of avoiding a `sete` / `test` in a CAS loop, and avoiding an extra load because this crappy version doesn't update `oldVal` by reference. https://gcc.gnu.org/wiki/DontUseInlineAsm. – Peter Cordes Apr 15 '18 at 05:41
  • @DavidWohlferd: changed my mind, thought of better wording to editorialize and added that to the question. – Peter Cordes Apr 15 '18 at 05:48
  • @PeterCordes - Well, if we're updating this for future generations, shouldn't it use [flag outputs](https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#FlagOutputOperands) instead of `sete`? Wonder what's up with the op's userid? I can't click on it? Maybe it's [this](https://stackoverflow.com/users/3871871/prabakaran) guy? – David Wohlferd Apr 15 '18 at 05:52
  • @DavidWohlferd: I was just looking for a non-terrible cmpxchg example to link; I just wanted to link [c inline assembly getting "operand size mismatch" when using cmpxchg](//stackoverflow.com/q/49822854) instead of writing my own. Since inline asm is usually the wrong approach for this, I didn't put in the effort here. It's a totally trivial question; removing the `l` suffix will make the code work with 64-bit operand-size if the args are changed to 64-bit (like normal for x86-64), and we can't rewrite it into a sensible question without invalidating answers. I'll prob. post a good version there. – Peter Cordes Apr 15 '18 at 05:57

4 Answers

7

The x86_64 instruction set has the cmpxchgq (q for quadword) instruction for an 8-byte (64-bit) compare and swap.

There's also a cmpxchg8b instruction which works on 8-byte quantities, but it's more complex to set up, requiring you to use edx:eax and ecx:ebx rather than the more natural 64-bit rax. The reason it exists almost certainly has to do with the fact that Intel needed 64-bit compare-and-swap operations long before x86_64 came along. It still exists in 64-bit mode, but it is no longer the only option.

But, as stated, cmpxchgq is probably the better option for 64-bit code.
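For example, adapting the question's function to 64-bit operands could look roughly like this (a sketch following the question's pattern, untested; int64_t is used instead of long to keep the width explicit):

#include <stdint.h>

// note that, like the 32-bit version, this doesn't return the updated oldVal
static int CAS64(int64_t *ptr, int64_t oldVal, int64_t newVal)
{
    unsigned char ret;
    __asm__ __volatile__ (
            "  lock\n"
            "  cmpxchgq %[newval], %[mem]\n"    // q suffix: 64-bit (quadword) operand size
            "  sete %0\n"
            : "=q" (ret), [mem] "+m" (*ptr), "+a" (oldVal)    // expected value goes in RAX
            : [newval]"r" (newVal)
            : "memory");    // barrier for compiler reordering around this

    return ret;   // ZF result, 1 on success else 0
}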


If you need to cmpxchg a 16 byte object, the 64-bit version of cmpxchg8b is cmpxchg16b. It was missing from the very earliest AMD64 CPUs, so compilers won't generate it for std::atomic::compare_exchange on 16B objects unless you enable -mcx16 (for gcc). Assemblers will assemble it, though, but beware that your binary won't run on the earliest K8 CPUs. (This only applies to cmpxchg16b, not to cmpxchg8b in 64-bit mode, or to cmpxchgq).
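For a 16-byte object from C, one option is the builtin on __int128 (again a sketch, assuming GCC or Clang with -mcx16; depending on compiler version this may inline lock cmpxchg16b or call out to libatomic):

// Compile with -mcx16 (and link with -latomic if the compiler asks for it).
// Returns 1 on success; on failure, *expected is updated with the value found in memory.
static int CAS128(unsigned __int128 *ptr, unsigned __int128 *expected,
                  unsigned __int128 newVal)
{
    return __atomic_compare_exchange_n(ptr, expected, newVal, 0,
                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}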

Peter Cordes
paxdiablo
2

cmpxchg8b

__forceinline int64_t interlockedCompareExchange(volatile int64_t & v,int64_t exValue,int64_t cmpValue)
{
  // 32-bit MSVC inline asm: cmpxchg8b compares EDX:EAX with the 8-byte memory
  // operand; if they match it stores ECX:EBX there, otherwise it loads the
  // memory value into EDX:EAX. Either way the previous value of v ends up in
  // EDX:EAX, which is the int64_t return convention for 32-bit MSVC, so
  // falling off the end of the function returns it.
  __asm {
    mov         esi,v                        ; esi = address of the 64-bit variable
    mov         ebx,dword ptr exValue        ; ecx:ebx = new value to store (exValue)
    mov         ecx,dword ptr exValue + 4
    mov         eax,dword ptr cmpValue       ; edx:eax = expected/compare value (cmpValue)
    mov         edx,dword ptr cmpValue + 4
    lock cmpxchg8b qword ptr [esi]           ; atomic compare-and-swap on [esi]
  }
}
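Presumably the caller has to compare the returned old value against what it passed as cmpValue to detect success, since the ZF result of cmpxchg8b is lost. A hypothetical usage sketch, assuming a 32-bit MSVC build:

// Try to change v from 0 to 1; the CAS succeeded if the old value was 0.
volatile int64_t v = 0;
int64_t old = interlockedCompareExchange(v, /*exValue=*/1, /*cmpValue=*/0);
int success = (old == 0);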
Shay Erlichmen
  • IIRC, cmpxchg8b dates all the way back to the first i486 processors, so compatibility is going to be less of an issue than with cmpxchgq. – Brian Knoblauch Dec 01 '10 at 17:57
  • @Brian, I'm not overly convinced compatibility is a problem here since OP explicitly stated it was for x86_64. What I consider *more* of a potential issue is the "asm gymnastics" required to use `edx:eax/ecx:ebx` rather than the more natural (to me) `rax`. You should also ensure that your calling conventions re allowable register trashing permit the writes to those registers. Otherwise you'll need pushes and pops to protect them. – paxdiablo Oct 21 '16 at 01:38
  • @Shay: In x86-64, `int64_t` fits in `rax`, so you're returning the low half of the value from memory before the `cmpxchg8b`. If you intended this for 32-bit MSVC, rather than x86-64 clang for the x32 ABI (32-bit pointers in long mode), which could also compile this syntax, you should say so. Anyway, assuming you do return the full value, the caller would have to compare the return value with what it passed for `cmpValue` to see if the compare succeeded, I guess? Because you lose the flag result from `cmpxchg8b`. – Peter Cordes Apr 15 '18 at 05:12
  • Also, the var names seem bogus. Expected value and compare value both describe the `edx:eax` input, i.e. what you expect to find in memory. `ecx:ebx` is the new value (stored if compare succeeds), so IDK what `exValue` is supposed to stand for. – Peter Cordes Apr 15 '18 at 05:15
  • @PeterCordes I didn't write the code I used it from FFMPEG (I think, it was ALMOST 10 years ago) – Shay Erlichmen Apr 16 '18 at 12:51
1

The x64 architecture supports a 64-bit compare-exchange using the good, old cmpxchg instruction. Or you could use the somewhat more complicated cmpxchg8b instruction (quoting from the "AMD64 Architecture Programmer's Manual Volume 1: Application Programming"):

The CMPXCHG instruction compares a value in the AL or rAX register with the first (destination) operand, and sets the arithmetic flags (ZF, OF, SF, AF, CF, PF) according to the result. If the compared values are equal, the source operand is loaded into the destination operand. If they are not equal, the first operand is loaded into the accumulator. CMPXCHG can be used to try to intercept a semaphore, i.e. test if its state is free, and if so, load a new value into the semaphore, making its state busy. The test and load are performed atomically, so that concurrent processes or threads which use the semaphore to access a shared object will not conflict.

The CMPXCHG8B instruction compares the 64-bit values in the EDX:EAX registers with a 64-bit memory location. If the values are equal, the zero flag (ZF) is set, and the ECX:EBX value is copied to the memory location. Otherwise, the ZF flag is cleared, and the memory value is copied to EDX:EAX.

The CMPXCHG16B instruction compares the 128-bit value in the RDX:RAX and RCX:RBX registers with a 128-bit memory location. If the values are equal, the zero flag (ZF) is set, and the RCX:RBX value is copied to the memory location. Otherwise, the ZF flag is cleared, and the memory value is copied to RDX:RAX.

Different assembler syntaxes may need to have the length of the operations specified in the instruction mnemonic if the size of the operands can't be inferred. This may be the case for GCC's inline assembler - I don't know.
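For example (an illustration, not from the manual), GAS/AT&T syntax can take the operand size either from the register operand or from an explicit mnemonic suffix:

lock cmpxchg  %rsi, (%rdi)    # size inferred from the 64-bit register %rsi
lock cmpxchgq %rsi, (%rdi)    # explicit quadword operand size via the q suffix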

Evan Carroll
Michael Burr
-1

Usage of cmpxchg8b, from the AMD64 Architecture Programmer's Manual, Volume 3:

Compare EDX:EAX register to 64-bit memory location. If equal, set the zero flag (ZF) to 1 and copy the ECX:EBX register to the memory location. Otherwise, copy the memory location to EDX:EAX and clear the zero flag.

I use cmpxchg8b to implement a simple mutex lock function on an x86-64 machine. Here is the code:

.text
.align 8
.global mutex_lock
mutex_lock:
    pushq   %rbp
    movq    %rsp,   %rbp

    jmp .L1                     # redundant: execution falls through to .L1 anyway

.L1:
    movl    $0, %edx            # edx:eax = 0, the expected (unlocked) value
    movl    $0, %eax
    movl    $0, %ecx            # ecx:ebx = 1, the new (locked) value
    movl    $1, %ebx
    lock    cmpxchg8B   (%rdi)  # if the 8-byte lock word at (%rdi) is 0, store 1 atomically
    jne .L1                     # otherwise spin until the lock is free
    popq    %rbp
    ret
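From C, the routine above would presumably be used along these lines (hypothetical prototype and lock word, not part of the original answer; the lock is released with a plain release store):

extern void mutex_lock(long long *lock);   /* the asm above: spins until it swaps *lock from 0 to 1 */

static long long lock_word;                /* zero-initialized = unlocked */

void with_lock(void)
{
    mutex_lock(&lock_word);                              /* acquire */
    /* ... critical section ... */
    __atomic_store_n(&lock_word, 0, __ATOMIC_RELEASE);   /* release */
}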
xinghua
  • Downvoted for being super-overcomplicated as well as recommending a less-efficient instruction. `jmp .L1` is a no-op; execution always continues to the next instruction on its own. And `cmpxchg8b` is much more complicated to use than qword `cmpxchg`. e.g. `mutex_lock:` `xor %eax,%eax;` `mov $1,%edx;` `lock cmpxchg %rdx, (%rdi);` `jne mutex_lock;` `ret` implements your whole function. (Feel free to replace your code with mine, I'd be happy to remove my downvote if you improve the answer.) – Peter Cordes Apr 15 '18 at 04:59
  • That's not a nice comment considering, mutex_lock: mov $1, %edx; .mutex: xor %eax, %eax; lock cmpxchg %rdx, (%rdi); jne .mutex; ret; would +1 that one even. Whether an example showing usage is a "Best Practice" example, or just a simple example to show a possibility, it's still an example. He didn't explicitly state he was giving "Best Practice" examples exactly. :o cmpxchg16b would probably require some kind of atomic string movement procedure to even find a purpose, which would be hard to give best practice on anyway, as there will probably always be better ways, especially compared to cmpxchg8b. Idk – GodDamn Apr 16 '21 at 03:27
  • The main difference is the way it's called. cmpxchg has two explicit operands while cmpxchg{8,16}b has only the one. The memory operand is involved in both instructions. While cmpxchg{8,16}b implicitly uses both {e,r}cx:{e,r}bx as the new replacement value and {e,r}dx:{e,r}ax as the old comparison value, cmpxchg on the other hand requires a second register operand to be explicitly stated as the replacement value. There's tons of instructions that partly overlap though honestly, and are all just preference. AT&T is more complicated than Intel. So, why bash an answer of complications with more complications? – GodDamn Apr 16 '21 at 03:42
  • cmpxchg{8,16}b can atomically load/store an 8- or 16-byte value in one operation while potentially being done repeatedly, and that's all there is to it honestly... – GodDamn Apr 16 '21 at 04:05