Correct way to wrap CMPXCHG8B in GCC inline assembly, 32 bits

Question

I'm trying to write GCC inline asm for CMPXCHG8B for ia32. No, I cannot use __sync_bool_compare_and_swap. It has to work with and without -fPIC.

So far the best I've come up with (EDIT: it does not work after all; see my own answer below for details) is

register int32 ebx_val asm("ebx")= set & 0xFFFFFFFF;
asm ("lock; cmpxchg8b %0;"
     "setz %1;"
     : "+m" (*a), "=q" (ret), "+A" (*cmp)
     : "r" (ebx_val), "c" ((int32)(set >> 32))
     : "flags")

However I'm not sure if this is in fact correct.

I cannot use "b" ((int32)(set & 0xFFFFFFFF)) for ebx_val because of PIC, but apparently a register asm("ebx") variable is accepted by the compiler.

BONUS: the ret variable is used for branching, so the code ends up looking like this:

cmpxchg8b [edi];
setz cl;
cmp cl, 0;
je foo;

Any idea how to describe output operands so that it becomes:

cmpxchg8b [edi]
jz foo

?

Thank you.

The fact that the compiler intrinsic doesn't work with -fPIC is just a flat-out compiler bug: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37651 It sucks when you have to work around a broken compiler, so you may want to put yourself on the CC list for that bug. — Crashworks, Jul 20 '11 at 04:30
@Gabe - It's most important when writing shared library objects. Ulrich Drepper has a good paper on the subject: http://www.akkadia.org/drepper/dsohowto.pdf — Crashworks, Jul 20 '11 at 04:37
@Crash: That's an interesting paper, but it doesn't seem to address PIC's relative advantages on IA32. I understand that it has advantages in terms of linking (faster loads, less memory used), but shared libraries are usually only loaded once per process and computers tend to have gigabytes of memory nowadays. I can see why you'd want to use PIC on x64 or IA64 where it's relatively cheap, but what about IA32 (x86)? – Gabe, Jul 20 '11 at 04:56
@Crashworks, I couldn't use the builtin even if that bug did not exist :( Have to support quite old compilers. — Laurynas Biveinis, Jul 20 '11 at 05:05
@Gabe, I'm working on a large 3rd party software project and I don't decide on the compilation options, that's a given. Also if that's useful, the code also has to work on Mac OS X and Solaris, could be something about these platforms. — Laurynas Biveinis, Jul 20 '11 at 05:08

score 2 · Answer 1 · answered Jul 20 '11 at 05:25

This is what I have:

bool
spin_lock(int64_t* lock, int64_t thread_id, int tries)
{
    register int32_t pic_hack asm("ebx") = thread_id & 0xffffffff;
retry:
    if (tries-- > 0) {
        asm goto ("lock cmpxchg8b %0; jnz %l[retry]"
                  :
                  : "m" (*lock), "A" ((int64_t) 0),
                    "c" ((int32_t) (thread_id >> 32)), "r" (pic_hack)
                  :
                  : retry);
        return true;
    }
    return false;
}

It uses the asm goto feature, new with gcc 4.5, that allows jumps from inline assembly into C labels. (Oh, I see your comment about having to support old versions of gcc. Oh well. I tried. :-P)

C. K. Young


Thanks! I like that my code is already quite similar to yours, although of course I cannot use asm goto. A couple of questions, though: 1) why is the *lock operand input-only and not input/output? 2) why is EFLAGS not in the clobber list? – Laurynas Biveinis Jul 20 '11 at 05:36
@Laurynas: 1. `asm goto` cannot have any output constraints (a current limitation that may be removed in later gcc versions); since I don't "care" about the current value of the lock (we're not trying to do recursive locking ;-)), that was acceptable. 2. Because the examples for `asm goto` don't have it either (and yes, it does conditional jumps too), so I presume that `asm goto` assumes the flags are clobbered by default. – C. K. Young Jul 20 '11 at 05:38
For future readers: gcc's machine-definitions for x86 and x86-64 make every inline asm statement implicitly include a `"cc"` clobber. You never need to write one explicitly for x86 asm. – Peter Cordes Jun 11 '16 at 21:11
The asm statement does modify things, so it needs to use a `"memory"` clobber, and register clobbers on all regs that it can change. [The `asm goto` docs have an example of that](https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#GotoLabels). – Peter Cordes Jun 11 '16 at 21:24
Also, for practical use, it's usually better to spin on a `pause`+ non-locked load in the retry case, instead of hammering away with `lock`ed instructions (causing contention that can delay the thread trying to unlock). I know you're only using `cmpxchg8b` for a spinlock as an example to demonstrate the `asm goto`, but using cmpxchg for a spinlock isn't needed. [A simple `xchg` is all you need.](http://stackoverflow.com/questions/37241553/locks-around-memory-manipulation-via-inline-assembly/37246263#37246263) – Peter Cordes Jun 11 '16 at 21:46
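The spinning advice in the last comment can be sketched as follows. This is only an illustration, not code from the thread: `demo_spinlock`, `demo_cpu_relax`, `demo_trylock` and `demo_unlock` are hypothetical names, and it assumes a compiler with the `__atomic` builtins (GCC 4.7 or later), which the original poster cannot rely on. With an older compiler the exchange would have to be an inline-asm `xchg` instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical lock type for illustration: 0 = free, 1 = held. */
typedef struct { int locked; } demo_spinlock;

/* PAUSE hint on x86; compiles to nothing on other targets. */
static inline void demo_cpu_relax(void)
{
#if defined(__i386__) || defined(__x86_64__)
    __asm__ __volatile__("pause" ::: "memory");
#endif
}

/* Try up to `tries` times; returns 1 if the lock was acquired, 0 otherwise. */
static int demo_trylock(demo_spinlock *l, int tries)
{
    assert(l != NULL);
    for (; tries > 0; tries--) {
        /* Plain load first, so waiters do not issue locked instructions
           while the lock is visibly held; only then attempt the exchange. */
        if (__atomic_load_n(&l->locked, __ATOMIC_RELAXED) == 0 &&
            __atomic_exchange_n(&l->locked, 1, __ATOMIC_ACQUIRE) == 0)
            return 1;
        demo_cpu_relax();
    }
    return 0;
}

/* Release the lock with a plain store that has release ordering. */
static void demo_unlock(demo_spinlock *l)
{
    assert(l != NULL);
    __atomic_store_n(&l->locked, 0, __ATOMIC_RELEASE);
}
```

On x86 the exchange should compile to a plain `xchg` with a memory operand, which is implicitly locked, so no `cmpxchg` is involved.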

user786653 · Accepted Answer · 2011-07-21T13:14:17.920

How about the following, which seems to work for me in a small test:

int sbcas(uint64_t* ptr, uint64_t oldval, uint64_t newval)
{
    int changed = 0;
    __asm__ (
        "push %%ebx\n\t" // -fPIC uses ebx, so save it
        "mov %5, %%ebx\n\t" // load ebx with needed value
        "lock\n\t"
        "cmpxchg8b %0\n\t" // perform CAS operation
        "setz %%al\n\t" // eax potentially modified anyway
        "movzx %%al, %1\n\t" // store result of comparison in 'changed'
        "pop %%ebx\n\t" // restore ebx
        : "+m" (*ptr), "=r" (changed)
        : "d" ((uint32_t)(oldval >> 32)), "a" ((uint32_t)(oldval & 0xffffffff)), "c" ((uint32_t)(newval >> 32)), "r" ((uint32_t)(newval & 0xffffffff))
        : "flags", "memory"
        );
    return changed;
}

If this also gets miscompiled could you please include a small snippet that triggers this behavior?

Regarding the bonus question, I don't think it is possible to branch after the assembler block using the condition code from the cmpxchg8b instruction (unless you use asm goto or similar functionality). From the GNU C Language Extensions documentation:

It is a natural idea to look for a way to give access to the condition code left by the assembler instruction. However, when we attempted to implement this, we found no way to make it work reliably. The problem is that output operands might need reloading, which would result in additional following "store" instructions. On most machines, these instructions would alter the condition code before there was time to test it. This problem doesn't arise for ordinary "test" and "compare" instructions because they don't have any output operands.
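For readers on newer compilers: the quoted passage predates flag output operands, which GCC 6 and later support on x86 (`"=@ccz"` and related constraints). They let the compiler branch on ZF directly, which is what the bonus question asks for. The following is a minimal sketch under several assumptions: `demo_cas64` is a hypothetical name; the asm path is taken only when the compiler defines `__GCC_ASM_FLAG_OUTPUTS__` on an x86 target; and the `"b"` constraint is assumed to be usable (x86-64, non-PIC ia32, or PIC with a GCC recent enough that EBX is no longer reserved). Elsewhere it falls back to the `__atomic` builtin. It does not help with the old compilers the question targets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: 64-bit compare-and-swap.
   Returns nonzero on success; on failure, *expected receives the value
   that was actually found in *ptr. */
static int demo_cas64(volatile uint64_t *ptr, uint64_t *expected,
                      uint64_t desired)
{
    assert(ptr != NULL && expected != NULL);
#if defined(__GCC_ASM_FLAG_OUTPUTS__) && \
    (defined(__i386__) || defined(__x86_64__))
    uint32_t lo = (uint32_t)*expected;          /* compared via EAX */
    uint32_t hi = (uint32_t)(*expected >> 32);  /* compared via EDX */
    int ok;
    __asm__ __volatile__(
        "lock; cmpxchg8b %[mem]"
        : [mem] "+m" (*ptr),
          "=@ccz" (ok),            /* ok = ZF, with no explicit setz */
          "+a" (lo), "+d" (hi)     /* receive the old value on failure */
        : "b" ((uint32_t)desired), "c" ((uint32_t)(desired >> 32))
        : "memory");
    *expected = ((uint64_t)hi << 32) | lo;
    return ok;
#else
    /* Assumed fallback for other targets or compilers without flag outputs. */
    return __atomic_compare_exchange_n(ptr, expected, desired, 0,
                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
#endif
}
```

When the result is used directly in an `if`, the compiler is free to emit a conditional jump on ZF right after the `cmpxchg8b` rather than materializing `ok` in a register first.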

EDIT: I can't find any source that specifies one way or the other whether it is OK to modify the stack while also using the %N input operands. (This ancient link says "You can even push your registers onto the stack, use them, and put them back." but its example has no inputs.)

It should be possible to sidestep the question by pinning the values to specific registers, so that no operand is addressed relative to ESP:

int sbcas(uint64_t* ptr, uint64_t oldval, uint64_t newval)
{
    int changed = 0;
    __asm__ (
        "push %%ebx\n\t" // -fPIC uses ebx
        "mov %%edi, %%ebx\n\t" // load ebx with needed value
        "lock\n\t"
        "cmpxchg8b (%%esi)\n\t"
        "setz %%al\n\t" // eax potentially modified anyway
        "movzx %%al, %1\n\t"
        "pop %%ebx\n\t"
        : "+S" (ptr), "=a" (changed)
        : "0" (ptr), "d" ((uint32_t)(oldval >> 32)), "a" ((uint32_t)(oldval & 0xffffffff)), "c" ((uint32_t)(newval >> 32)), "D" ((uint32_t)(newval & 0xffffffff))
        : "flags", "memory"
        );
    return changed;
}

Thank you! One issue I can see is same as with my snippet: the compiler might address %0 through ESP and cannot tell that ESP has changed by push/pop. Also, thank you for the info re. condition codes in output, it has confirmed what I had suspected. — Laurynas Biveinis, Jul 21 '11 at 12:39
I don't think I've ever seen GCC do that though. It (always?) does it through `ebp` as far as I have seen. I'll see if I can dig up a reference on whether this is always guaranteed (or if it can be made so). — user786653, Jul 21 '11 at 12:43
@Laurynas Biveinis: I've updated my answer with a version that should avoid that problem altogether. – user786653, Jul 21 '11 at 13:14
Why is the `"0" (ptr)` operand there? With all the values pinned to registers, I sometimes got GCC errors about failing reloads to satisfy asm constraints. I'll give this version a spin and report how it goes. – Laurynas Biveinis, Jul 22 '11 at 02:52
@Laurynas: It's a [Matching(digit) constraint](http://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html#ss6.1) to make sure ptr is treated as both input and output. I'm not sure it's strictly necessary together with the '+' constraint, but I figured it couldn't hurt to be more explicit. — user786653, Jul 22 '11 at 08:29
I cannot get your code to compile, probably I'm doing something stupid. The movzx instruction is missing the width suffixes. I have tried movzxbl, but then it got "instantiated" as movzxbl %al, %al, which is both wrong and redundant. — Laurynas Biveinis, Jul 26 '11 at 12:34
Your `changed` variable needs to be a 32-bit int. You should be able to change that line to `movzx %%al, %%eax` and have it work. (Or change the type of `changed`.) Are you trying to use the function in some other context than as a stand-alone function? That might change things. – user786653, Jul 26 '11 at 12:38
This code does not preserve the old value (which cmpxchg8b loads into EDX:EAX if the values are not equal). I have a feeling that after the modifications I need (yours is a function, mine is a macro that gets expanded in several other macros/functions), this code will end up quite close to mine in the other answer, the main difference being that I load the values directly while your code loads through the pointer and then clobbers memory. – Laurynas Biveinis, Jul 30 '11 at 09:11

Laurynas Biveinis · Answer 3 · 2011-07-22T02:48:32.923

Amazingly enough, the code fragment in the question still gets miscompiled in some circumstances: if the zero-th asm operand is indirectly addressable through EBX (PIC) before the EBX register is set up with register asm, then gcc proceeds to load the operand through EBX after it's assigned to set & 0xFFFFFFFF!

This is the code I am trying to make work now: (EDIT: avoid push/pop)

asm ("movl %%edi, -4(%%esp);"
     "leal %0, %%edi;" 
     "xchgl %%ebx, %%esi;"
     "lock; cmpxchg8b (%%edi);" // Sets ZF
     "movl %%esi, %%ebx;"       // Preserves ZF
     "movl -4(%%esp), %%edi;"   // Preserves ZF
     "setz %1;"                 // Reads ZF
     : "+m" (*a), "=q" (ret), "+A" (*cmp)
     : "S" ((int32)(set & 0xFFFFFFFF)), "c" ((int32)(set >> 32))
     : "flags")

The idea here is to load the operands before clobbering EBX, and to avoid any indirect addressing while EBX holds the value for CMPXCHG8B. I pin the lower half of the operand to the hard register ESI because otherwise GCC would feel free to reuse any other already-taken register if it could prove that the value was equal. The EDI register is saved manually, because simply adding it to the clobber list chokes GCC with "impossible reloads", probably due to high register pressure. PUSH/POP is avoided when saving EDI, as other operands might be ESP-addressed.
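The same idea can be sketched in function form without a stack scratch slot. This is an untested illustration under stated assumptions, not the code I ended up using: `demo_cas64_pic` is a hypothetical name, the pointer is pinned to EDI through a `"D"` constraint instead of being loaded with LEAL, and the non-ia32 branch (an `__atomic` builtin) exists only so the sketch builds elsewhere. One reason to avoid the `-4(%%esp)` slot: as far as I know, the ia32 System V ABI has no red zone, so a signal handler could overwrite memory below ESP.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: 64-bit CAS that never names EBX as an operand.
   Returns nonzero on success; on failure, *expected receives the value
   that was actually found in *ptr. */
static int demo_cas64_pic(volatile uint64_t *ptr, uint64_t *expected,
                          uint64_t desired)
{
    assert(ptr != NULL && expected != NULL);
#if defined(__i386__)
    uint32_t lo = (uint32_t)*expected;
    uint32_t hi = (uint32_t)(*expected >> 32);
    uint32_t new_lo = (uint32_t)desired;    /* pinned to ESI below */
    unsigned char ok;
    __asm__ __volatile__(
        "xchgl %%ebx, %[nlo]\n\t"     /* EBX = new low half, ESI = saved EBX */
        "lock; cmpxchg8b (%[p])\n\t"  /* sets ZF on success */
        "xchgl %%ebx, %[nlo]\n\t"     /* restore EBX; XCHG leaves flags alone */
        "setz %[ok]"                  /* reads ZF */
        : [ok] "=q" (ok), "+a" (lo), "+d" (hi), [nlo] "+S" (new_lo)
        : [p] "D" (ptr), "c" ((uint32_t)(desired >> 32))
        : "memory", "cc");
    *expected = ((uint64_t)hi << 32) | lo;
    return ok;
#else
    /* Assumed fallback so the sketch also builds on other targets. */
    return __atomic_compare_exchange_n(ptr, expected, desired, 0,
                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
#endif
}
```

Because the memory operand is reached only through EDI, it cannot be addressed through EBX or ESP while EBX holds the new value. The cost is the same register pressure described above: EAX, ECX, EDX, ESI and EDI are all pinned.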
