
I'm trying to use cmpxchg with inline assembly through C. This is my code:

static inline int
cas(volatile void* addr, int expected, int newval) {
    int ret;
    asm volatile("movl %2 , %%eax\n\t"
                "lock; cmpxchg %0, %3\n\t"
                "pushfl\n\t"
                "popl %1\n\t"
                "and $0x0040, %1\n\t"
                : "+m" (*(int*)addr), "=r" (ret)
                : "r" (expected), "r" (newval)
                : "%eax"
                );
    return ret;
}

This is my first time using inline asm and I'm not sure what could be causing this problem. I tried "cmpxchgl" as well, but still nothing. I also tried removing the lock. I get "operand size mismatch". I think maybe it has something to do with the casting I do to addr, but I'm unsure. I try to exchange int for int, so I don't really understand why there would be a size mismatch. This is using AT&T style. Thanks

  • Trying to use inline asm is almost always a [bad idea](https://gcc.gnu.org/wiki/DontUseInlineAsm). Have you considered using intrinsics or libraries? – David Wohlferd Apr 13 '18 at 21:41
  • You might also look at https://stackoverflow.com/a/37825052/2189500 – David Wohlferd Apr 13 '18 at 21:48
  • It's a school assignment, so we have to use it. – Ace66 Apr 13 '18 at 21:55
  • Thank you David for the link, I will look through it – Ace66 Apr 13 '18 at 21:56
  • I know you don't get to control what assignments your teacher gives you, but inline asm seems like a poor use of students' time. It's not portable between compilers, it's not portable between platforms, it's *extremely* hard to get right, and even when it "runs" you might still have obscure problems that bite you later. It also messes with compiler optimizations and sacrifices chances for optimizations like constant propagation. And when you are done, the only skill you have acquired is learning how this particular incarnation of inline asm works, something you should work hard to avoid ever using in production code. – David Wohlferd Apr 13 '18 at 23:08

2 Answers


As @prl points out, you reversed the operands, putting them in Intel order (see Intel's manual entry for cmpxchg). Any time your inline asm doesn't assemble, you should look at the asm the compiler was feeding to the assembler to see what happened to your template. In your case, simply remove the static inline so the compiler will make a stand-alone definition; then you get (on the Godbolt compiler explorer):

 # gcc -S output for the original, with cmpxchg operands backwards
    movl %edx , %eax
    lock; cmpxchg (%ecx), %ebx        # error on this line from the assembler
    pushfl
    popl %edx
    and $0x0040, %edx

Sometimes that will clue your eye / brain in cases where staring at %3 and %0 didn't, especially after you check the instruction-set reference manual entry for cmpxchg and see that the memory operand is the destination (Intel-syntax first operand, AT&T syntax last operand).

This makes sense because the explicit register operand is only ever a source, while EAX and the memory operand are both read and then one or the other is written depending on the success of the compare. (And semantically you use cmpxchg as a conditional store to a memory destination.)
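
As a rough model (my own sketch, not pseudocode from the manual): the whole read-compare-write happens as one atomic step, with EAX as the implicit operand.

// Rough C model of AT&T-syntax `lock cmpxchg %reg, mem`;
// the entire function body executes atomically.
static int cmpxchg_model(int *mem, int *eax, int reg) {
    if (*eax == *mem) {     // compare EAX with the memory operand
        *mem = reg;         // equal: store the source register, ZF=1
        return 1;
    } else {
        *eax = *mem;        // not equal: load the old value into EAX, ZF=0
        return 0;
    }
}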


You're discarding the load result from the cas-failure case. I can't think of any use-cases for cmpxchg where doing a separate load of the atomic value would be incorrect, rather than just inefficient, but the usual semantics for a CAS function is that oldval is taken by reference and updated on failure. (At least that's how C++11 std::atomic and C11 stdatomic do it with bool atomic_compare_exchange_weak( volatile A *obj, C* expected, C desired );.)

(The weak/strong thing allows better code-gen for CAS retry-loops on targets that use LL/SC, where spurious failure is possible due to an interrupt or the line being rewritten with the same value. x86's lock cmpxchg is "strong".)
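
For example, a typical retry loop leans on that update-on-failure behavior; here's a minimal sketch with the standard C11 API (an illustration, not code from the question):

#include <stdatomic.h>

// Atomically increment via a CAS retry loop. On failure,
// atomic_compare_exchange_weak writes the value it actually saw
// back into 'expected', so no separate reload is needed.
static void atomic_increment(_Atomic int *obj) {
    int expected = atomic_load(obj);
    while (!atomic_compare_exchange_weak(obj, &expected, expected + 1)) {
        // 'expected' was refreshed by the failed CAS; just retry
    }
}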

Actually, GCC's legacy __sync builtins provide 2 separate CAS functions: one that returns the old value, and one that returns a bool. Both take the expected-old and new values by value, so the caller compares the return value itself to detect failure. So it's not the same API that C++11 uses, but apparently it wasn't so horrible that nobody used it.
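
Illustrative wrappers around the two flavors (the builtins are real; the wrapper names are mine):

// One flavor returns the previous value of *ptr...
static inline int cas_val(int *ptr, int oldval, int newval) {
    // compare the result against oldval to detect success
    return __sync_val_compare_and_swap(ptr, oldval, newval);
}

// ...the other returns whether the swap happened.
static inline _Bool cas_bool(int *ptr, int oldval, int newval) {
    return __sync_bool_compare_and_swap(ptr, oldval, newval);
}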


Your overcomplicated code isn't portable to x86-64. From your use of popl, I assume you developed it on x86-32. You don't need pushf/pop to get ZF as an integer; that's what setcc is for. cmpxchg example for 64 bit integer has a 32-bit example that works that way (to show what they want a 64-bit version of).
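
If flag-output operands aren't available (pre-GCC6), a setcc version might look like this (my untested sketch; it assembles the same way on i386 and x86-64):

// Same CAS, but materializing ZF with setz instead of pushf/pop.
static inline int cas_setz(volatile int *addr, int expected, int newval)
{
    unsigned char success;
    __asm__ __volatile__("lock; cmpxchg %3, %0\n\t"
                         "setz %1"              // success = ZF
                         : "+m" (*addr), "=q" (success), "+a" (expected)
                         : "r" (newval)
                         : "memory");
    return success;
}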

Or even better, use GCC6 flag-return syntax so using this in a loop can compile to a cmpxchg / jne loop instead of cmpxchg / setz %al / test %al,%al / jnz.

We can fix all of those problems and improve the register allocation as well. (If the first or last instruction of an inline-asm statement is mov, you're probably using constraints inefficiently.)

Of course, by far the best thing for real usage would be to use C11 stdatomic or a GCC builtin. Don't use inline asm (https://gcc.gnu.org/wiki/DontUseInlineAsm) in cases where the compiler can emit asm that's just as good (or better) from code it "understands", because inline asm constrains the compiler. It's also difficult to write correctly and efficiently, and hard to maintain.
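
For instance, with the documented __atomic builtin (a sketch; the compiler emits lock cmpxchg itself and can still optimize around it):

static inline _Bool cas_builtin(int *ptr, int *expected, int newval)
{
    // 0 = strong: no spurious failure allowed
    return __atomic_compare_exchange_n(ptr, expected, newval, 0,
                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}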

Portable to i386 and x86-64, AT&T or Intel syntax, and works for any integer type of register width or smaller:

// Note: oldVal by reference
static inline char CAS_flagout(int *ptr, int *poldVal, int newVal)
{
    char ret;
    __asm__ __volatile__ (
            "  lock; cmpxchg  {%[newval], %[mem] | %[mem], %[newval]}\n"
            : "=@ccz" (ret), [mem] "+m" (*ptr), "+a" (*poldVal)
            : [newval]"r" (newVal)
            : "memory");    // barrier for compiler reordering around this

    return ret;   // ZF result, 1 on success else 0
}


// spinning read-only is much better (with _mm_pause in the retry loop)
// not hammering on the cache line with lock cmpxchg.
// This is over-simplified so the asm is super-simple.
void cas_retry(int *lock) {
    int oldval = 0;
    while(!CAS_flagout(lock, &oldval, 1)) oldval = 0;
}

The { foo,bar | bar,foo } syntax is for ASM dialect alternatives. For x86, it's {AT&T | Intel}. The %[newval] is a named operand constraint; it's another way to keep your operands straight instead of referring to them by number. The "=@ccz" takes the z condition code as the output value, like a setz.
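
A tiny standalone illustration of both features (a toy of mine, not part of the CAS code):

// Emits `add $1, %reg` under -masm=att, `add reg, 1` under -masm=intel.
static inline int add_one(int x) {
    __asm__("add {$1, %[val] | %[val], 1}"
            : [val] "+r" (x));
    return x;
}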

cas_retry compiles on Godbolt to this asm for 32-bit x86 with AT&T output:

cas_retry:
    pushl   %ebx
    movl    8(%esp), %edx      # load the pointer arg.
    movl    $1, %ecx
    xorl    %ebx, %ebx
.L2:
    movl    %ebx, %eax          # xor %eax,%eax would save a lot of insns
      lock; cmpxchg  %ecx, (%edx) 

    jne     .L2
    popl    %ebx
    ret

gcc is dumb and stores a 0 in one reg before copying it to eax, instead of re-zeroing eax inside the loop. This is why it needs to save/restore EBX at all. It's the same asm we get from avoiding inline-asm, though (from x86 spinlock using cmpxchg):

// also omits _mm_pause and read-only retry, see the linked question
void spin_lock_oversimplified(int *p) {
    while(!__sync_bool_compare_and_swap(p, 0, 1));
}

Someone should teach gcc that Intel CPUs can materialize a 0 more cheaply with xor-zeroing than they can copy it with mov, especially on Sandybridge (xor-zeroing elimination but no mov-elimination).

Peter Cordes
  • Once upon a time, you were supposed to use the `pause` instruction while waiting for memory to get updated by another thread. Is that no longer correct? Or were you just simplifying? – David Wohlferd Apr 15 '18 at 07:10
  • @DavidWohlferd: That's what `_oversimplified` is for in the function names :P – Peter Cordes Apr 15 '18 at 07:17
  • Makes sense. Things change, so I wasn't sure. – David Wohlferd Apr 15 '18 at 07:18
  • @DavidWohlferd: since you were wondering, I'm sure others might as well. Made it explicit what's being oversimplified. – Peter Cordes Apr 15 '18 at 07:20
  • wow thanks for a great post. I will read it and change it accordingly. (: – Ace66 Apr 17 '18 at 10:50
  • I wish I could give this multiple upvotes. I take it an appropriately modified version of this is probably the only way in C11 to achieve an atomic_compare_exchange_strong_no_lock_prefix. – Petr Skocik Feb 07 '22 at 13:26
  • @PSkocik: Well, AFAIK you can't get GCC or clang to emit asm without `lock` prefixes, but there is an [assembler option](https://sourceware.org/binutils/docs/as/i386_002dOptions.html#i386_002dOptions) `-momit-lock-prefix=yes` which will make GAS emit machine code without lock prefixes even if the compiler emitted them. This is only safe for single-threaded programs, or on uniprocessor machines, of course. `gcc -Wa,-momit-lock-prefix=yes` passes that on to the assembler. – Peter Cordes Feb 07 '22 at 13:32
  • AFAIK there isn't a GCC (or any other x86 compiler) option proper to tell it to compile that way, or any builtin for doing a non-atomic cmpxchg (or one that's atomic only wrt. interrupts / signal handlers in the current core / thread) – Peter Cordes Feb 07 '22 at 13:33
  • @PeterCordes I like how this inline assembly can be easily turned into a macro that behaves exactly like `atomic_compare_exchange_strong` as far as codegen is concerned. Now I'm curious if the same could be done for other RMW atomic ops, e.g., atomic_fetch_add. There it seems harder/impossible because gcc/clang output either `locked xadd memory, register` or `locked add memory, register` depending on whether the return value is used or not. – Petr Skocik Feb 09 '22 at 05:19
  • @PSkocik: I don't see an obvious way to do that optimization for inline asm. Note that it's even more important for ALU ops other than add/sub (`neg`/`xadd`), e.g. `fetch_or` needs a CAS retry loop if the old result is needed. (Unless the operand only has a single bit set, in which case you could `bts`, but that's likely only worth looking for with `if(__builtin_constant_p(x))`.) – Peter Cordes Feb 09 '22 at 05:57
  • One last question, now more directly relevant to your answer: while clang-trunk accepts it, older clangs don't seem to like the intel-syntax portion of the template and they complain with "unknown use of instruction mnemonic without a size suffix" https://gcc.godbolt.org/z/jfTa11Td1. Any idea why, and can it be made to work even there? – Petr Skocik Feb 09 '22 at 16:52
  • @PSkocik: clang-trunk accepts Intel-syntax inline asm now? That's cool, will have to update the clang parts of [How to set gcc to use intel syntax permanently?](https://stackoverflow.com/q/38953951) - I'd never previously found any way to get clang to use Intel syntax. Note that the clang error you're seeing with older clang is using AT&T operand names with Intel-syntax operand order from that part of the template: `cmpxchg (%rdi), %edx`, which has the operands backwards. – Peter Cordes Feb 09 '22 at 16:55

You have the operand order for the cmpxchg instruction reversed. AT&T syntax needs the memory destination last:

    "lock; cmpxchg %3, %0\n\t"

Or you could compile that instruction with its original order using `-masm=intel`, but the rest of your code is in AT&T syntax and ordering, so that's not the right answer.
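
For completeness, here is the asker's function with just that one line fixed (a sketch; the pushfl/popl approach is kept as-is, so this is still 32-bit-only; see the other answer for cleaner alternatives):

static inline int
cas(volatile void* addr, int expected, int newval) {
    int ret;
    asm volatile("movl %2, %%eax\n\t"
                "lock; cmpxchg %3, %0\n\t"   // memory destination last in AT&T syntax
                "pushfl\n\t"
                "popl %1\n\t"
                "and $0x0040, %1\n\t"        // mask off ZF
                : "+m" (*(int*)addr), "=r" (ret)
                : "r" (expected), "r" (newval)
                : "%eax", "memory");         // "memory" added: order surrounding accesses
    return ret;
}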


As far as why it says "operand size mismatch", I can only say that that appears to be an assembler bug, in that it uses the wrong message.

prl