0

I am trying to write a simple compare and swap inline assembly code. Here is my code

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
static inline unsigned long
cas(volatile unsigned long* ptr, unsigned long old, unsigned long _new)
{
    unsigned long prev=0;
    asm volatile("lock cmpxchg8b %0;"
                 : "=m"(prev)
                 : "m"(*ptr),"a"(old),"c"(_new)
                 );
    return prev;
}

int main()
{

    unsigned long *a;
    unsigned long b=5,c;
    a=&b;
        c=cas(a,b,6);
    printf("%lu\n",c);
    return 0;
}

This code should ideally print 5 but it is printing 0. What is wrong in my code ?Please help.

Ritesh
  • 67
  • 8
  • 1
    Does http://stackoverflow.com/questions/6756985/correct-way-to-wrap-cmpxchg8b-in-gcc-inline-assembly-32-bits help? – user200783 Jun 14 '16 at 20:42
  • 1
    Have you read the docs for cmpxchg8b: `Compare EDX:EAX with m64. If equal, set ZF and load ECX:EBX into m64. Else, clear ZF and load m64 into EDX:EAX.` Since you aren't loading any particular value into EDX (or EBX), I assume the compare always fails, meaning the asm does nothing, and `prev` (init to 0 in unoptimized builds) is returned unchanged. Additionally, the memory address you are passing to cmpxchg8b is `prev` (aka %0), not ptr, so ptr is never used. Which is probably just as well since *ptr (vs ptr) is probably not a valid memory address. – David Wohlferd Jun 14 '16 at 22:26
  • Also, how long are `unsigned long` on your platform (you say x86)? If the answer is not 8 bytes, you need to reconsider using cmpxchg8b. What is wrong? I'm afraid it's nearly everything. – David Wohlferd Jun 14 '16 at 22:28

2 Answers2

4

Let me start by saying "Using inline asm is a bad idea." And let me repeat that "Using inline asm is a bad idea." You could write an entire wiki entry about why using inline asm is a bad idea. Please consider using builtins (like gcc's __sync_bool_compare_and_swap) or libraries like <atomic> instead.

If you are writing production software, the risks from using inline asm are almost certainly greater than any benefit. If you are writing for educational purposes, then read on.

(For a further illustration of why you shouldn't use inline asm, wait for Michael or Peter to show up and point out all the things wrong with this code. It's really hard, even for people who know this stuff, to get it right.)

Here is some code showing how to use cmpxchg8b. It is simple, but should be sufficient to give a general idea.

#include <stdio.h>

// Simple struct to break up the 8 byte value into 32bit chunks.
typedef union {
  struct {
     unsigned int lower;
     unsigned int upper;
  };
  unsigned long long int f;
} moo;

unsigned char cas(moo *ptr, moo *oldval, const moo *newval)
{
   unsigned char result;

#ifndef __GCC_ASM_FLAG_OUTPUTS__

   asm ("lock cmpxchg8b %[ptr]\n\t"
        "setz %[result]"
        : [result] "=q" (result), [ptr] "+m" (*ptr),
          "+d" (oldval->upper), "+a" (oldval->lower)
        : "c" (newval->upper), "b" (newval->lower)
        : "cc", "memory");

#else

   asm ("lock cmpxchg8b %[ptr]"
        : [result] "=@ccz" (result), [ptr] "+m" (*ptr),
          "+d" (oldval->upper), "+a" (oldval->lower)
        : "c" (newval->upper), "b" (newval->lower)
        : "memory");

#endif

   return result;
}

int main()
{
   moo oldval, newval, curval;
   unsigned char ret;

   // Will not change 'curval' since 'oldval' doesn't match.
   curval.f = -1;
   oldval.f = 0;
   newval.f = 1;

   printf("If curval(%u:%u) == oldval(%u:%u) "
          "then write newval(%u:%u)\n",
          curval.upper, curval.lower,
          oldval.upper, oldval.lower,
          newval.upper, newval.lower);

   ret = cas(&curval, &oldval, &newval);

   if (ret)
      printf("Replace succeeded: curval(%u:%u)\n",
             curval.upper, curval.lower);
   else
      printf("Replace failed because curval(%u:%u) "
             "needed to be (%u:%u) (which cas has placed in oldval).\n",
             curval.upper, curval.lower,
             oldval.upper, oldval.lower);

   printf("\n");

   // Now that 'curval' equals 'oldval', newval will get written.
   curval.lower = 1234; curval.upper = 4321;
   oldval.lower = 1234; oldval.upper = 4321;
   newval.f = 1;

   printf("If curval(%u:%u) == oldval(%u:%u) "
          "then write newval(%u:%u)\n",
          curval.upper, curval.lower,
          oldval.upper, oldval.lower,
          newval.upper, newval.lower);

   ret = cas(&curval, &oldval, &newval);

   if (ret)
      printf("Replace succeeded: curval(%u:%u)\n",
             curval.upper, curval.lower);
   else
      printf("Replace failed because curval(%u:%u) "
             "needed to be (%u:%u) (which cas has placed in oldval).\n",
             curval.upper, curval.lower,
             oldval.upper, oldval.lower);

}

A few points:

  • If the cas fails (because the values don't match), the return value from the function is 0, and the value you need to use is returned in oldval. This makes trying again simple. Note that if you are running multi-threaded (which you must be or you wouldn't be using lock cmpxchg8b), a second attempt could conceivable fail as well, since the 'other' thread could have beaten you to the write again.
  • The __GCC_ASM_FLAG_OUTPUTS__ define is available on newer builds of gcc (6.x+). It allows you to skip doing the setz and use the flags directly. See the gcc docs for details.

As for how it works:

When we call cmpxchg8b, we pass it a pointer to memory. It is going to compare the (8 byte) value that is in that memory location to the 8 bytes in edx:eax. If they match, then it will write the 8 bytes in ecx:ebx to the memory location and the zero flag will be set. If they don't match, then the current value will be returned in edx:eax and the zero flag will be cleared.

So, compare that with the code:

   asm ("lock cmpxchg8b %[ptr]"

Here we are passing the pointer to the 8 bytes to cmpxchg8b.

        "setz %[result]"

Here we are storing the contents of the zero flag set by cmpxchg8b into (result).

        : [result] "=q" (result), [ptr] "+m" (*ptr),

Specify that (result) is an output (=), and that it must be a byte register (q). Also, the memory pointer is an in+out (+) since we will be both reading it and writing to it.

          "+d" (oldval->upper), "+a"(oldval->lower)

The + signs again indicate that these values are in+out. This is necessary since if the comparison fails, edx:eax will be overwritten with the current value from ptr.

        : "c" (newval->upper), "b"(newval->lower)

These values are input-only. The cmpxchg8b isn't going to change their values so we put them after the second colon.

        : "cc", "memory");

Since we are changing the flags, we need to inform the compiler via "cc". The "memory" constraint may not be necessary, depending on exactly what cas is being used for. It's possible that thread 1 is notifying thread 2 that something is ready for processing. In that case, you want to make absolutely sure gcc doesn't have any values in registers that it is planning to write to memory later. It absolutely must flush them all to memory before executing the cmpxchg8b.

The gcc docs describe in detail the workings of the extended asm statement. If parts of this explanation are still unclear, some reading might help.

BTW in case I forgot to mention, writing inline asm is a bad idea...

David Wohlferd
  • 7,110
  • 2
  • 29
  • 56
  • 1
    hehe, I think I'll leave this one alone; looks like you have it covered. Despite my avatar, I do sometimes need to restrain myself from trying to correct *everything* that's wrong on the Internet. Although, are you sure you need a `"memory"` clobber? If you use a `"+m"` operand for the value in memory that might or might not be modified, the compiler will have to reload it. Probably using a `volatile moo*` is a good choice. – Peter Cordes Jun 15 '16 at 02:37
  • Also, you could avoid the `union` by using a `"+A"` constraint, meaning the combination of `edx:eax` as a 64bit value. But it is clearer, and portable to x86-64 ([where a 64bit value will just go into either rax or rdx](https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html)). – Peter Cordes Jun 15 '16 at 02:43
  • @MichaelPetch: Hey, I linked to both my wiki and my docs. Wouldn't want to blow my horn *too* much. – David Wohlferd Jun 15 '16 at 02:53
  • @PeterCordes: Consider code where thread 1 is handing work off to a worker thread. It populates a struct and uses cas to add it to a linked list. You MUST ensure that the struct is flushed BEFORE adding the node to the list (ie memory clobber), since the 'other' thread could start accessing it the instant the cmpxchg8b is complete. The clobber isn't needed for the moo ptr, it's needed for everything else. – David Wohlferd Jun 15 '16 at 02:55
  • 1
    @PeterCordes: Just couldn't help yourself, eh? I'm not a big fan of `+A`. For teaching purposes (like the current context), I just can't see it. As for `volatile`, hmm. Given the "+m", I don't know what more it could buy you. From a self-documenting standpoint, maybe. Could the compiler need this for some reason? – David Wohlferd Jun 15 '16 at 03:09
  • @DavidWohlferd: oh right, to get release semantics with a compiler memory barrier. I guess it's a good idea to keep things simple by making it a CAS with sequential-consistency semantics, rather than relaxed, since that would probably lead someone to include a separate `mfence` instead of just using a compiler memory barrier. (There's no standard way to express the fact that it's a full memory barrier at the asm level, but relaxed as far as compile-time reordering, so only a compiler barrier is needed to strengthen it). – Peter Cordes Jun 15 '16 at 05:51
  • 1
    I was thinking of `volatile` just as a self-documenting thing, and to prevent any overly-clever compiler from loading/storing to it outside the asm. Besides, if you need to spin on it to wait for some condition before cmpxchg8b, you should do that [with normal loads (and a `pause` instruction)](http://stackoverflow.com/a/37246263/224132). 8-byte atomic loads/stores are possible in 32bit mode using SSE `movq`, or `fild`/`fistp`, and gcc will actually use that. You only need cmpxchg8b for an atomic RMW. – Peter Cordes Jun 15 '16 at 05:55
  • Since all these recent cmpxchg?b questions seem to refer to `node_lf`, adding the memory clobber seemed prudent. You've got a good point on volatile. Doesn't seem worth editing for just that, but if something else comes up, I'll make sure to add it. OP hasn't accepted the answer yet, so there may be other changes. – David Wohlferd Jun 15 '16 at 06:20
2

Sorry for not answering your question directly, but my question is: why not use C11's <stdatomic.h> or C++11's <atomic>? It's a lot less error-prone than writing your own functions and has the advantage that you're not targeting a specific hardware architecture or compiler.

In your case you should either be using atomic_compare_exchange_weak() or atomic_compare_exchange_strong().

Ed Schouten
  • 500
  • 2
  • 5
  • These two functions you have mentioned ,they return boolean values. For my implementation I need old value pointed *ptr. – Ritesh Jun 14 '16 at 20:42
  • These functions can also be used for that purpose. The variable pointed to by the second argument will be overwritten to contain the old value. – Ed Schouten Jun 14 '16 at 20:45
  • 1
    @Ritesh: Please edit your question and add why you cannot use standard features of the language and have to revert to Assembler code. In general it is a bad idea to write your own stubs, as they might interfere with compiler optimisation and result in worse code. Not to mention it is not portable. There are application notes to be found how to emulate CAS with stdatomics. But actually, the standard provides more useful functions which make that CAS-emulation likely superfluous. – too honest for this site Jun 14 '16 at 21:19