Atomic test-and-set in x86: inline asm or compiler-generated lock bts?

Question

The below code when compiled for a xeon phi throws Error: cmovc is not supported on k1om.

But it does compile properly for a regular xeon processor.

#include<stdio.h>
int main()
{
    int in=5;
    int bit=1;
    int x=0, y=1;
    int& inRef = in;
    printf("in=%d\n",in);
    asm("lock bts %2,%0\ncmovc %3,%1" : "+m" (inRef), "+r"(y) : "r" (bit), "r"(x));
    printf("in=%d\n",in);
}

Compiler - icc (ICC) 13.1.0 20130121

Related question: bit test and set (BTS) on a tbb atomic variable

IIRC, first-gen Xeon Phi is based on P5 cores (Pentium, and Pentium MMX). `cmov` wasn't introduced until P6 (aka Pentium Pro). So I think this is normal. Just let the compiler do its job by writing a normal ternary operator. Also note that `bts` with a memory operand is super-slow. Don't do that. Also, I think you're testing bits in the address of `in`, since you're asking for `&in` to be stored in memory. Is that what you wanted? — Peter Cordes, Jan 22 '16 at 07:24
i assumed as xeon phi is also x86 based, it should work fine — arunmoezhi, Jan 22 '16 at 07:29
@PeterCordes: Yes `bts` is slow. I'm using this in an multi-threaded code. Trying to set a particular bit of an integer. The variable `y` will be `1` if the bts succeeded in setting the bit. It will be `0` if the bit is already set. — arunmoezhi, Jan 22 '16 at 07:32
You're not using `lock bts`, so there's no point in using it with a memory operand. It's still a separate read-modify-write. Use `std::atomic`'s `fetch_or`. http://en.cppreference.com/w/cpp/atomic/atomic/fetch_or. (Also, my comment about modifying the address was wrong. It's a reference, not a pointer. I don't see the purpose of using a reference, but `"+m"` for it should still just be the same memory where `in` is stored. — Peter Cordes, Jan 22 '16 at 07:34
ahh.. you are right. When I stripped down the code to create a MWE I missed that. — arunmoezhi, Jan 22 '16 at 07:35

score 3 · Accepted Answer · edited May 23 '17 at 11:52

IIRC, first-gen Xeon Phi is based on P5 cores (Pentium, and Pentium MMX). cmov wasn't introduced until P6 (aka Pentium Pro). So I think this is normal.

Just let the compiler do its job by writing a normal ternary operator.

Second, cmov is a far worse choice for this than setc, since you want to produce a 0 or 1 based on the carry flag. See my asm code below.

Also note that bts with a memory operand is super-slow, so you don't want it to generate that code anyway, esp. on a CPU that decodes x86 instructions into uops (like a modern Xeon). According to http://agner.org/optimize/, bts m, r is much slower than bts m, i even on P5, so don't do that.

Just ask the compiler for in to be in a register, or better yet, just don't use inline asm for this.

Since the OP apparently wants this to work atomically, the best solution is to use C++11's std::atomic::fetch_or, and leave it up to the compiler to generate lock bts.

std::atomic_flag has a test_and_set function, but IDK if there a way to pack them tightly. Maybe as bitfields in a struct? Unlikely though. I also don't see atomic operations for std::bitset.

Unfortunately, current versions of gcc and clang don't generate lock bts from fetch_or, even when the much-faster immediate-operand form is usable. I came up with the following (godbolt link):

#include <atomic>
#include <stdio.h>

// wastes instructions when the return value isn't used.
// gcc 6.0 has syntax for using flags as output operands

// IDK if lock BTS is better than lock cmpxchg.
// However, gcc doesn't use lock BTS even with -Os
int atomic_bts_asm(std::atomic<unsigned> *x, int bit) {
  int retval = 0;  // the compiler still provides a zeroed reg as input even if retval isn't used after the asm :/
  // Letting the compiler do the xor means we can use a m constraint, in case this is inlined where we're storing to already zeroed memory
  // It unfortunately doesn't help for overwriting a value that's already known to be 0 or 1.
  asm( // "xor      %[rv], %[rv]\n\t"
       "lock bts %[bit], %[x]\n\t"
       "setc     %b[rv]\n\t"  // hope that the compiler zeroed with xor to avoid a partial-register stall
        : [x] "+m" (*x), [rv] "+rm"(retval)
        : [bit] "ri" (bit));
  return retval;
}

// save an insn when retval isn't used, but still doesn't avoid the setc
// leads to the less-efficient setc/ movzbl sequence when the result is needed :/
int atomic_bts_asm2(std::atomic<unsigned> *x, int bit) {
  uint8_t retval;
  asm( "lock bts %[bit], %[x]\n\t"
       "setc     %b[rv]\n\t"
        : [x] "+m" (*x), [rv] "=rm"(retval)
        : [bit] "ri" (bit));
  return retval;
}


int atomic_bts(std::atomic<unsigned> *x, unsigned int bit) {
  // bit &= 31; // stops gcc from using shlx?
  unsigned bitmask = 1<<bit;
  //int oldval = x->fetch_or(bitmask, std::memory_order_relaxed);

  int oldval = x->fetch_or(bitmask, std::memory_order_acq_rel);
  // acquire and release semantics are free on x86
  // Also, any atomic rmw needs a lock prefix, which is a full memory barrier (seq_cst) anyway.

  if (oldval & bitmask)
    return 1;
  else
    return 0;
}

As discussed in What is the best way to set a register to zero in x86 assembly: xor, mov or and?, xor / set-flags / setc is the optimal sequence for all modern CPUs when the result is needed as a 0-or-1 value. I haven't actually considered P5 for that, but setcc is fast on P5 so it should be fine.

Of course, if you want to branch on this instead of storing it, the boundary between inline asm and C is an obstacle. Spending two instructions to store a 0 or 1, only to test/branch on it, would be pretty dumb.

gcc6's flag-operand syntax would certainly be worth looking in to, if it's an option. (Probably not if you need a compiler that targets Intel MIC.)

what is the difference between `bts m,r` and `bts m,i`? Does the `i` here means immediate addressing mode? — arunmoezhi, Jan 22 '16 at 07:48
@arunmoezhi: yes, `i` is an immediate. If the bit-position is a compile-time constant, you will get much better results that way, since the CPU knows the final memory address much earlier in the pipeline, so it decodes to a lot fewer uops. Use a `ri` constraint, for example. I'm not having much luck getting gcc to generate a `lock bts` even when it might be optimal, though. http://goo.gl/yKYTfY. http://en.cppreference.com/w/cpp/atomic/atomic_flag looks useful (has an atomic test_and_set function), but only for stand-alone flags, not for bits in an int :/ — Peter Cordes, Jan 22 '16 at 07:58
Thanks. In the `atomic_bts` function, the assembly code has `cmpxchg`. But in a multithreaded environment, compare_and_exchange might be slower than `bts`. For example, two threads `A` and `B` tries to `simultaneously` set bits `bitA` and `bitB` respectively. If I use `cmpxchg` then only one will succeed. But `bts` on both the bits will succeed. — arunmoezhi, Jan 22 '16 at 08:04
`atomic_flag` looks like a good alternative. But I cannot use it, as it is a single flag. I use an integer as an array of bits which can be locked separetely. `atomic_flag` will give me only one lock. but an integer will give be atleast 32 bits each of which can be used as a lock — arunmoezhi, Jan 22 '16 at 08:07
@arunmoezhi: so you've created a single word with 32 lock bits. If you really need that many locks, this is going to be a single point of very high contention even if threads want different locks. I'd think you'd want to create an array of individual locks each in their own cache line so that contenders for a particular "bit" only contend for that lock's cache line. [[IIRC, Intel has some funny rules about how far apart locks should be to minimize bus locking, it might be 128bytes, 2-4 cache lines, or some such. I suggest you consult the Intel optimization manuals.] — Ira Baxter, Jan 22 '16 at 08:26
@IraBaxter: valid point. But in my use case, I have millions of items each having a single word with 32 lock bits. For space constraints I do not want to create an array of integers. — arunmoezhi, Jan 22 '16 at 08:29
@arunmoezhi: Surely each of those locks is protecting some piece of data (otherwise, why do you need them?) You could simply associate a lock with the piece of data, and cache-align the data-with-lock. In this case, you can still use a single bit for the lock and abuse the rest of the word for something else. (Whoever owns the data can do anything it likes to that word as long as it leaves the locked-bit set). — Ira Baxter, Jan 22 '16 at 08:31
@arunmoezhi: If the lock bits *are* the data, can you maybe divide ownership of them at a larger scale, so a thread has exclusive access to a range of the bitmap without needing to use locks or atomic operations? Anyway, updated my post with a good asm implementation of what you were trying to do (which doesn't use `cmov` anyway), and a hopefully good std::atomic version that generates not-terrible code. You didn't look closely enough at the code it generates, though: It loops on `lock cmpxchg` until it succeeds, like is normally necessary when using that insn. It's ugly :/ — Peter Cordes, Jan 22 '16 at 08:40

Atomic test-and-set in x86: inline asm or compiler-generated lock bts?

1 Answers1

Linked