7

I want to use the bts and bt x86 assembly instructions to speed up bit operations in my C++ code on the Mac. On Windows, the _bittestandset and _bittest intrinsics work well, and provide significant performance gains. On the Mac, the gcc compiler doesn't seem to support those, so I'm trying to do it directly in assembler instead.

Here's my C++ code (note that 'bit' can be >= 32):

typedef unsigned long LongWord;
#define DivLongWord(w) ((unsigned)w >> 5)
#define ModLongWord(w) ((unsigned)w & (32-1))

inline void SetBit(LongWord array[], const int bit)
{
   array[DivLongWord(bit)] |= 1 << ModLongWord(bit);
}

inline bool TestBit(const LongWord array[], const int bit)
{
    return (array[DivLongWord(bit)] & (1 << ModLongWord(bit))) != 0;
}

The following assembler code works, but is not optimal, as the compiler can't optimize register allocation:

inline void SetBit(LongWord* array, const int bit)
{
   __asm {
      mov   eax, bit
      mov   ecx, array
      bts   [ecx], eax
   }
}

Question: How do I get the compiler to fully optimize around the bts instruction? And how do I replace TestBit by a bt instruction?

smartgo
  • 160
  • 2
  • 6

4 Answers4

8

BTS (and the other BT* insns) with a memory destination are slow. (>10 uops on Intel). You'll probably get faster code from doing the address math to find the right byte, and loading it into a register. Then you can do the BT / BTS with a register destination and store the result.

Or maybe shift a 1 to the right position and use OR with with a memory destination for SetBit, or AND with a memory source for TestBit. Of course, if you avoid inline asm, the compiler can inline TestBit and use TEST instead of AND, which is useful on some CPUs (since it can macro-fuse into a test-and-branch on more CPUs than AND).

This is in fact what gcc 5.2 generates from your C source (memory-dest OR or TEST). Looks optimal to me (fewer uops than a memory-dest bt). Actually, note that your code is broken because it assumes unsigned long is 32 bits, not CHAR_BIT * sizeof(unsigned_long). Using uint32_t, or char, would be a much better plan. Note the sign-extension of eax into rax with the cqde instruction, due to the badly-written C which uses 1 instead of 1UL.

Also note that inline asm can't return the flags as a result (except with a new-in-gcc v6 extension!), so using inline asm for TestBit would probably result in terrible code code like:

...  ; inline asm
bt   reg, reg
setc al       ; end of inline asm
test al, al   ; compiler-generated
jz bit_was_zero

Modern compilers can and do use BT when appropriate (with a register destination). End result: your C probably compiles to faster code than what you're suggesting doing with inline asm. It would be even faster after being bugfixed to be correct and 64bit-clean. If you were optimizing for code size, and willing to pay a significant speed penalty, forcing use of bts could work, but bt probably still won't work well (because the result goes into the flags).

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • 4
    Actually, as of v6, gcc [can](https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#FlagOutputOperands) return flags. – David Wohlferd Nov 25 '15 at 06:49
  • @DavidWohlferd: Thanks! That's interesting that they finally included that. – Peter Cordes Nov 25 '15 at 09:01
  • 1
    @PeterCordes have you ever seen a compiler generate `bts`? I could indeed get at least `clang` and `icc` to generate `bt` for "test bit" type [functions](https://godbolt.org/g/xx3v1x) but didn't have any luck with `bts` even though it would seem very useful for "set bit" type function compared to the alternatives of using `shl` or `shlx` and `or` and a couple extra ops to load the constant `1` etc. – BeeOnRope Mar 17 '17 at 00:50
  • 2
    @BeeOnRope: I forget. What you describe sounds familiar: using `bt` for testing bits, but failing to use `bts` for set-bit. I'm pretty sure it's a win on Intel CPUs at least, for stuff like `foo |= 1< – Peter Cordes Mar 17 '17 at 02:19
  • 1
    Yeah that was exactly the idom I tested in the godbolt link. I _did_ find a way to generate `bts` on `icc` only, in the specific case of a [compile-time bit index](https://godbolt.org/g/wuxMcm). Neither `gcc` nor `clang` use `bts` there, and `icc` doesn't use it if the position is variable (where it is arguably even more useful, since the non-`bts` code for variable shifts is generally even slower than the constant case, especially pre-`SHLX`). – BeeOnRope Mar 17 '17 at 02:23
  • @BeeOnRope: huh, interesting use-case. I guess `or r64, imm32` wouldn't work there, because the constant can't be represented as a sign-extended imm32. Agreed that using it for variable-shifts is a bigger win. – Peter Cordes Mar 17 '17 at 02:26
  • @BeeOnRope: re: lack of recent answers. Not exactly, but I was getting tired of seeing the same newbie questions all the time so I took a break for SO for a while to catch up on TV / movies / books I'd been wanting to watch/read. – Peter Cordes Mar 17 '17 at 02:28
  • @PeterCordes, duh, good point. I had only tested it with such a large constant due to copy-pasting from another test: but I hadn't explicitly intended to use 64-bit. Of course for 32-bit constants `or` is usually better due to more execution units and faster average speed across many architectures (`bts` hasn't been 1 cycle forever, AFAIK). Indeed, everyone uses `or` for 32-bit or less constants (you could still make a minor case for `bts` since it is 5 bytes vs `or`-with-32-bit-constant's 7 bytes). – BeeOnRope Mar 17 '17 at 02:35
5
inline void SetBit(*array, bit) {
    asm("bts %1,%0" : "+m" (*array) : "r" (bit));
}
ephemient
  • 198,619
  • 38
  • 280
  • 391
  • Perfect, thanks. That helped me figure out my second question: inline bool TestBit(const LongWord array[], const int bit) { bool flag; asm("bt %2,%1; setb %0" : "=q" (flag) : "m" (*array), "r" (bit)); return flag; } – smartgo Dec 31 '09 at 02:41
  • @ephemient: Can you please add explanations to your answer. – arunmoezhi Apr 09 '14 at 21:09
  • 1
    If `bit` might be outside the 0..31 range, your memory operand should be the whole array, not just the first element of it. (Remember the crazy-CISC bit-string semantics of [`bts`](http://felixcloutier.com/x86/BTS.html) with a memory operand.) Also, you should let the source operand be `"ri"`, because it works with immediate operands (and is much less slow that way). But really you shouldn't use this at all on modern x86. `bts` with a memory operand is *slow* (see my answer). – Peter Cordes Jun 13 '18 at 09:31
  • This instruction mangles C flag, but the assembly doesn't specify that. – Maxim Egorushkin Jun 17 '20 at 15:14
  • 1
    @MaximEgorushkin: A "cc" clobber is implicit in GNU C inline asm for x86 and x86-64. That's not a bug, although some people consider it good style to manually specify it anyway. – Peter Cordes Jul 20 '21 at 08:23
1

This version efficiently returns the carry flag (via the gcc-v6 extension mentioned by Peter in the top answer) for a subsequent test instruction. It only supports a register operand since use of a memory operand is very slow as he said:

int variable_test_and_set_bit64(unsigned long long &n, const unsigned long long bit) {
    int oldbit;
    asm("bts %2,%0"
        : "+r" (n), "=@ccc" (oldbit)
        : "r" (bit));
    return oldbit;
}

Use in code is then like so. The wasSet variable is optimized away and the produced assembly will have bts followed immediately by jb instruction checking the carry flag.

unsigned long long flags = *(memoryaddress);
unsigned long long bitToTest = someOtherVariable;
int wasSet = variable_test_and_set_bit64(flags, bitToTest);
if(!wasSet) {
  *(memoryaddress) = flags;
}

Although it seems a bit contrived, this does save me several instructions vs the "1ULL << bitToTest" version.

muusbolla
  • 637
  • 7
  • 20
  • Nice, thanks for writing this up. Ideally, GCC should use `bts reg,reg` itself when optimizing pure C, and it does in a few cases (like with non-pointers), but not with `mem[pos/32] |= 1<<(pos%32)` (https://godbolt.org/z/an6W1nqTq). It's had that missed optimization bug for years now; so does clang. Related: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80813 re: using `bt` for read-only check of a bit-array. – Peter Cordes Mar 18 '22 at 02:45
-1

Another slightly indirect answer, GCC exposes a number of atomic operations starting with version 4.1.

D.Shawley
  • 58,213
  • 10
  • 98
  • 113