
I was trying to write a small wrapper function for the BTS instruction. So instead of the obvious:

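bool bts( volatile uint32_t* dst, int idx ) {
  uint32_t mask = 1u << idx;     // unsigned literal so idx == 31 is well defined
  bool ret = !!(*dst & mask);    // previous value of the bit
  *dst |= mask;                  // set the bit in the pointed-to word
  return ret;
}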

I wrote this:

bool bts( volatile uint32_t* dst, int idx ) {
  bool ret;
  asm( "xor %1, %1\n\t"
       "bts %2, %0\n\t"
       "adc %1, %1"
       : "+m"(*dst), "=r"(ret) : "Ir"(idx) : "cc","memory" );
  return ret;
}

This behaves correctly in optimized builds, but in non-optimized builds idx is always treated as 0. From the generated asm, it looks like idx is not taken from the rdx register but from the stack!

What am I doing wrong?

Jester
BitWhistler
  • You might also consider using `setc` instead of `xor + adc`. And I don't believe the 'memory' clobber is required here. – David Wohlferd Jan 22 '16 at 23:06
  • There is no reason to `xor`, you should use a constraint so the compiler can take care of the zeroing. Incidentally that also fixes your problem. Alternatively, you can use `setc` or `sbb`. – Jester Jan 22 '16 at 23:26
  • I have tried `setc` and `sbb`, and they both work. I just wanted to understand what's wrong with my `adc` version:) – BitWhistler Jan 22 '16 at 23:34
  • See also http://stackoverflow.com/questions/34940356/atomic-test-and-set-in-x86-inline-asm-or-compiler-generated-lock-bts. If you want this to be atomic, you need to use `lock bts`. `volatile` doesn't do that. If not, you can use a `"+rm"` constraint. Anyway, have a look at my asm. It uses `setc` instead of `adc`, which is more efficient on pre-Broadwell Intel CPUs. – Peter Cordes Jan 23 '16 at 02:57
  • The nice thing about setc is that you don't have to zero the register first. And if you weren't zeroing the register first, you wouldn't have the conflict between registers that requires '&'. Which means you not only have fewer statements, you can use fewer registers (ok, 1 fewer). Also I notice that dst is volatile. Is this a multi-threading thing? If so, do you need to use the `lock` prefix? And while you probably aren't using gcc v6, if you were, you could even omit the setc and use the flags [directly](https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#FlagOutputOperands). FWIW. – David Wohlferd Jan 23 '16 at 03:02
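The `setc` and flag-output suggestions from the comments above can be sketched roughly as follows. This is only an illustration, not code posted by any of the commenters: the names `bts_setc` and `bts_flag` are made up, the explicit `l` suffix on `btsl` and the `assert` range check are additions, and the plain-C fallback exists only so the file still builds on non-x86 targets. Neither variant is atomic; that would need a `lock` prefix, as noted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))

/* Hypothetical name. Sets bit idx of *dst, returns its previous value. */
bool bts_setc(volatile uint32_t *dst, int idx) {
    assert(idx >= 0 && idx < 32);   /* keep the access inside *dst */
    bool ret;
    /* setc writes %1 only after bts has read %2, so no '&' is needed.
       "q" asks for a register that has a byte-sized form. */
    __asm__("btsl %2, %0\n\t"
            "setc %1"
            : "+m"(*dst), "=q"(ret)
            : "Ir"(idx)
            : "cc");
    return ret;
}

/* Hypothetical name. Same operation, using a flag output operand when
   the compiler advertises support for it (GCC 6 and later on x86). */
bool bts_flag(volatile uint32_t *dst, int idx) {
#ifdef __GCC_ASM_FLAG_OUTPUTS__
    assert(idx >= 0 && idx < 32);
    bool ret;
    /* "=@ccc" hands the carry flag straight to the compiler. */
    __asm__("btsl %2, %0"
            : "+m"(*dst), "=@ccc"(ret)
            : "Ir"(idx));
    return ret;
#else
    return bts_setc(dst, idx);      /* older compiler: fall back to setc */
#endif
}

#else  /* not x86: plain C fallback so the sketch still compiles */

bool bts_setc(volatile uint32_t *dst, int idx) {
    assert(idx >= 0 && idx < 32);
    uint32_t mask = (uint32_t)1 << idx;
    bool ret = (*dst & mask) != 0;
    *dst |= mask;
    return ret;
}

bool bts_flag(volatile uint32_t *dst, int idx) {
    return bts_setc(dst, idx);
}

#endif
```

Because `setc` needs no zeroed register, the output is never written before the inputs are read, which is why these variants avoid the register conflict entirely.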

1 Answer


Indeed, at -O0 GCC first copies all data to the stack. This is done for a better debugging experience (when compiling with -O0, variables should not be "optimized out"). That is not the cause of the problem, though. The cause is that your asm writes %1 (the `xor`) before it has read the input %2, without telling the compiler so. GCC assumes that outputs are written only after all inputs have been read, so it is free to assign overlapping registers to %1 and %2 (al and eax respectively in the version I tried). The `xor` then zeroes idx before `bts` gets to read it.

You can use the "=&r" constraint instead of "=r" to prevent this from happening.
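A minimal sketch of that fix, assuming GCC or Clang on x86: it is the `adc` version from the question with only the output constraint changed to `"=&r"`. The name `bts_earlyclobber`, the explicit `l` suffix on `btsl` (so the immediate form assembles unambiguously), the `assert` range check, and the plain-C fallback for non-x86 targets are my additions for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name. Sets bit idx of *dst, returns its previous value. */
bool bts_earlyclobber(volatile uint32_t *dst, int idx) {
    assert(idx >= 0 && idx < 32);   /* keep the access inside *dst */
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    bool ret;
    /* '&' marks %1 as early-clobber: it is written (by xor) before the
       inputs are consumed, so GCC must not share its register with %2
       or with any register used to address %0. */
    __asm__("xor %1, %1\n\t"        /* ret = 0, also clears CF          */
            "btsl %2, %0\n\t"       /* CF = old bit, then set the bit   */
            "adc %1, %1"            /* ret = 0 + 0 + CF                 */
            : "+m"(*dst), "=&r"(ret)
            : "Ir"(idx)
            : "cc", "memory");
    return ret;
#else
    /* Not x86: plain C fallback so the sketch still compiles. */
    uint32_t mask = (uint32_t)1 << idx;
    bool ret = (*dst & mask) != 0;
    *dst |= mask;
    return ret;
#endif
}
```

With the `&` in place the function should return the same result at -O0 as in optimized builds, since idx can no longer end up in the register that `xor` zeroes.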

See also: "When to use earlyclobber constraint in extended GCC inline assembly?"

Mikhail Maltsev