How to write the btr instruction with 'asm volatile'

Question

What is a correct implementation that uses btr to reset a bit in an integer in C++ with the extended assembly (asm volatile) syntax? I need to return the value in the bit before the reset.

This is the implementation I have, is this correct for a 16 bit integer?

std::uint16_t reset(std::uint16_t& integer, std::uint32_t bit) {
  auto success = false;
  asm volatile("lock btrw %1, (%2); setnc %0"
               : "=r"(success)
               : "i"(bit), "r"(&integer)
               : "memory", "flags");
  return !success;
}

Is this implementation correct? Have I missed any detail? I am not very familiar with the asm() syntax or with x86 assembly.

Regarding the close vote, I do not believe I am asking why my code is misbehaving, but rather if I have missed any detail that might make this code behave badly in unforeseen circumstances. StackOverflow is the only source I have for reliable information in this area. The x86 docs don't translate well to the syntax here. — Curious, Aug 28 '18 at 23:08
Purely as a matter of interest, what software product are you creating that requires these many (in your case) abstruse operations? And I haven't voted to close or downvote you here, I'm genuinely interested. — , Aug 28 '18 at 23:17
I'd personally recommend not using inline asm even for this. I believe compilers have intrinsics you can call to perform this operation. [MSVC](https://learn.microsoft.com/en-gb/cpp/intrinsics/bittestandreset-bittestandreset64?view=vs-2017), — Justin, Aug 28 '18 at 23:23
`"r"(&integer)` should probably be made into an output constraint. `"=m"(integer)` . that will also capture the fact that you are modifying the value at that address.You can then remove the parentheses around `%2` Your compiler should support the intrinsic for this already. — Michael Petch, Aug 28 '18 at 23:26
Instead of `"r" (&integer)` you should use `"m" (integer)`. Also it should be a read-write operand and you should drop the `memory` constraint. Also note that the `i` constraint is compile time so it will not work if the value is only known at runtime (but there is a `btr r/m16, r16` you could use). Finally, `success` is a misleading name (and has misleading type) and also you could just spare the negation if you used `setc`. — Jester, Aug 28 '18 at 23:28
@Jester could you formalize that with full asm() blocks? I don't think I am able to follow your answer without looking at the whole expansion you have in mind :( — Curious, Aug 28 '18 at 23:41
@NeilButterworth At the moment, I'm just experimenting with different optimizations, and trying to see which instruction has the least overhead. — Curious, Aug 28 '18 at 23:42
@NeilButterworth there is a synchronization library that uses `fetch_and()` instruction. I am just experimenting with this to see if it has any effect — Curious, Aug 28 '18 at 23:45
For what purpose? What is the point in programming if you don't produce something useful? I suppose you see programming something like a crossword puzzle? — , Aug 28 '18 at 23:47
BTR with a memory operand is very likely to be slower than either BTR with a register operand or the equivalent C code. https://stackoverflow.com/questions/2039861/how-to-get-gcc-to-generate-bts-instruction-for-x86-64-from-standard-c#comment88671043_2392292 — Ross Ridge, Aug 28 '18 at 23:53
@NeilButterworth I am legitimately trying to optimize the synchronization library. How do you assume that I am trying to do something that is not useful here? — Curious, Aug 28 '18 at 23:57
Because you don't say what any of this, including this very vague library, is going to be used for. — , Aug 28 '18 at 23:58
@NeilButterworth I mentioned it was for a synchronization library, I am afraid I can't tell more than that in a public forum at the moment :/ — Curious, Aug 28 '18 at 23:59
If it's for synchronization across threads in a multiple CPU environment then you left out LOCK prefix that would make it atomic. — Ross Ridge, Aug 29 '18 at 00:01
@RossRidge only did so because I thought it was not relevant, I'll add it back. My understanding was that adding the lock prefix would make it atomic and would impose acquire-release ordering automatically — Curious, Aug 29 '18 at 00:01
The overhead is going to be dominated by the LOCK prefix, so your choice of instruction isn't going to matter. Though BTR could only make it worse as I mentioned above. — Ross Ridge, Aug 29 '18 at 00:10

Jester · Answer 1 · 2018-08-29T00:08:23.333

4

Here is a version that replaces the memory clobber with a proper read-write operand instead of passing in the address in a register and also gets rid of the setnc (requires gcc 6+). Added r to cover the case where bit is not known at compilation time. Changed success to more readable was_set with bool type. Note if you want this to be atomic you also need to add a lock prefix. For compiler memory barrier you might need to put the memory constraint back.

bool reset(std::uint16_t& integer, std::uint32_t bit) {
  bool was_set;
  asm volatile("btrw %w2, %1"
               : "=@ccc"(was_set), "+mr"(integer)
               : "ri"(bit)
               : "cc");
  return was_set;
}

The atomic version could look like:

bool reset(std::uint16_t& integer, std::uint32_t bit) {
  bool was_set;
  asm volatile("lock btrw %w2, %1"
               : "=@ccc"(was_set), "+m"(integer)
               : "ri"(bit)
               : "cc", "memory");
  return was_set;
}

edited Aug 29 '18 at 00:08

answered Aug 28 '18 at 23:52

Jester

56,577
4
81
125

Thanks, could you clarify what difference the memory constraint makes here? I do intend to add a lock prefix to this. I updated the question accordingly. What ordering does `memory` impose (as defined by `std::memory_order`)? – Curious Aug 29 '18 at 00:02
Note that `lock` does not work with register operands so you might want to change the `+mr` to just `m` then. The `memory` constraint would force the compiler to flush and reload any other things that may be held in registers. – Jester Aug 29 '18 at 00:06
So the `memory` constraint is the same as `std::atomic_thread_fence(std::memory_order_seq_cst)` (`MFENCE`)? – Curious Aug 29 '18 at 00:09
2

@Curious It's a compiler barrier. It has no effect on the the CPU reordering of memory accesses, just the compiler's reordering. – Ross Ridge Aug 29 '18 at 00:12
@Curious: The `lock` prefix makes it a full barrier for *runtime* memory reordering, like `mfence`. So with that *and* a barrier against compile-time reordering, you have both effects of `std::atomic_thread_fence(std::memory_order_seq_cst)`. BTW, using inline asm for this probably has no advantage over pure C++ atomic `fetch_and`, which can compile to `lock btrw` given a single-bit mask. (I think gcc and clang know that optimization). C++20 `atomic_ref` will make it official that you can do atomic stuff on a normal variable, but until then I guess just cast. – Peter Cordes Aug 29 '18 at 04:51
This seems to return the opposite value? When I do reset(0b0, 0) I get true, when I should be getting false. Are you sure this is correct? – Curious Aug 29 '18 at 18:19
I don't think this is correct, it produces bad output for me. I get `56` when I try `reset(0, 0);` Could you update the answer with the correct version? – Curious Aug 29 '18 at 18:30
You get `56` for what? `reset(0, 0)` should not even compile. `reset(x, 0)` with `x=0` works for me (but with inline asm that doesn't say much) Also all other test cases I tried work. – Jester Aug 29 '18 at 18:47
@Jester yes i mean with an integer lvalue argument. I don't get the same value :( is there a platform with minimum compatibility here or something? – Curious Aug 29 '18 at 20:40
@Jester This seems to work on my platform but does not in godbolt https://gcc.godbolt.org/z/MUs81E. Any idea why? – Curious Aug 29 '18 at 21:18
1

`i` only works for compile time constants. You can use `ri` as I have. – Jester Aug 29 '18 at 21:25
@Jester Does it make a runtime difference when compiled btw? – Curious Aug 29 '18 at 21:32
@Jester Why does this not work either? https://gcc.godbolt.org/z/QKrYf2 prepending the `r` before the `i` because of compile time constants didn't seem to help in clang.. – Curious Sep 05 '18 at 01:16
Because you used a 32 bit register. Cast `bit` to `uint64_t` or use `%q1`. – Jester Sep 05 '18 at 12:45

How to write the btr instruction with 'asm volatile'

1 Answers1