2

This question:

How do you set, clear, and toggle a single bit?

discusses three operations on a specific bit within a larger value, which correspond in a straightforward manner to OR 1, AND 0, and XOR 1 on that bit. But what if we don't know that second bit's value in advance? What if we want perform an assignment, where we don't know the other operand bit at compile-time? That is, we want set one bit, at a certain position a bit block, to the value of a new unrelated bit provided at run-time?

I'd like to have the fastest possible implementation for this on, say, Intel x86_64 (and ignoring vectorization, which I hope is irrelevant here). Also, suppose for simplicity the bit block type is uint32_t.

Edit: Made my answer C rather than C++, since there's nothing C++'ish about the question really.

einpoklum
  • 118,144
  • 57
  • 340
  • 684

1 Answers1

2

Here are two possible implementations, for the case of a 32-bit block type:

#include <cstdint>

uint32_t bit_assign_v1(uint32_t block, uint8_t bit_index, bool x)
{
    uint32_t mask = uint32_t { 1 } << bit_index;
    return (block & ~mask) | (((uint32_t) x) << bit_index);
}

uint32_t bit_assign_v2(uint32_t block, uint8_t bit_index, bool x)
{
    uint32_t mask = uint32_t { 1 } << bit_index;
    return x ? (block & ~mask) : (block | mask);
}

Using GodBolt, I get differently-optimized code for each of these two options, which also differs as we change the platforms, and compilers. Here's an example for Skylake (or better yet, look at this version, which is the same code but split into more C statements so you can better associate the assembly with the C code).

GCC 8.2 assembly:

bit_assign_1:
        movzx   eax, sil
        btr     edi, eax
        movzx   edx, dl
        shlx    eax, edx, eax
        or      eax, edi
        ret
bit_assign_2:
        mov     ecx, 1
        shlx    esi, ecx, esi
        andn    eax, esi, edi
        or      esi, edi
        test    dl, dl
        cmove   eax, esi
        ret

clang 7.0 assembly:

bit_assign_1:                           # @bit_assign_1
        btr     edi, esi
        shlx    eax, edx, esi
        or      eax, edi
        ret
bit_assign_2:                           # @bit_assign_2
        mov     eax, edi
        btr     eax, esi
        bts     edi, esi
        test    edx, edx
        cmovne  edi, eax
        mov     eax, edi
        ret

I haven't benchmarked any of this yet.

einpoklum
  • 118,144
  • 57
  • 340
  • 684
  • This is not an assignment though – Slava Oct 03 '18 at 17:27
  • 1
    The first asm looks better for throughput on Intel: only 5 uops, vs. 6 (or 7 on Haswell and earlier where cmov is 2). But on Ryzen, `btr` costs 2 uops. (The `movzx eax, sil` is redundant, though. `btr` masks the count so `btr edi, esi` would ignore the low bits. Hopefully that disappears when inlining, though. In the 2nd version, the `mov ecx,1` could be hoisted out of a loop, and the `test dl,dl` could disappear (flags set by whatever computed the result). – Peter Cordes Oct 03 '18 at 17:27
  • I wonder if you could get gcc or clang to emit `btr`, `mov+bts`, and `test`/`cmov` to select the reset or set version? The choice between these versions might depend on which of the 3 inputs is on the critical path for latency. The first version has 3c latency from `bool x` being ready to the result being ready, but the 2nd only has 2c latency there (with single-uop `cmov`.) The first would also be 2 uops if gcc knew to use a different register for `movzx` to allow mov-elimination, though :/ But if movzx optimizes away on inlining... – Peter Cordes Oct 03 '18 at 17:32
  • @Slava: Well, these functions return the result of the assignment, so it's good enough... – einpoklum Oct 03 '18 at 17:37
  • @PeterCordes: About your second comment - see edit for what clang does bt default. I'm not very fluent in assembly (e.g. I need to read up on `btr`). I will say though that, in a loop, you very often need to assign to _many_ bits of the same byte, and if you know what the pattern of assignment positions are you may be able to do something smarter still. – einpoklum Oct 03 '18 at 17:45
  • `btr` Resets (clears) the destination bit selected by the source index. The manual for `bt`/`btr`/`bts`/`btc` http://felixcloutier.com/x86/BTR.html isn't 100% clear that the shift-count is masked for register counts. But it is, and I think gcc knows this. Or maybe just clang does. (I've seen one of them optimize away the AND in `x & (1<< (n&31))`). – Peter Cordes Oct 03 '18 at 17:53
  • Note that the clang version is dependent on the unwritten "guarantee" that narrow inputs are zero- or sign-extended to 32 bit by the caller. That's why it can leave out both `movzx` instructions. [Is a sign or zero extension required when adding a 32bit offset to a pointer for the x86-64 ABI?](https://stackoverflow.com/a/36760539). Neat, clang uses my other idea :P But yeah, only 3 uops is very nice. I don't think you can do better than that, and unfortunately x86 doesn't have a `btx` exchange instruction that copies CF *into* to destination, only unconditional set, reset, and complement. – Peter Cordes Oct 03 '18 at 17:56
  • @PeterCordes: So, just so I get this straight - clang is relying on the top bytes of `%esi` being 0, even though the ABI doesn't guarantee that they are? Cuh-razeee! – einpoklum Oct 03 '18 at 19:05
  • 1
    Yup, read the linked answer. Both gcc and clang do satisfy that not-yet-standardized behaviour when emitting code that calls functions, but only clang relies on it in the callee. I forget if ICC can safely call clang code or not, but gcc and clang can call each other. Hand-written asm calling clang code could also be a problem. – Peter Cordes Oct 03 '18 at 19:09