what is the most efficient way to flip all the bits from the least significant bit up to the most significant last 1 bit value?

Question

Say for example I have a uint8_t that can be of any value, and I only want to flip all the bits from the least significant bit up to the most significant last 1 bit value? How would I do that in the most efficient way?, Is there a solution where I can avoid using a loop?

here are some cases:

left side is the original bits - right side after the flips.

00011101 -> 00000010
00000000 -> 00000000
11111111 -> 00000000
11110111 -> 00001000
01000000 -> 00111111

[EDIT]

The type could also be larger than uint8_t, It could be uint32_t, uint64_t and __uint128_t. I just use uint8_t because it's the easiest size to show in the example cases.

Define "efficient". What measure? `uint8_t` is only 256 values so you could have a lookup table. — kaylum, Apr 26 '22 at 05:27
I just use ```uint8_t``` as an example because it's the easiest to represent but the number could also be ```uint32_t```, ```uint64_t``` and even ```__uint128_t``` — 0xdeadbeef, Apr 26 '22 at 05:29
@PaulR for now I'm currently working on an x86_64 so maybe x86_64. — 0xdeadbeef, Apr 26 '22 at 05:32
If you have a quick way to [count leading zeros](https://stackoverflow.com/a/673781/3386109), the rest is easy. — user3386109, Apr 26 '22 at 05:46
@user3386109 yes but the simple solution of `x ^ (ones >> lzcnt(x))` doesn't work (dies if `x = 0`) so there is something interesting about the rest as well — harold, Apr 26 '22 at 05:59
@harold The `x == 0` case can be handled with a conditional operator `return (x==0) ? 0 : (ones >> lzcnt(x));` which gcc can compile to a conditional move on x86. — user3386109, Apr 26 '22 at 06:05

score 6 · Accepted Answer · edited Apr 28 '22 at 02:31

In general I expect that most solutions will have roughly this form:

Compute the mask of bits that need to flipped
XOR by that mask

As mentioned in the comments, x64 is a target of interest, and on x64 you can do step 1 like this:

Find the 1-based position p of the most significant 1, by leading zeroes (_lzcnt_u64) and subtracting that from 64 (or 32 whichever is appropriate).
Create a mask with p consecutive set bits starting from the least significant bit, probably using _bzhi_u64.

There are some variations, such as using BitScanReverse to find the most significant 1 (but it has an ugly case for zero), or using a shift instead of bzhi (but it has an ugly case for 64). lzcnt and bzhi is a good combination with no ugly cases. bzhi requires BMI2 (Intel Haswell or newer, AMD Zen or newer).

Putting it together:

x ^ _bzhi_u64(~(uint64_t)0, 64 - _lzcnt_u64(x))

Which could be further simplified to

_bzhi_u64(~x,  64 - _lzcnt_u64(x))

As shown by Peter. This doesn't follow the original 2-step plan, rather all bits are flipped, and then the bits that were originally leading zeroes are reset.

Since those original leading zeroes form a contiguous sequence of leading ones in ~x, an alternative to bzhi could be to add the appropriate power of two to ~x (though sometimes zero, which might be thought of as 2⁶⁴, putting the set bit just beyond the top of the number). Unfortunately the power of two that we need is a bit annoying to compute, at least I could not come up with a good way to do it, it seems like a dead end to me.

Step 1 could also be implemented in a generic way (no special operations) using a few shifts and bitwise ORs, like this:

// Get all-ones below the leading 1
// On x86-64, this is probably slower than Paul R's method using BSR and shift
//   even though you have to special case x==0
m = x | (x >> 1);
m |= m >> 2;
m |= m >> 4;
m |= m >> 8;
m |= m >> 16;
m |= m >> 32;  // last step should be removed if x is 32-bit

AMD CPUs have slowish BSR (but fast LZCNT; https://uops.info/), so you might want this shift/or version for uint8_t or uint16_t (where it takes fewest steps), especially if you need compatibility with all CPUs and speed on AMD is more important than on Intel.

This generic version is also useful within SIMD elements, especially narrow ones, where we don't have a leading-zero-count until AVX-512.

It seems that I don't have the ```_bzhi_u64``` instruction, though the alternative you provided worked. — 0xdeadbeef, Apr 26 '22 at 06:31
@kabibesadagat you probably shouldn't use that "generic" way unless you really have to (6 of those shift-OR steps isn't great, not horrible either, but not great), you can still use a shift to replace the BZHI as long as you deal with the edge case — harold, Apr 26 '22 at 06:35
@kabibesadagat: You never need the "generic" version of the `m = x | (x>>1)` stuff on an x86 CPU. You always have at least `__builtin_clzll` or equivalent, which can at worst compile to a BSR instruction, so you need to special-case zero. Or in 32-bit mode, two BSR instructions on the halves, and other checks. But anyway, the `bzhi` part is what needs replacing if you don't have it, not the `lzcnt` part. — Peter Cordes, Apr 27 '22 at 03:02
@harold: `_bzhi_u64(~x, 64 - _lzcnt_u64(x))` avoids needing a constant `-1`, saving an instruction or two depending on compiler. https://godbolt.org/z/edqMf5GMa **The not-flipped bits are all zero.** — Peter Cordes, Apr 27 '22 at 23:55

Peter Cordes · Answer 2 · 2022-04-28T13:05:34.667

TL:DR: use a uint64_t shift to implement efficiently with uint32_t when compiling for 64-bit machines that have lzcnt (AMD since K10, Intel since Haswell). Without lzcnt (only bsr that's baseline for x86) the n==0 case is still special.

For the uint64_t version, the hard part is that you have 65 different possible positions for the highest set bit, including non-existent (lzcnt producing 64 when all bits are zero). But a single shift with 64-bit operand-size on x86 can only produce one of 64 different values (assuming a constant input), since x86 shifts mask the count like foo >> (c&63)

Using a shift requires special-casing one leading-bit-position, typically the n==0 case. As Harold's answer shows, BMI2 bzhi avoids that, allowing bit counts from 0..64.

Same for 32-bit operand-size shifts: they mask c&31. But to generate a mask for uint32_t, we can use a 64-bit shift efficiently on x86-64. (Or 32-bit for uint16_t and uint8_t. Fun fact: x86 asm shifts with 8 or 16-bit operand-size still mask their count mod 32, so they can shift out all the bits without even using a wider operand-size. But 32-bit operand size is efficient, no need to mess with partial-register writes.)

This strategy is even more efficient than bzhi for a type narrower than register width.

// optimized for 64-bit mode, otherwise 32-bit bzhi or a cmov version of Paul R's is good

#ifdef __LZCNT__
#include <immintrin.h>
uint32_t flip_32_on_64(uint32_t n)
{
    uint64_t mask32 = 0xffffffff;  // (uint64_t)(uint32_t)-1u32
    // this needs to be _lzcnt_u32, not __builtin_clz; we need 32 for n==0
    // If lznct isn't available, we can't avoid handling n==0  zero specially
    uint32_t mask = mask32 >> _lzcnt_u32(n);
    return n ^ mask;
}
#endif

This works equivalently for uint8_t and uint16_t (literally the same code with same mask, using a 32-bit lzcnt on them after zero-extension). But not uint64_t (You could use a unsigned __int128 shift, but shrd masks its shift count mod 64 so compilers still need some conditional behaviour to emulate it. So you might as well do a manual cmov or something, or sbb same,same to generate a 0 or -1 in a register as the mask to be shifted.)

Godbolt with gcc and clang. Note that it's not safe to replace _lzcnt_u32 with __builtin_clz; clang11 and later assume that can't produce 32 even when they compile it to an lzcnt instruction¹, and optimize the shift operand-size down to 32 which will act as mask32 >> clz(n) & 31.

# clang 14 -O3 -march=haswell  (or znver1 or bdver4 or other BMI2 CPUs)
flip_32_on_64:
        lzcnt   eax, edi           # skylake fixed the output false-dependency for lzcnt/tzcnt, but not popcnt.  Clang doesn't care, it's reckless about false deps except inside a loop in a single function.
        mov     ecx, 4294967295
        shrx    rax, rcx, rax
        xor     eax, edi
        ret

Without BMI2, e.g. with -march=bdver1 or barcelona (aka k10), we get the same code-gen except with shr rax, cl. Those CPUs do still have lzcnt, otherwise this wouldn't compile.

(I'm curious if Intel Skylake Pentium/Celeron run lzcnt as lzcnt or bsf. They lack BMI1/BMI2, but lzcnt has its own feature flag. It seems low-power uarches as recent as Tremont are missing lzcnt, though, according to InstLatx64 for a Pentium Silver N6005 Jasper Lake-D, Tremont core. I didn't manually look for the feature bit in the raw CPUID dumps of recent Pentium/Celeron, but Instlat does have those available if someone wants to check.)

Anyway, bzhi also requires BMI2, so if you're comparing against that for any size but uint64_t, this is the comparison.

This shrx version can keep its -1 constant around in a register across loops. So the mov reg,-1 can be hoisted out of a loop after inlining, if the compiler has a spare register. The best bzhi strategy doesn't need a mask constant so it has nothing to gain. _bzhi_u64(~x, 64 - _lzcnt_u64(x)) is 5 uops, but works for 64-bit integers on 64-bit machines. Its latency critical path length is the same as this. (lzcnt / sub / bzhi).

Without LZCNT, one option might be to always flip as a way to get FLAGS set for CMOV, and use -1 << bsr(n) to XOR some of them back to the original state. This could reduce critical path latency. IDK if a C compiler could be coaxed into emitting this. Especially not if you want to take advantage of the fact that real CPUs keep the BSR destination unchanged if the source was zero, but only AMD documents this fact. (Intel says it's an "undefined" result.)

(TODO: finish this hand-written asm idea.)

Other C ideas for the uint64_t case: cmov or cmp/sbb (to generate a 0 or -1) in parallel with lzcnt to shorten the critical path latency? See the Godbolt link where I was playing with that.

ARM/AArch64 saturate their shift counts, unlike how x86 masks for scalar. If one could take advantage of that safely (without C shift-count UB) that would be neat, allowing something about as good as this.

x86 SIMD shifts also saturate their counts, which Paul R took advantage of with an AVX-512 answer using vlzcnt and variable-shift. (It's not worth copying data to an XMM reg and back for one scalar shift, though; only useful if you have multiple elements to do.)

Footnote 1: clang codegen with `__builtin_clz` or `...ll`

Using __builtin_clzll(n) will get clang to use 64-bit operand-size for the shift, since values from 32 to 63 become possible. But you can't actually use that to compile for CPUs without lzcnt. The 63-bsr a compiler would use without lzcnt available would not produce the 64 we need for that case. Not unless you did n<<=1; / n|=1; or something before the bsr and adjusted the result, but that would be slower than cmov.

If you were using a 64-bit lzcnt, you'd want uint64_t mask = -1ULL since there will be 32 extra leading zeros after zero-extending to uint64_t. Fortunately all-ones is relatively cheap to materialize on all ISAs, so use that instead of 0xffffffff00000000ULL

Isn't that weird thing Clang 11+ does with a 32-bit shift a consequence of `__builtin_clz` being undefined for zero, so it thinks that 32 is not a possible shift count? It goes away if I use `_lzcnt_u64` or `__builtin_clzll` there — harold, Apr 28 '22 at 02:08
@harold: ah, that would be it. Yeah, coming back to this today I realized that it needed to be `lzcnt` not `__builtin_clz` which could use `bsr`, but hadn't updated the Godbolt link, and hadn't thought through the implications for the clang code-gen. Thanks. — Peter Cordes, Apr 28 '22 at 02:18

score 3 · Answer 3 · edited Apr 28 '22 at 02:56

3

Here’s a simple example for 32 bit ints that works with gcc and compatible compilers (clang et al), and is portable across most architectures.

uint32_t flip(uint32_t n)
{
    if (n == 0) return 0;
    uint32_t mask = ~0U >> __builtin_clz(n);
    return n ^ mask;
}

DEMO

We could avoid the extra check for n==0 if we used lzcnt on x86-64 (or clz on ARM), and we were using a shift that allowed a count of 32. (In C, shifts by the type-width or larger are undefined behaviour. On x86, in practice the shift count is masked &31 for shifts other than 64-bit, so this could be usable for uint16_t or uint8_t using a uint32_t mask.)

Be careful to avoid C undefined behaviour, including any assumption about __builtin_clz with an input of 0; modern C compilers are not portable assemblers, even though we sometimes wish they were when the language doesn't portably expose the CPU features we want to take advantage of. For example, clang assumes that __builtin_clz(n) can't be 32 even when it compiles it to lzcnt.

See @PeterCordes's answer for details.

edited Apr 28 '22 at 02:56

Peter Cordes

328,167
45
605
847

answered Apr 26 '22 at 05:59

Paul R

208,748
37
389
560

I wonder if this will be faster by a large margin than @harold's alternative solution for ```uint32_t```? – 0xdeadbeef Apr 26 '22 at 06:35
2

Very similar to Harold’s BMI solution, I would expect. The one advantage of the above is that it’s portable across most/all architectures, since the compiler will use the most efficient instruction sequence available for `__builtin_clz` on the target platform. – Paul R Apr 26 '22 at 07:25
1

This makes sense, and could be made branchless by doing `mask = (n == 0 ? 0 : mask);` after the `-1U >> clz`. For 64-bit, use `__builtin_clzll`. With BMI2 for efficient shifts, it can compile even more efficiently. – Peter Cordes Apr 27 '22 at 03:15
1

If you're only doing 32-bit ints on a 64-bit machine, you can use `0xffffffffULL >> _lzcnt_u32(n)` so it can shift out all the bits if the leading-zero count is 32. (Instead of masking the shift-count to 0 for a 32-bit shift on x86 / x86-64.) ARM / AArch64 saturate shift counts, so if there's a way in C to express that, we could conceivably get `clz` and shift to do the trick without a zero check for full-width registers. – Peter Cordes Apr 27 '22 at 03:17
1

Don't *just* use `-mlzcnt`! Use `-march=haswell` to enable BMI1 and BMI2 as well, for efficient variable-count shifts on Intel, and not needing to use CL for the count. (AMD K10 and later have lzcnt, but only Zen for BMI2. Also excludes pentium/celeron pre Icelake). https://godbolt.org/z/Ej4GYxxec . Hrm, looks like a clang code-gen bug in clang 11 and later, using only 32-bit operand-size for my `(uint64_t)-1u32 >> clz` idea. But when compiled correctly, it's only 4 instructions plus a ret. (Or 5 if tuning properly for Haswell not Skylake, breaking the output dependency of `lzcnt`.) – Peter Cordes Apr 27 '22 at 08:09
1

This is still pretty good with just `-mlzcnt`, though, perhaps comparable to Harold's which also requires BMI2 for `bzhi`. Actually for uint32_t on x86-64, 64-bit shift operand-size seems even better. https://godbolt.org/z/Y51x15x85 – Peter Cordes Apr 27 '22 at 08:21
1

(writing this up into an answer myself, BTW. – Peter Cordes Apr 27 '22 at 08:33
Thanks for the feedback - I'd been oblivious to lzcnt v bsr etc previously, so it's been interesting to catch up on this. I look forward to your (no doubt comprehensive!) answer in due course. – Paul R Apr 27 '22 at 08:45
Unfortunately you introduced a bug here: `~0U >> 32` is UB, and on x86 will work like `~0U >> 0`. Using `((uint64_t)~0U) >> lzcnt` is safe, but not with `__builtin_clz`. clang11 and later know that clz(n) produces an undefined result for n==0, and use that to assume that the count can never be 32. (Even if it ends up compiling to an `lzcnt` instruction, unfortunately). I'm just going to basically roll that back in your answer, since I did end up posting an answer myself. – Peter Cordes Apr 28 '22 at 02:39
@PeterCordes: thanks for catching that - I was just trying to save a potentially redundant CMOV when using lzcnt, and of course I messed up. – Paul R Apr 28 '22 at 07:54

Paul R · Answer 4 · 2022-04-27T09:59:25.173

If your use case is performance-critical you might also want to consider a SIMD implementation for performing the bit flipping operation on a large number of elements. Here's an example using AVX512 for 32 bit elements:

void flip(const uint32_t in[], uint32_t out[], size_t n)
{
    assert((n & 7) == 0); // for this example we only handle arrays which are vector multiples in size
    for (size_t i = 0; i + 8 <= n; i += 8)
    {
        __m512i vin = _mm512_loadu_si512(&in[i]);
        __m512i vlz = _mm512_lzcnt_epi32(vin);
        __m512i vmask = _mm512_srlv_epi32(_mm512_set1_epi32(-1), vlz);
        __m512i vout = _mm512_xor_si512(vin, vmask);
        _mm512_storeu_si512(&out[i], vout);
    }
}

This uses the same approach as other solutions, i.e. count leading zeroes, create mask, XOR, but for 32 bit elements it processes 8 elements per loop iteration. You could implement a 64 bit version of this similarly, but unfortunately there are no similar AVX512 intrinsics for element sizes < 32 bits or > 64 bits.

You can see the above 32 bit example in action on Compiler Explorer (note: you might need to hit the refresh button at the bottom of the assembly pane to get it to re-compile and run if you get "Program returned: 139" in the output pane - this seems to be due to a glitch in Compiler Explorer currently).

Oh nice, good one since x86 SIMD shifts saturate their count (so they can shift out all the bits), unlike scalar shifts which mask it so it wraps around. (And unlike ISO C shifts where `uint32_t >> 32` is simply undefined behaviour.) — Peter Cordes, Apr 27 '22 at 13:26

what is the most efficient way to flip all the bits from the least significant bit up to the most significant last 1 bit value?

4 Answers4

Footnote 1: clang codegen with __builtin_clz or ...ll

Footnote 1: clang codegen with `__builtin_clz` or `...ll`