
I need to figure out a bit-hack to find the first position where the bit value changes, whether from 0 to 1 or from 1 to 0.

// This would give 0 since no changes happen
let x1 = 0b00000000

// This also gives 0 since there is no change
let x2 = 0b11111111

// This would give 1 since the value changes after the first bit
let x3 = 0b10000000

// This would also give 1 since the value change 
// happens at the first bit
let x4 = 0b01111111

// This would return 7 since the change is at the 7th bit
let x5 = 0b00000001

// This would also return 7
let x6 = 0b11111110

Any recommendations on an incantation that would give these results?
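
For reference, a naive loop that produces these results might look like the following sketch (written in C++ rather than the snippet syntax above; the function name is mine). It counts how many leading bits match the top bit, returning 0 when all bits are the same:

#include <stdint.h>

// Naive reference for the behaviour described above (8-bit values): return the
// position of the first change, counted from the leading bit, or 0 if all bits match.
int first_change_naive(uint8_t x)
{
    int top = (x >> 7) & 1;            // value of the leading bit
    for (int i = 6; i >= 0; --i)
        if (((x >> i) & 1) != top)
            return 7 - i;              // first bit that differs from the leading bit
    return 0;                          // no change anywhere
}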

Matthew Crews
  • Are you looking for `BSR` aka `_BitScanReverse` aka `lzcnt` aka `std::countl_zero`? – Alex Guteniev Feb 08 '22 at 19:31
  • Similar but I need it to be tolerant of leading 0s or leading 1s which is why I provide examples. I'm looking for the first place it changes from 1 to 0 or 0 to 1. – Matthew Crews Feb 08 '22 at 19:34
  • If the leading bit is set, NOT it before bit-scan. Like `x = (x<0) ? ~x : x;` with `mov ecx, eax` / `xor eax, -1` (like NOT but sets FLAGS) / `cmovns ecx, eax`. – Peter Cordes Feb 08 '22 at 20:03
  • leading 1s are converted to leading 0s by taking bitwise inverse, that is NOT. – qwr Feb 08 '22 at 20:24

2 Answers


The key building block is a bit-scan like bsr (or 31-lzcnt). That finds the position of the highest 1 bit, and we can transform your input so it works for this.

If the leading bit is set, NOT the value before the bit-scan, like x = ((int32_t)x<0) ? ~x : x; in asm that's mov ecx, eax / xor eax, -1 (like NOT but sets FLAGS) / cmovns ecx, eax. It's like an absolute-value idiom, but with NOT instead of NEG. Another option is sar reg,31 (or cdq) and an xor with that mask to flip or not.
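
In C++, that sign-mask variant might look like this sketch (the function name is mine; a compiler may well turn either form into the same cmov or sar/xor sequence):

#include <stdint.h>

// Sketch of the sar/cdq idea: broadcast the sign bit into an all-zeros or
// all-ones mask, then XOR. XOR with all-ones is NOT; XOR with zero is a no-op.
uint32_t flip_if_top_bit_set(uint32_t x)
{
    uint32_t mask = (uint32_t)((int32_t)x >> 31);   // 0 or 0xFFFFFFFF (arithmetic shift)
    return x ^ mask;
}

Either way, the full function below uses the ternary form and lets the compiler pick the idiom.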

#include <bit>          // C++20
#include <stdint.h>

int transition_position(uint32_t x)
{
    x = ((int32_t)x<0) ? ~x : x;            // make the leading bits zero
    if (x == 0) __builtin_unreachable();    // optional: x|=1 is another way to handle the all-bits-equal case, allowing BSR without branching
    return std::countl_zero(x) ^ 31;        // 31-lzcnt, in a way that GCC can optimize back to bsr
}

This compiles nicely with GCC / clang: https://godbolt.org/z/63nEcnTso

GCC 11.2 and clang 13 at -O3 both produce this asm:
transition_position(unsigned int):
        mov     eax, edi
        sar     eax, 31            # missed optimization: should shift EDI so mov is off the critical path on CPUs without mov-elimination
        xor     eax, edi
        bsr     eax, eax
        ret

With -march=haswell, they both de-optimize by using lzcnt and then an actual xor eax,31. (That asm is still better for AMD, where bsr is significantly slower than tzcnt/lzcnt, so it's good for -mtune=generic -mbmi, but not for -march=haswell.)

Peter Cordes

As has been said, you should first NOT the value if it has leading ones. I only knew the cdq method for this and learned new instructions from that answer. But if you want to implement the bit hacks yourself, or you don't have those instructions available, you can try this for the leading zero count:

http://aggregate.org/MAGIC/#Leading%20Zero%20Count

It also uses the ones count:

http://aggregate.org/MAGIC/#Population%20Count%20(Ones%20Count)

There are many hacks for that; you can also find some at:

http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetNaive
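
A hedged sketch of roughly what those pages describe (32-bit; the function names are mine): a SWAR ones count, then a leading zero count that smears the highest set bit downward and counts what is left:

#include <stdint.h>

// SWAR ones count: add bits in 2-, 4- and 8-bit groups, then sum the four bytes.
uint32_t popcount32(uint32_t x)
{
    x -= (x >> 1) & 0x55555555u;                        // 2-bit sums
    x  = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);  // 4-bit sums
    x  = (x + (x >> 4)) & 0x0F0F0F0Fu;                  // 8-bit sums
    return (x * 0x01010101u) >> 24;                     // add the four byte sums
}

// Leading zero count: smear the highest 1 into every lower position,
// then count how many bits are still zero.
uint32_t leading_zero_count32(uint32_t x)
{
    x |= x >> 1;  x |= x >> 2;  x |= x >> 4;
    x |= x >> 8;  x |= x >> 16;
    return 32 - popcount32(x);
}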

If you have a number that changes between 0 and 1 multiple times and you want to search it starting from the least significant bit, get the most significant bit:

http://aggregate.org/MAGIC/#Most%20Significant%201%20Bit

and then do a trailing zero count:

http://aggregate.org/MAGIC/#Trailing%20Zero%20Count
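
A sketch of those two hacks in the same spirit (it reuses popcount32 from the sketch above; the names are mine):

// Isolate the most significant 1 bit: smear it downward, then keep only the
// top bit of that run.
uint32_t ms1b(uint32_t x)
{
    x |= x >> 1;  x |= x >> 2;  x |= x >> 4;
    x |= x >> 8;  x |= x >> 16;
    return x & ~(x >> 1);
}

// Trailing zero count via the ones count: x & -x isolates the lowest set bit,
// and subtracting 1 turns every bit below it into a 1.
uint32_t trailing_zero_count32(uint32_t x)
{
    return popcount32((x & -x) - 1);    // gives 32 when x == 0
}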

Another trailing zero count method I learned yesterday uses a De Bruijn sequence and a table lookup; you can find that here:

https://www.youtube.com/watch?v=ZusiKXcz_ac&t=3619s

at about minute 43.
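
For reference, a sketch of that De Bruijn trick (32-bit; the constant and table are the well-known ones from the Stanford bithacks page, and the function name is mine):

#include <stdint.h>

// De Bruijn multiply-and-lookup trailing zero count. v & -v isolates the lowest
// set bit; multiplying by the De Bruijn constant 0x077CB531 puts a unique 5-bit
// pattern in the top bits, which indexes the table.
int ntz_debruijn(uint32_t v)            // v == 0 gives 0 here; handle it separately if needed
{
    static const int table[32] = {
         0,  1, 28,  2, 29, 14, 24,  3, 30, 22, 20, 15, 25, 17,  4,  8,
        31, 27, 13, 23, 21, 19, 16,  7, 26, 12, 18,  6, 11,  5, 10,  9
    };
    return table[((v & -v) * 0x077CB531u) >> 27];
}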

  • http://aggregate.org/MAGIC/#Most%20Significant%201%20Bit sucks compared to using one hardware instruction to do a leading-zero count. (31-lzcnt = bit position in a uint32_t). Your links are good if you needed portability in a language that sucked and didn't expose what most CPUs can do efficiently (like C, or C++ before C++20 - in those languages you'd need intrinsics like `_lzcnt_u32`). Those algorithms just using shifts and bitwise booleans are clever, but still not as fast as a single-uop instruction. But this is an x86 question so you can use an intrinsic for BSR, guaranteed available. – Peter Cordes Feb 11 '22 at 07:16
  • And BTW, http://aggregate.org/MAGIC/#Population%20Count%20(Ones%20Count) claims "*Although many machines have single instructions for this, the single instructions are usually microcoded loops that test a bit per cycle; a log-time algorithm coded in C is often faster*". That's totally bogus. All x86 CPUs that support `popcnt` run it in constant time (https://uops.info); and I think that's true of basically every CPU for other ISAs, too. I'm pretty sure it takes less hardware than a multiply unit; Intel does it on the same execution unit as bsf/bsr/lzcnt/tzcnt, sharing transistors. – Peter Cordes Feb 11 '22 at 07:22
  • Yes sir, I am sure you are correct. I just found the bit hacks interesting and posted them. As I stated yours is the most efficient answer. – Javier Munoz Feb 11 '22 at 07:24
  • It is however sometimes necessary to have a popcnt fallback for portability, especially across ISAs. SO has a Q&A about that, [How to count the number of set bits in a 32-bit integer?](https://stackoverflow.com/q/109023) showing a SWAR popcount function using `(i * 0x01010101) >> 24` as the last step, summing 4x or 8x 8-bit sums. – Peter Cordes Feb 11 '22 at 07:25
  • _One additional note: the AMD Athlon optimization guidelines suggest a very similar algorithm that replaces the last three lines with return((x * 0x01010101) >> 24)_ [Magic](http://aggregate.org/MAGIC/#Population%20Count%20(Ones%20Count)) – Javier Munoz Feb 12 '22 at 07:36