0

The page https://software.intel.com/content/www/us/en/develop/blogs/haswell-new-instruction-descriptions-now-available.html mentions BLSR instruction that can efficiently clear the least significant bit set. Is there an efficient equivalent of such instruction for clearing the most significant bit? For example 001001010101000 should be turned to 000001010101000. Is it possible (and how) to implement it in C++ more efficiently than just with a bit shifts?

efficient bit-manipulation instructions

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
Curious
  • 507
  • 3
  • 16
  • add/sub carry from LSB to MSB, so it's a totally different problem. Unless you're on ARM, then you can use `rbit` to bit-reverse before/after `x &= x -1` (the bithack x86 BLSR uses). – Peter Cordes May 06 '21 at 12:55
  • Did you consider `bsr` / `btr` to find and then reset that bit, if you're talking about x86 assembly specifically? – Peter Cordes May 06 '21 at 13:01
  • @PeterCordes Yes – Curious May 06 '21 at 13:04
  • 1
    I just realized that maybe BZHI instruction together with LZCNT may also solve the problem. Unfortunately, it is more than a single instruction. – Curious May 06 '21 at 13:07
  • 2
    `lzcnt` doesn't give you a bit-position, it gives you `31-bsr`. But yeah, if you're optimizing for AMD (where BSR is slower than LZCNT), then it's worth considering LZCNT + xor reg, 31 + BZHI. (LZCNT can only produce `32` for an input of zero, in which case it doesn't matter what bit-index you give to BZHI, so it's safe to use `x ^ 31` to do `31 - x` for 5-bit numbers. Like GCC does for `__builtin_clz` using BSR) – Peter Cordes May 06 '21 at 13:21
  • 1
    Kind of ironic that they chose to provide the bit operations that were already trivial to implement with a couple of existing fast instructions, rather than the ones that are hard to do in software and would greatly benefit from hardware support. I guess maybe this meant they didn't have to actually implement any new hardware (and maybe BLSR just decodes to DEC/AND uops), so maybe they were very cheap to provide, but it seems sort of pointless. – Nate Eldredge May 06 '21 at 15:25
  • @NateEldredge: AMD chooses to decode `blsr`/`blsi` to 2 uops (no new hardware outside the decoders I guess); Intel to one uop. But Intel decodes `bextr` to 2 uops, AMD to 1. Yeah, most of BMI1 is minor peephole optimizations for bithacks. When they're single uop, they have lower latency, which can maybe be nice if looping over the set bits of something with `blsr`. BMI2 `bzhi` is not as simple a bithack, and the instruction works for corner cases of 0 and 32 (with saturation). Other BMI1 instructions include VEX-coded `andn`, and lzcnt / tzcnt, finally giving bit-scan that works for 0. – Peter Cordes May 06 '21 at 21:52
  • @NateEldredge: And of course BMI2 `pdep` / `pext` require significant hardware (Intel), or else *very* slow microcoded fallback (AMD). And BMI2 has useful 3-operand shifts that don't waste uops on FLAGS updates, as well as working as copy-and-shift with VEX encoding. – Peter Cordes May 06 '21 at 21:54
  • 1
    @Peter - AMD Zen 3 has efficient HW PDEP/PEXT finally. – BeeOnRope May 08 '21 at 06:47

0 Answers0