As said. e.g. for the 8-bit(just for example, no byte order considered) integer 00100100 (base 2), is there an instruction gives 5?
Asked
Active
Viewed 2,758 times
2
-
possible duplicate of [Bit twiddling: which bit is set?](http://stackoverflow.com/questions/3465098/bit-twiddling-which-bit-is-set) – Will A Sep 26 '10 at 04:24
-
For using it from C or C++, see [What is the fastest/most efficient way to find the highest set bit (msb) in an integer in C?](https://stackoverflow.com/q/671815). (And my answer on [Find most significant bit (left-most) that is set in a bit array](https://stackoverflow.com/a/54760134) for details) – Peter Cordes May 22 '21 at 23:19
2 Answers
5
Technically, no. There's BSR to find the most significant bit that's set, and BSF to find the least significant bit that's set -- but the smallest item either will work on is a 16-bit word.

Parthian Shot
- 1,390
- 1
- 13
- 24

Jerry Coffin
- 476,176
- 80
- 629
- 1,111
-
For a byte, just make sure it's zero-extended into a wider register (like you'd normally do anyway). Since BSR and BSF both find bit-indexes, the extra leading zeros don't matter. (Or with LZCNT which is faster on AMD, use 31-lzcnt e.g. `lzcnt eax, eax` / `xor eax, 31` using the same trick GCC uses to implement `__builtin_clz` in terms of `bsr`) – Peter Cordes May 22 '21 at 23:21
3
Yep, BSR. However, do note that the bithack page claims that at least on one CPU a sequence (unrolled loop) of bit shift operations is faster than a single BSR.

Alex Martelli
- 854,459
- 170
- 1,222
- 1,395
-
1I'm confused by the comment on the bithacks page. BSR by itself doesn't compute the next highest power-of-two. You have to use `64 - BSR(v - 1)`, which is obviously not "a single BSR assembly language instruction". Perhaps I'm just misreading it. – Marcelo Cantos May 21 '13 at 21:46
-
Not faster. From the bithack page: "On a Athlon™ XP 2100+ I've found the above shift-left and then OR code is as fast as using a single BSR assembly language instruction" – liorda Jun 02 '16 at 09:14
-
Some old AMD CPUs have slow microcoded `bsr` / `bsf` (which are thoroughly obsolete 10 years after this answer was posted). They're down to about 7 uops on modern AMD, with 3 cycle latency (https://agner.org/optimize/), vs. 1 uop / 3c latency on Intel. (Except for the "false" dependency on the output, from its funky behaviour of leaving the destination unchanged on src=0, documented by AMD but implemented by both AMD and Intel.) (`lzcnt` / `tzcnt` are single-uop on all CPUs that support them.) – Peter Cordes May 22 '21 at 23:28
-
See also [How can x86 bsr/bsf have fixed latency, not data dependent? Doesn't it loop over bits like the pseudocode shows?](https://stackoverflow.com/q/54509623) re: performance in general and quirks. – Peter Cordes May 22 '21 at 23:28