Determine which single bit in the byte is set

Question

I have a byte I'm using for bitflags. I know that one and only one bit in the byte is set at any give time.

Ex: unsigned char b = 0x20; //(00100000) 6th most bit set

I currently use the following loop to determine which bit is set:

int getSetBitLocation(unsigned char b) {
  int i=0;
  while( !((b >> i++) & 0x01) ) { ; }
  return i;
}

How do I most efficiently determine the position of the set bit? Can I do this without iteration?

Assuming an 8-bit byte, the use of a lookup table might help, unless you can make use of a 'count leading zeroes' primitive / instruction. — Brett Hale, Jan 20 '13 at 21:42
"Can I do this without iteration?" Use a lookup table or a switch statement. — David Heffernan, Jan 20 '13 at 21:44
I was hoping for a bit-twiddling hack, not an obvious `switch` statement — recursion.ninja, Jan 20 '13 at 21:46
(Make sure to run a performance analysis over *real input* in the *real execution/application context*; specific compiler code generation, branch prediction, caching, and other factors come into play and can entirely mitigate any "performance increases". Also, it often Just Doesn't Matter - 97/3.) — , Jan 20 '13 at 21:47
You can get to three comparisons and some math. Will it be actually faster than this loop? Hard to tell. — John Dvorak, Jan 20 '13 at 21:48
If this is your bottleneck, then you won't beat lookup table. I'm guessing that you've not actually done any profiling and are optimising without the benefit of that. — David Heffernan, Jan 20 '13 at 21:57
"most efficiently" would be with your compilers built-in function for this, if there is one. Compilers are unlikely to detect what all the crazy tricks mentions in the answers are actually trying to do and convert them to something more sensible. Use those tricks at your own risk. — harold, Jan 21 '13 at 08:18
[Here's the most pure "bit-twiddling hack"](http://stackoverflow.com/a/14430221/4279) that you've asked among all posted answers. It doesn't use lookups or switch/if/else branching. For performance, consider solutions that process more than one byte at a time e.g., [similar to solutions for xor operation](http://stackoverflow.com/questions/2119761/simple-python-challenge-fastest-bitwise-xor-on-data-buffers) — jfs, Jan 22 '13 at 03:29

score 7 · Accepted Answer · edited Feb 22 '15 at 21:09

7

Can I do this without iteration?

It is indeed possible.

How do I most efficiently determine the position of the set bit?

You can try this algorithm. It splits the char in half to search for the top bit, shifting to the low half each time:

int getTopSetBit(unsigned char b) {
  int res = 0;
  if(b>15){
    b = b >> 4;
    res = res + 4;
  }
  if(b>3){
    b = b >> 2;
    res = res + 2;
  }

  //thanks @JasonD
  return res + (b>>1);
}

It uses two comparisons (three for uint16s, four for uint32s...). and it might be faster than your loop. It is definitely not shorter.

Based on the idea by Anton Kovalenko (hashed lookup) and the comment by 6502 (division is slow), I also suggest this implementation (8-bit => 3-bit hash using a de-Bruijn sequence)

int[] lookup = {7, 0, 5, 1, 6, 4, 3, 2};

int getBitPosition(unsigned char b) {
  // return lookup[(b | (b>>1) | (b>>2) | (b>>4)) & 0x7];
  return lookup[((b * 0x1D) >> 4) & 0x7];
}

or (larger LUT, but uses just three terms instead of four)

int[] lookup = {0xFF, 0, 1, 4, 2, 0xFF, 5, 0xFF, 7, 3, 0xFF, 0xFF, 6, 0xFF, 0xFF, 0xFF};

int getBitPosition(unsigned char b) {
  return lookup[(b | (b>>3) | (b>>4)) & 0xF];
}

edited Feb 22 '15 at 21:09

Izzy

402
6
16

answered Jan 20 '13 at 21:54

John Dvorak

26,799
13
69
83

You can remove the last if() and just add b-1 – JasonD Jan 20 '13 at 21:57
@JasonD not `b-1`, but `b>>1`. `b` could be `2` or `3`. – John Dvorak Jan 20 '13 at 21:59
@JanDvorak clever, this *might* be more efficient – recursion.ninja Jan 20 '13 at 22:00
2

@JanDvorak He said only 1 bit was set, so it should only be 1 or 2. Your alternative is more general though. – JasonD Jan 20 '13 at 22:01
@JasonD you are right; `b-1` would do as well in the restricted case (and it would be better if you want to return -1 if no bit is set) – John Dvorak Jan 20 '13 at 22:03
1

@JanDvorak Your hashed lookup-table `lookup[((b * 0x1D) >> 4) & 0x7];` passed my 8 test-cases for bytes `{ 0x01,0x02,0x04,0x08,0x10,0x20,0x40,0x80 }` when checking for the location of the only set bit. Very elegant! Might I ask where you got the hash from? Could you link to the De-Brujin sequences? – recursion.ninja Jan 20 '13 at 23:51
@awashburn I was looking for a set of rotations that, when summed, would not produce collisions in the bottom three or four bits. This turns out to be isomorphic to finding a cyclic sequence of eight bits that have no duplicate trigrams - which (all n-grams are unique) is exactly the requirement for a de-Brujin sequence. Finding one consisted of choosing for each rotation to be active or not. It then turned out to be possible to replace rotations by shifts (if a bit becomes interesting, it's due to a left shift). Cf. http://en.wikipedia.org/wiki/De_Bruijn_sequence – John Dvorak Jan 21 '13 at 00:04
Cool trick! I had always used something like `lut[foo % 67]` to turn `foo` into `log2(foo)` when `foo` is a power of two. – tmyklebu Jan 21 '13 at 02:52

Anton Kovalenko · Answer 2 · 2013-01-20T22:20:43.093

5

Lookup table is simple enough, and you can reduce its size if the set of values is sparse. Let's try with 11 elements instead of 128:

unsigned char expt2mod11_bits[11]={0xFF,0,1,0xFF,2,4,0xFF,7,3,6,5};
unsigned char pos = expt2mod11_bits[b%11];
assert(pos < 8);
assert(1<<pos == b);

Of course, it's not necessarily more effective, especially for 8 bits, but the same trick can be used for larger sizes, where full lookup table would be awfully big. Let's see:

unsigned int w; 
....
unsigned char expt2mod19_bits[19]={0xFF,0,1,13,2,0xFF,14,6,3,8,0xFF,12,15,5,7,11,4,10,9};
unsigned char pos = expt2mod19_bits[w%19];
assert(pos < 16);
assert(1<<pos == w);

edited Jan 20 '13 at 22:20

answered Jan 20 '13 at 22:01

Anton Kovalenko

20,999
2
37
69

Nice idea for larger sizes... but for 8 bit I suppose that the modulo operation is going to be a problem. – 6502 Jan 20 '13 at 22:08
Interesting -- did you discover 11 by trial and error, or is there some principle involved? – Vaughn Cato Jan 20 '13 at 22:09
@6502 why do you think it's a problem? You just need a module that avoids collisions. – John Dvorak Jan 20 '13 at 22:09
@JanDvorak: because it's going to be slow – 6502 Jan 20 '13 at 22:10
@6502 `mod` is one instruction on x86. I predict this will take two clock cycles at _most_ (modulo and lookup). – John Dvorak Jan 20 '13 at 22:11
2

On x86, I would try inline assembly with BSF. – Anton Kovalenko Jan 20 '13 at 22:15
1

@AntonKovalenko except mod-11 is platform-agnostic while BSF isn't. – John Dvorak Jan 20 '13 at 22:19
1

@JanDvorak: you're a bit too optimistic. Division has always been annoyingly slow... so slow that when designing the first Pentium processor they even tried to cut some corners ;-). If you need the quotient the division can be traded for a multiplication in many cases, but if you need the remainder that trick doesn't work. And not even integer multiplication IIRC is 2 cycles on any processor I used. – 6502 Jan 20 '13 at 22:22
@6502 then binary search might be the fastest solution. I'm wondering if it's possible to create a faster hash function than `mod 11`. – John Dvorak Jan 20 '13 at 22:24
Division by a constant can be optimised by modern compilers to remove the division. Try `gcc -S -O -x c - -o - <<< "int f(int x) { return x % 11; }"` – Jan 20 '13 at 22:30
2

@JanDvorak It doesn't fit in the comment area, but it starts with a multiplication by 780903145. Note that that number is exactly `0x200000003 / 11`. – Jan 20 '13 at 22:33
@JanDvorak: if that computation is the bottleneck then it has to be done an awful number of times. In that case I think that using some L1 cache is not going to be a problem and a lookup in a static 256-entries table is just one operation. Only a real check with real data can however provide an answer. Also I think that it could make sense for example process two or four bytes at a time but more context is needed... – 6502 Jan 20 '13 at 22:43
@hvd would a hash function based on two additions and two bitshifts be faster than a hash function based on a single modulo? – John Dvorak Jan 20 '13 at 22:51
@6502: it still wastes a cache slot. Which could be used for different parts (the outer code) of the program. – wildplasser Jan 20 '13 at 22:52
I added a solution that uses a different hash than `%11` to my answer. It should be faster than this one. – John Dvorak Jan 20 '13 at 23:00

6502 · Answer 3 · 2013-01-20T22:11:42.157

3

This is a quite common problem for chess programs that use 64 bits to represent positions (i.e. one 64-bit number to store where are all the white pawns, another for where are all the black ones and so on).

With this representation there is sometimes the need to find the index 0...63 of the first or last set bit and there are several possible approaches:

Just doing a loop like you did
Using a dichotomic search (i.e. if x & 0x00000000ffffffffULL is zero there's no need to check low 32 bits)
Using special instruction if available on the processor (e.g. bsf and bsr on x86)
Using lookup tables (of course not for the whole 64-bit value, but for 8 or 16 bits)

What is faster however really depends on your hardware and on real use cases. For 8 bits only and a modern processor I think that probably a lookup table with 256 entries is the best choice...

But are you really sure this is the bottleneck of your algorithm?

edited Jan 20 '13 at 22:11

answered Jan 20 '13 at 22:05

6502

112,025
15
165
265

`...FULL` means "unsigned long long"? – John Dvorak Jan 20 '13 at 22:06
`ULL` is the suffix, the `F` is just last hexadecimal digit, I'll edit to make it more clear – 6502 Jan 20 '13 at 22:11
I meant, is it actually compilable, or just a placeholder/elliplis :-) – John Dvorak Jan 20 '13 at 22:13

wildplasser · Answer 4 · 2013-01-20T22:29:44.090

2

unsigned getSetBitLocation(unsigned char b) {
  unsigned pos=0;
  pos = (b & 0xf0) ? 4 : 0; b |= b >>4;
  pos += (b & 0xc) ? 2 : 0; b |= b >>2;
  pos += (b & 0x2) ? 1 : 0; 
  return pos; 
}

It would be hard to do it jumpfree. Maybe with the Bruin sequences ?

edited Jan 20 '13 at 22:29

answered Jan 20 '13 at 21:55

wildplasser

43,142
8
66
109

score 2 · Answer 5 · answered Jan 20 '13 at 22:47

2

Based on log2 calculation in Find the log base 2 of an N-bit integer in O(lg(N)) operations:

int getSetBitLocation(unsigned char c) {
  // c is in {1, 2, 4, 8, 16, 32, 64, 128}, returned values are {0, 1, ..., 7}
  return (((c & 0xAA) != 0) |
          (((c & 0xCC) != 0) << 1) |
          (((c & 0xF0) != 0) << 2));
}

answered Jan 20 '13 at 22:47

jfs

399,953
195
994
1,670

This relies on the fact that `!=` returns `0` or `1`. Is that guaranteed by the spec? I also guess it doesn't really avoid branching due to how `!=` is implemented. – John Dvorak Jan 20 '13 at 22:49
@JanDvorak: yes, it was added in c99. (I used the ternary to be downward compatible with c89/c90) And it indeeds looks not jumpfree. (but there are some strange instructions nowadays) – wildplasser Jan 20 '13 at 22:54
@JanDvorak: [yes, it is guaranteed even before c99](http://stackoverflow.com/a/10632270/4279). Also I don't see any branching in the code. You could [see its assembly online](http://assembly.ynh.io/). Could you point out the branching instructions? – jfs Jan 22 '13 at 02:52
@J.F.Sebastian Without optimisation, wrapped in `int main(char c)`, the three `!=`s turn to `set if not equal` on line `0x14` (impressive), `0x25: je 0x2E / 0x2C: jmp 0x33` (ouch, a branching instruction), and the same combo appears at `0x40`. With optimisations turned on, the `je`s turn to `subtract with borrow`s (impressive) ARM uses a `move if not equal / move if equals` combo (nice instructions). Thanks for the link to the compiler. – John Dvorak Jan 22 '13 at 08:15
Note that by deBrujin-based solution turns to mere five operations: `MOVZB, IMUL, SAR, AND, MOV` – John Dvorak Jan 22 '13 at 08:35
@JanDvorak: you can paste the code as is. The online service accepts the function without naming it `main()`. There is no point worrying about branching if you disable optimizations in the compiler. De Bruijn sequences look interesting. – jfs Jan 22 '13 at 09:34
I was surprised it didn't mung my `IMUL` to a series of additions. – John Dvorak Jan 22 '13 at 09:37

score 1 · Answer 6 · edited May 23 '17 at 12:15

1

Easiest thing is to create a lookup table. The simplest one will be sparse (having 256 elements) but it would technically avoid iteration.

This comment here technically avoids iteration, but who are we kidding, it is still doing the same number of checks: How to write log base(2) in c/c++

Closed form would be log2(), a la, log2() + 1 But I'm not sure how efficient that is - possibly the CPU has an instruction for taking base 2 logrithms?

edited May 23 '17 at 12:15

Community

1
1

answered Jan 20 '13 at 21:50

poundifdef

18,726
23
95
134

`FYL2X` instruction v/s bit-shifts? Not sure there. – Anirudh Ramanathan Jan 20 '13 at 21:57
The memory requirements of a sparse look-up table with 255 elements do not outweigh the minor performance optimization. – recursion.ninja Jan 20 '13 at 22:01
How many elements will a sparse lookup table have when CHAR_BIT == 16, or when CHAR_BIT == 32? – autistic Jan 20 '13 at 22:03
“possibly the CPU has an instruction for taking base 2 logrithms?” There is an instruction to count leading zeroes in most processors. – Pascal Cuoq Jan 21 '13 at 02:45

woryzower · Answer 7 · 2013-01-20T22:21:55.980

0

if you define

const char bytes[]={1,2,4,8,16,32,64,128}

and use

struct byte{
char data;
int pos;
}
void assign(struct byte b,int i){

b.data=bytes[i];
b.pos=i
}

you don't need to determine the position of the set bit

edited Jan 20 '13 at 22:21

answered Jan 20 '13 at 21:55

woryzower

956
3
15
22

`you don't need to determine the position of the set bit` -- I'd like to dispute – John Dvorak Jan 20 '13 at 21:56
He's fighting the OPPOSITE problem... given the value finding the index. Scanning your table is more or less like just checking the bits. – 6502 Jan 20 '13 at 22:04

autistic · Answer 8 · 2016-01-23T13:05:31.743

A lookup table is fast and easy when CHAR_BIT == 8, but on some systems, CHAR_BIT == 16 or 32 and a lookup table becomes insanely bulky. If you're considering a lookup table, I'd suggest wrapping it; make it a "lookup table function", instead, so that you can swap the logic when you need to optimise.

Using divide and conquer, by performing a binary search on a sorted array, involves comparisons based on log2 CHAR_BIT. That code is more complex, involving an initialisation of an array of unsigned char to use as a lookup table for a start. Once you have such the array initialised, you can use bsearch to search it, for example:

#include <stdio.h>
#include <stdlib.h>
void uchar_bit_init(unsigned char *table) {
    for (size_t x = 0; x < CHAR_BIT; x++) {
        table[x] = 1U << x;
    }
}
int uchar_compare(void const *x, void const *y) {
    char const *X = x, *Y = y;
    return (*X > *Y) - (*X < *Y);
}
size_t uchar_bit_lookup(unsigned char *table, unsigned char value) {
    unsigned char *position = bsearch(lookup, c, sizeof lookup, 1, char_compare);
    return position ? position - table + 1 : 0;
}
int main(void) {
    unsigned char lookup[CHAR_BIT];
    uchar_bit_init(lookup);
    for (;;) {
        int c = getchar();
        if (c == EOF) { break; }
        printf("Bit for %c found at %zu\n", c, uchar_bit_lookup(lookup, c));
    }
}

P.S. This sounds like micro-optimisation. Get your solution done (abstracting the operations required into these functions), then worry about optimisations based on your profiling. Make sure your profiling targets the system that your solution will run on if you're going to focus on micro-optimisations, because the efficiency of micro-optimisations differ widely as hardware differs even slightly... It's usually a better idea to buy a faster PC ;)

Determine which single bit in the byte is set

8 Answers8

Linked

Related