144

I am looking for an efficient way to determine the position of the least significant bit that is set in an integer, e.g. for 0x0FF0 it would be 4.

A trivial implementation is this:

unsigned GetLowestBitPos(unsigned value)
{
   assert(value != 0); // handled separately

   unsigned pos = 0;
   while (!(value & 1))
   {
      value >>= 1;
      ++pos;
   }
   return pos;
}

Any ideas how to squeeze some cycles out of it?

(Note: this question is for people that enjoy such things, not for people to tell me xyzoptimization is evil.)

[edit] Thanks everyone for the ideas! I've learnt a few other things, too. Cool!

peterchen
  • while ( (value >> (++pos)) != 0 ); – Thomas Jul 26 '12 at 09:01
  • related: [position of the only 1 in a number in binary format](http://codegolf.stackexchange.com/questions/4093/position-of-the-only-1-in-a-number-in-binary-format/103531#103531) – Peter Cordes Dec 18 '16 at 08:08

23 Answers

198

Bit Twiddling Hacks offers an excellent collection of, er, bit twiddling hacks, with performance/optimisation discussion attached. My favourite solution for your problem (from that site) is «multiply and lookup»:

unsigned int v;  // find the number of trailing zeros in 32-bit v 
int r;           // result goes here
static const int MultiplyDeBruijnBitPosition[32] = 
{
  0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8, 
  31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};
r = MultiplyDeBruijnBitPosition[((uint32_t)((v & -v) * 0x077CB531U)) >> 27];
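
A later comment asks how the MultiplyDeBruijnBitPosition table is generated. A minimal generation sketch (not from the original answer): 0x077CB531 is a 32-bit De Bruijn sequence, so ((1u << i) * 0x077CB531U) >> 27 yields a distinct 5-bit value for every bit index i, and the table simply inverts that mapping.

#include <stdio.h>

int main(void)
{
    int table[32];
    for (int i = 0; i < 32; i++)
        table[((1u << i) * 0x077CB531U) >> 27] = i;  // invert the De Bruijn hash
    for (int i = 0; i < 32; i++)
        printf("%d%s", table[i], (i % 16 == 15) ? "\n" : ", ");
    return 0;
}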


Anton Tykhyy
  • Why the downvote? This is possibly the fastest implementation, depending on the speed of the multiplication. It's certainly compact code, and the (v & -v) trick is something everyone should learn and remember. – Adam Davis Apr 16 '09 at 17:52
  • +1 very cool, how expensive is a multiply operation though compared to an if(X&Y) operation? – Brian R. Bondy Apr 16 '09 at 18:05
  • Depends on the math processor. A single 32x32 integer multiply with no overflow, though, is exceptionally fast - this is what video games are built on. I'd be interested in seeing benchmarks though... – Adam Davis Apr 16 '09 at 18:21
  • That multiply is fast nowadays :) Thanks Anton, I had browsed Bit Twiddling Hacks casually, but completely missed this item. d'oh! – peterchen Apr 20 '09 at 07:59
  • This code is fast. I'm getting 497 million ops/sec in .Net on a 2GHz CPU. Nothing wrong with that! – IamIC Sep 29 '10 at 21:09
  • Does anybody know how the performance of this compares to the `__builtin_ffsl` or `ffsl`? – Steven Lu Dec 09 '12 at 00:46
  • The multiply is completely unnecessary if you don't need the table to be compact, and why would you need that? See RaviSharma's better solution. – Jim Balter May 19 '13 at 03:57
  • Cool, used this as part of my answer for question: http://stackoverflow.com/questions/19106342/checking-the-number-of-bits-on-in-a-byte – hyde Oct 01 '13 at 13:54
  • @Jim Balter, but modulo is very slow compared to multiplication on modern hardware. So I wouldn't call it a better solution. – Apriori Mar 28 '14 at 01:37
  • Umm... What about those programs that are strictly only 32 bit? This won't work. – Corey Iles Jul 01 '16 at 09:22
  • It seems to me that both value 0x01 and 0x00 result in value 0 from the array. Apparently this trick will indicate that the lowest bit is set if 0 is passed in! – abelenky Sep 16 '16 at 03:16
  • I've added an answer to expand on this one, as I found some issues turning this into a C++11 constexpr function, I also found the same issue mentioned by @abelenky, and I propose a solution for that. – Rodrigo Hernandez Mar 30 '17 at 18:26
  • `error C4146: unary minus operator applied to unsigned type, result still unsigned` I guess it should be `(v & (0-v))` in order not to cause any warnings – ST3 Nov 30 '17 at 14:59
  • How did this table `MultiplyDeBruijnBitPosition` get generated? Is it just magically constructed? [What is the algorithm that generates it?](https://stackoverflow.com/questions/71934050/how-to-generate-the-count-trailing-zeroes-table-composed-of-a-de-bruijn-sequen) – Lance Apr 20 '22 at 10:35
85

Why not use the built-in ffs? (I grabbed a man page from Linux, but it's more widely available than that.)

ffs(3) - Linux man page

Name

ffs - find first bit set in a word

Synopsis

#include <strings.h>
int ffs(int i);
#define _GNU_SOURCE
#include <string.h>
int ffsl(long int i);
int ffsll(long long int i);

Description

The ffs() function returns the position of the first (least significant) bit set in the word i. The least significant bit is position 1 and the most significant position is, e.g., 32 or 64. The functions ffsll() and ffsl() do the same but take arguments of possibly different size.

Return Value

These functions return the position of the first bit set, or 0 if no bits are set in i.

Conforming to

4.3BSD, POSIX.1-2001.

Notes

BSD systems have a prototype in <string.h>.
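
For example (a minimal usage sketch; note the 1-based convention, with 0 reserved for "no bits set"):

#include <strings.h>
#include <stdio.h>

int main(void)
{
    printf("%d\n", ffs(0x0FF0));  // prints 5: bit 4 of the question, 1-based
    printf("%d\n", ffs(0));       // prints 0: no bits set
    return 0;
}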

ephemient
51

There is an x86 assembly instruction (bsf) that will do it. :)

More optimized?!

Side Note:

Optimization at this level is inherently architecture dependent. Today's processors are so complex (in terms of branch prediction, cache misses, pipelining) that it's hard to predict which code executes faster on which architecture. Decreasing the operation count from 32 to 9 or the like might even decrease performance on some architectures. Optimized code on one architecture might result in worse code on another. I think you'd either optimize this for a specific CPU or leave it as it is and let the compiler choose what it thinks is better.
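
For illustration, a minimal GCC-style inline-assembly sketch (assuming x86; as noted in the comments below, the result is undefined when the operand is zero):

unsigned GetLowestBitPos(unsigned value)
{
    unsigned pos;
    // bsf scans from bit 0 upward and writes the index of the first set bit
    __asm__("bsf %1, %0" : "=r"(pos) : "rm"(value));
    return pos;
}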

Mehrdad Afshari
  • @dwc: I understand, but I think this clause: "Any ideas how to squeeze some cycles out of it?" makes such an answer perfectly acceptable! – Mehrdad Afshari Apr 16 '09 at 17:08
  • +1 His answer is necessarily dependent on his architecture because of endianness, so dropping down to assembly instructions is a perfectly valid answer. – Chris Lutz Apr 16 '09 at 17:10
  • Chris: how is the code endian dependent? Maybe I'm missing something, but I don't see it. – dwc Apr 16 '09 at 17:12
  • +1 Clever answer, yes it isn't C or C++ but it is the right tool for the job. – Andrew Hare Apr 16 '09 at 17:12
  • Wait, nevermind. The actual value of the integer doesn't matter here. Sorry. – Chris Lutz Apr 16 '09 at 17:17
  • Unless I am also missing something, there's nothing with endianness going on here. A right shift is a right shift, regardless of endianness. Likewise ANDing with 1. – rmeador Apr 16 '09 at 18:11
  • It's important to know that the result of those instructions is undefined when the value you're scanning is zero. – Bastien Léonard Apr 16 '09 at 20:27
  • @Bastien: They set ZF=1 if the operand is zero. – Mehrdad Afshari Apr 16 '09 at 20:31
  • Sure it's not C, but assuming you know you'll be running on an x86 system, including it is a one-liner: asm("bsf %[n],%[n];" : [n]"+r"(n)); – Chris Mar 26 '11 at 21:35
48

Most modern architectures will have some instruction for finding the position of the lowest set bit, or the highest set bit, or counting the number of leading zeroes etc.

If you have any one instruction of this class you can cheaply emulate the others.

Take a moment to work through it on paper and realise that x & (x-1) will clear the lowest set bit in x, and ( x & ~(x-1) ) will return just the lowest set bit, irrespective of architecture, word length etc. Knowing this, it is trivial to use hardware count-leading-zeroes / highest-set-bit to find the lowest set bit if there is no explicit instruction to do so.

If there is no relevant hardware support at all, the multiply-and-lookup implementation of count-leading-zeroes given here or one of the ones on the Bit Twiddling Hacks page can trivially be converted to give lowest set bit using the above identities and has the advantage of being branchless.
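
As a concrete sketch of that emulation (assuming a GNU-style __builtin_clz is the only available primitive, and a non-zero value):

unsigned GetLowestBitPos(unsigned value)
{
    unsigned lowest = value & ~(value - 1);  // isolate the lowest set bit
    return 31 - __builtin_clz(lowest);       // locate it from the top instead
}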

moonshadow
25

Here is a benchmark comparing several solutions:

My machine is an Intel i530 (2.9 GHz), running Windows 7 64-bit. I compiled with a 32-bit version of MinGW.

$ gcc --version
gcc.exe (GCC) 4.7.2

$ gcc bench.c -o bench.exe -std=c99 -Wall -O2
$ bench
Naive loop.         Time = 2.91  (Original questioner)
De Bruijn multiply. Time = 1.16  (Tykhyy)
Lookup table.       Time = 0.36  (Andrew Grant)
FFS instruction.    Time = 0.90  (ephemient)
Branch free mask.   Time = 3.48  (Dan / Jim Balter)
Double hack.        Time = 3.41  (DocMax)

$ gcc bench.c -o bench.exe -std=c99 -Wall -O2 -march=native
$ bench
Naive loop.         Time = 2.92
De Bruijn multiply. Time = 0.47
Lookup table.       Time = 0.35
FFS instruction.    Time = 0.68
Branch free mask.   Time = 3.49
Double hack.        Time = 0.92

My code:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>


#define ARRAY_SIZE 65536
#define NUM_ITERS 5000  // Number of times to process array


int find_first_bits_naive_loop(unsigned nums[ARRAY_SIZE])
{
    int total = 0; // Prevent compiler from optimizing out the code
    for (int j = 0; j < NUM_ITERS; j++) {
        for (int i = 0; i < ARRAY_SIZE; i++) {
            unsigned value = nums[i];
            if (value == 0)
                continue;
            unsigned pos = 0;
            while (!(value & 1))
            {
                value >>= 1;
                ++pos;
            }
            total += pos + 1;
        }
    }
    
    return total;
}


int find_first_bits_de_bruijn(unsigned nums[ARRAY_SIZE])
{
    static const int MultiplyDeBruijnBitPosition[32] = 
    {
       1, 2, 29, 3, 30, 15, 25, 4, 31, 23, 21, 16, 26, 18, 5, 9, 
       32, 28, 14, 24, 22, 20, 17, 8, 27, 13, 19, 7, 12, 6, 11, 10
    };
      
    int total = 0; // Prevent compiler from optimizing out the code
    for (int j = 0; j < NUM_ITERS; j++) {
        for (int i = 0; i < ARRAY_SIZE; i++) {
            unsigned int c = nums[i];
            total += MultiplyDeBruijnBitPosition[((unsigned)((c & -c) * 0x077CB531U)) >> 27];
        }
    }
    
    return total;
}


unsigned char lowestBitTable[256];
int get_lowest_set_bit(unsigned num) {
    unsigned mask = 1;
    for (int cnt = 1; cnt <= 32; cnt++, mask <<= 1) {
        if (num & mask) {
            return cnt;
        }
    }
    
    return 0;
}
int find_first_bits_lookup_table(unsigned nums[ARRAY_SIZE])
{
    int total = 0; // Prevent compiler from optimizing out the code
    for (int j = 0; j < NUM_ITERS; j++) {
        for (int i = 0; i < ARRAY_SIZE; i++) {
            unsigned int value = nums[i];
            // note that the order in which to check indices depends on whether you are
            // on a big- or little-endian machine. This is for little-endian
            unsigned char *bytes = (unsigned char *)&value;
            if (bytes[0])
                total += lowestBitTable[bytes[0]];
            else if (bytes[1])
              total += lowestBitTable[bytes[1]] + 8;
            else if (bytes[2])
              total += lowestBitTable[bytes[2]] + 16;
            else
              total += lowestBitTable[bytes[3]] + 24;
        }
    }
    
    return total;
}


int find_first_bits_ffs_instruction(unsigned nums[ARRAY_SIZE])
{
    int total = 0; // Prevent compiler from optimizing out the code
    for (int j = 0; j < NUM_ITERS; j++) {
        for (int i = 0; i < ARRAY_SIZE; i++) {
            total +=  __builtin_ffs(nums[i]);
        }
    }
    
    return total;
}


int find_first_bits_branch_free_mask(unsigned nums[ARRAY_SIZE])
{
    int total = 0; // Prevent compiler from optimizing out the code
    for (int j = 0; j < NUM_ITERS; j++) {
        for (int i = 0; i < ARRAY_SIZE; i++) {
            unsigned value = nums[i];
            int i16 = !(value & 0xffff) << 4;
            value >>= i16;

            int i8 = !(value & 0xff) << 3;
            value >>= i8;

            int i4 = !(value & 0xf) << 2;
            value >>= i4;

            int i2 = !(value & 0x3) << 1;
            value >>= i2;

            int i1 = !(value & 0x1);

            int i0 = (value >> i1) & 1? 0 : -32;

            total += i16 + i8 + i4 + i2 + i1 + i0 + 1;
        }
    }
    
    return total;
}


int find_first_bits_double_hack(unsigned nums[ARRAY_SIZE])
{
    int total = 0; // Prevent compiler from optimizing out the code
    for (int j = 0; j < NUM_ITERS; j++) {
        for (int i = 0; i < ARRAY_SIZE; i++) {
            unsigned value = nums[i];
            double d = value ^ (value - !!value); 
            total += (((int*)&d)[1]>>20)-1022; 
        }
    }
    
    return total;
}


int main() {
    unsigned nums[ARRAY_SIZE];
    for (int i = 0; i < ARRAY_SIZE; i++) {
        nums[i] = rand() + (rand() << 15);
    }
    
    for (int i = 0; i < 256; i++) {
        lowestBitTable[i] = get_lowest_set_bit(i);
    }
    
    
    clock_t start_time, end_time;
    int result;
    
    start_time = clock();
    result = find_first_bits_naive_loop(nums);
    end_time = clock();
    printf("Naive loop.         Time = %.2f, result = %d\n", 
        (end_time - start_time) / (double)(CLOCKS_PER_SEC), result);

    start_time = clock();
    result = find_first_bits_de_bruijn(nums);
    end_time = clock();
    printf("De Bruijn multiply. Time = %.2f, result = %d\n", 
        (end_time - start_time) / (double)(CLOCKS_PER_SEC), result);

    start_time = clock();
    result = find_first_bits_lookup_table(nums);
    end_time = clock();
    printf("Lookup table.       Time = %.2f, result = %d\n", 
        (end_time - start_time) / (double)(CLOCKS_PER_SEC), result);

    start_time = clock();
    result = find_first_bits_ffs_instruction(nums);
    end_time = clock();
    printf("FFS instruction.    Time = %.2f, result = %d\n", 
        (end_time - start_time) / (double)(CLOCKS_PER_SEC), result);

    start_time = clock();
    result = find_first_bits_branch_free_mask(nums);
    end_time = clock();
    printf("Branch free mask.   Time = %.2f, result = %d\n", 
        (end_time - start_time) / (double)(CLOCKS_PER_SEC), result);

    start_time = clock();
    result = find_first_bits_double_hack(nums);
    end_time = clock();
    printf("Double hack.        Time = %.2f, result = %d\n", 
        (end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
}
Andrew Bainbridge
  • The benchmarks for both de Bruijn and lookup could be misleading - sitting in a tight loop like that, after the first operation the lookup tables for each type will be pinned in L1 cache until after the last loop. This isn't likely to match real-world usage. – MattW May 05 '15 at 18:02
  • I'm surprised the FFS isn't faster than the LUT. Your `rand()` test data will have an all-zero low byte only one in 256 times, so the branch predicts very well. On Godbolt, [gcc4.7.2's `-m32 -O2 -march=corei7`](https://godbolt.org/g/jBPxlH) inner loop goes from `.L13` to `jne .L13` when the first byte is non-zero, and should have good throughput on your Nehalem CPU, and just bottleneck on its throughput of 9 fused-domain uops, at near the theoretical max of 4 per clock. I had to trim the code down to just the two functions, else the URL was too long for godbolt to shorten with goo.gl. :/ – Peter Cordes Dec 18 '16 at 04:14
  • For the inputs with a zero in the low byte, it gets the higher bytes by storing/reloading instead of shifting, because of the pointer-cast. (totally unnecessary BTW, and makes it endian-dependent unlike a shift wouldn't). Anyway, so not only is the microbenchmark unrealistic because of hot cache, it also has the branch predictors primed and tests inputs that predict very well and make the LUT do less work. Many real use-cases have a more uniform distribution of results, not inputs. – Peter Cordes Dec 18 '16 at 04:17
  • Fun fact, with -O3 gcc auto-vectorizes some of the loops! (at least gcc6.2 does, as does clang). – Peter Cordes Dec 18 '16 at 04:17
  • Your FFS loop is unfortunately slowed down by a false dependency in the BSF instruction which your crusty old compiler doesn't avoid ([but newer gcc should, same for popcnt/lzcnt/tzcnt](http://stackoverflow.com/q/25078285/224132)). `BSF` has a false dependency on its output (since the actual behaviour when input=0 is to leave the output unchanged). gcc unfortunately turns this into a loop-carried dependency by not clearing the register between loop iterations. So the loop should run at one per 5 cycles, bottlenecked on BSF(3) + CMOV(2) latency. – Peter Cordes Dec 18 '16 at 04:26
  • Your benchmark found that the LUT has almost exactly twice the throughput of the FFS method, which matches my static-analysis prediction extremely well :). Note that you are measuring throughput, not latency, because the only serial dependency in your loop is summing into the total. **Without the false dependency, `ffs()` should have had a throughput of one per clock (3 uops, 1 for BSF and 2 for CMOV, and they can run on different ports). With the same loop overhead, it's 7 ALU uops that can run (on your CPU) at 3 per clock. Overhead dominates!** Source: http://agner.org/optimize/ – Peter Cordes Dec 18 '16 at 04:32
  • @Peter: Nice analysis. I think I understand it all. The cmov has a real dependency on the result of the bsf. But you say it can fit in the same clock cycle presumably because multiple loop iterations are in flight at the same time and the cmov is running on the previous iteration's bsf result. Have I got that right? If so, why can't the false dependency in the bsf be overcome in the same way? – Andrew Bainbridge Dec 18 '16 at 23:07
  • Yes, out-of-order execution could overlap multiple iterations of the loop if `bsf ecx, [ebx+edx*4]` didn't treat `ecx` as an input that it had to wait for. (ECX was last written by the previous iteration's CMOV). But the CPU does behave that way, to implement the "leave dest unmodified if source is zero" behaviour (so it's not truly a false dep like it is for TZCNT; a data dependency is required because there's no branching + speculative execution on the assumption that the input is non-zero). We could overcome it by adding an `xor ecx,ecx` before the `bsf`, to break the dependency on ECX. – Peter Cordes Dec 18 '16 at 23:19
  • Or better by doing `mov ecx, [ebx+edx*4]` / `bsf ecx, ecx`. MOV unconditionally writes its destination with no dependency on the old value, so register-renaming makes this iteration's ECX independent of the previous iteration's architectural ECX. i.e. it breaks the dependency chain, just like an [xor-zeroing](http://stackoverflow.com/a/33668295/224132) would. – Peter Cordes Dec 18 '16 at 23:22
  • Oh BTW, newer compilers still don't break dependency chains for BSF, only for TZCNT (where the false dep on Intel (but not AMD) CPUs is a CPU performance pothole, since it's avoidable). That's silly. >.< See also http://codegolf.stackexchange.com/questions/4093/position-of-the-only-1-in-a-number-in-binary-format/103531#103531 for some analysis of compiler output for `__builtin_ctz()`. – Peter Cordes Dec 18 '16 at 23:25
  • The way the random inputs are distributed favors the naive loop and other methods that can exit early, because the average position of the 1st set bit is just **1**. This is unrealistic for many applications. I'd like to see a uniform distribution of *outputs* rather than *inputs* – Matt Timmermans Nov 25 '18 at 21:54
  • while I don't think it makes much of a difference, the double hack code looks off by 1? – StarShine Apr 14 '20 at 11:12
18

The fastest (non-intrinsic/non-assembler) solution to this is to find the lowest byte and then use that byte in a 256-entry lookup table. This gives you a worst-case performance of four conditional instructions and a best-case of 1. Not only is this the fewest instructions, but the fewest branches, which is super-important on modern hardware.

Your table (256 8-bit entries) should contain the index of the LSB for each number in the range 0-255. You check each byte of your value, find the lowest non-zero byte, then use that value to look up the real index.

This does require 256 bytes of memory, but if the speed of this function is so important then those 256 bytes are well worth it.

E.g.

typedef unsigned char byte;

byte lowestBitTable[256] = {
.... // left as an exercise for the reader to generate
};

unsigned GetLowestBitPos(unsigned value)
{
  // note that the order in which to check indices depends on whether you
  // are on a big- or little-endian machine. This is for little-endian
  byte* bytes = (byte*)&value;
  if (bytes[0])
    return lowestBitTable[bytes[0]];
  else if (bytes[1])
    return lowestBitTable[bytes[1]] + 8;
  else if (bytes[2])
    return lowestBitTable[bytes[2]] + 16;
  else
    return lowestBitTable[bytes[3]] + 24;
}
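
One possible way to generate the table at startup (a sketch; entry 0 is unused because the code above never looks up a zero byte):

void InitLowestBitTable()
{
    lowestBitTable[0] = 0;  // unused: a zero byte is never looked up
    for (int i = 1; i < 256; i++) {
        byte pos = 0;
        while (!(i & (1 << pos)))  // find the lowest set bit of i
            ++pos;
        lowestBitTable[i] = pos;
    }
}
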
Andrew Grant
  • It's actually a worst-case of three conditionals :) But yes, this is the fastest approach (and usually what people are looking for in interview questions like this). – Brian Apr 16 '09 at 17:47
  • Don't you want a +8, +16, +24 in there somewhere? – Mark Ransom Apr 16 '09 at 17:55
  • Any lookup table increases the chance of cache miss and might incur the cost of memory access which can be several orders of magnitude higher than executing instructions. – Mehrdad Afshari Apr 16 '09 at 18:08
  • True, in theory, but then you could say the same thing about I-Cache with all of the unrolled loops in other answers. If the speed of this function is so important one can assume it's called frequently so small amounts of either data are likely to be in the cache. Either way profile and see. – Andrew Grant Apr 16 '09 at 18:10
  • @Mehrdad If this function is that important there are easy ways to insure the table either stays in cache or is prefetched prior to calling the function – Brian Apr 16 '09 at 18:13
  • i would even use bit-shifts (shifting it by 8 each time). could be done entirely using registers then. using pointers, you will have to access memory. – Johannes Schaub - litb Apr 16 '09 at 21:31
  • Reasonable solution, but between the potential for the lookup table not being in cache (which can be solved, as pointed out) and the number of branches (potential branch misprediction), I much prefer the multiply-and-lookup solution (no branches, smaller lookup table). Of course, if you can use intrinsics or inline assembly, they are probably a better choice. Still, this solution isn't bad. –  Jan 14 '11 at 18:18
  • I just measured this versus the accepted answer (DeBruijn method). This is 3x faster on my Intel i3. I was surprised, but thinking about it, for uniformly distributed values, the vast majority will have a bit set in the bottom byte. The processor predicts the first branch as taken and goes straight to the lookup table, which in a benchmark situation will be in L1 cache. With random input, the whole function executes in ~4 cycles, much less than a misprediction or cache miss penalty. – Andrew Bainbridge Feb 08 '14 at 00:45
  • For reference in C#: With masking for the conditionals and shifts and byte conversion for the lookups, the De Bruijn still has a very slight edge. C# pointers give this method the edge; however, if your inputs will likely have eight or more trailing zeros, the De Bruijn is preferred. – jnm2 Jun 10 '14 at 18:45
  • @Brian can you describe in more details the 'easy ways to insure this table either stays in cache ...' – Shmil The Cat Sep 24 '15 at 19:54
  • @AndrewBainbridge: Lookup tables usually look good in microbenchmarks because the table stays hot in cache. In real programs, for something you use mixed in with memory-intensive code, that might not be the case. At best, you're adding 4 cache lines of memory pressure that have to stay in cache to be fast. Fortunately it's nearly a moot point on most ISAs, for languages that provide a portable intrinsic that can compile to x86 `bsf` or `tzcnt`, or ARM `rbit`/`clz`, leaving only a few ISAs like RISC-V without extension B without fast hardware support for this primitive operation. – Peter Cordes Oct 03 '22 at 04:15
16

Anytime you have a branch, the CPU has to guess which branch will be taken. The instruction pipe is loaded with the instructions that lead down the guessed path. If the CPU has guessed wrong then the instruction pipe gets flushed, and the other branch must be loaded.

Consider the simple while loop at the top. The guess will be to stay within the loop. It will be wrong at least once when it leaves the loop. This WILL flush the instruction pipe. This behavior is slightly better than guessing that it will leave the loop, in which case it would flush the instruction pipe on every iteration.

The number of CPU cycles that are lost varies highly from one type of processor to the next, but you can expect between 20 and 150 lost CPU cycles.

The next-worst group is where you think you're going to save a few iterations by splitting the value into smaller pieces and adding several more branches. Each of these branches adds an additional opportunity to flush the instruction pipe and cost another 20 to 150 clock cycles.

Let's consider what happens when you look up a value in a table. Chances are the value is not currently in cache, at least not the first time your function is called. This means that the CPU gets stalled while the value is loaded from cache. Again this varies from one machine to the next. The new Intel chips actually use this as an opportunity to swap threads while the current thread is waiting for the cache load to complete. This could easily be more expensive than an instruction pipe flush; however, if you are performing this operation a number of times it is likely to only occur once.

Clearly the fastest constant time solution is one which involves deterministic math. A pure and elegant solution.

My apologies if this was already covered.

Every compiler I use, except Xcode AFAIK, has compiler intrinsics for both the forward bitscan and the reverse bitscan. These will compile to a single assembly instruction on most hardware, with no cache miss, no branch misprediction, and no other programmer-generated stumbling blocks.

For Microsoft compilers use _BitScanForward & _BitScanReverse.
For GCC use __builtin_ffs, __builtin_clz, __builtin_ctz.
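
A hedged portability sketch wrapping those intrinsics (value must be non-zero for a meaningful result; the wrapper name is mine):

#ifdef _MSC_VER
#include <intrin.h>
static unsigned BitScanLow(unsigned value)
{
    unsigned long index;
    _BitScanForward(&index, value);  // index receives the lowest set bit
    return (unsigned)index;
}
#else
static unsigned BitScanLow(unsigned value)
{
    return (unsigned)__builtin_ctz(value);  // count trailing zeros
}
#endif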

Additionally, please refrain from posting an answer and potentially misleading newcomers if you are not adequately knowledgeable about the subject being discussed.

Sorry, I totally forgot to provide a solution. This is the code I use on the iPad, which has no assembly-level instruction for the task:

unsigned BitScanLow_BranchFree(unsigned value)
{
    bool bwl = (value & 0x0000ffff) == 0;
    unsigned I1 = (bwl * 15);
    value = (value >> I1) & 0x0000ffff;
    
    bool bbl = (value & 0x00ff00ff) == 0;
    unsigned I2 = (bbl * 7);
    value = (value >> I2) & 0x00ff00ff;

    bool bnl = (value & 0x0f0f0f0f) == 0;
    unsigned I3 = (bnl * 3);
    value = (value >> I3) & 0x0f0f0f0f;

    bool bsl = (value & 0x33333333) == 0;
    unsigned I4 = (bsl * 1);
    value = (value >> I4) & 0x33333333;

    unsigned result = value + I1 + I2 + I3 + I4 - 1;

    return result;
}

The thing to understand here is that it is not the compare that is expensive, but the branch that occurs after the compare. The comparison in this case is forced to a value of 0 or 1 with the == 0, and the result is used to combine the math that would have occurred on either side of the branch.

Edit:

The code above is totally broken. This code works and is still branch-free (if optimized):

int BitScanLow_BranchFree(unsigned value)
{
    int i16 = !(value & 0xffff) << 4;
    value >>= i16;

    int i8 = !(value & 0xff) << 3;
    value >>= i8;

    int i4 = !(value & 0xf) << 2;
    value >>= i4;

    int i2 = !(value & 0x3) << 1;
    value >>= i2;

    int i1 = !(value & 0x1);

    int i0 = (value >> i1) & 1? 0 : -32;

    return i16 + i8 + i4 + i2 + i1 + i0;
}

This returns -1 if given 0. If you don't care about 0 or are happy to get 31 for 0, remove the i0 calculation, saving a chunk of time.

Dan
  • I fixed it for you. Be sure to test what you post. – Jim Balter May 19 '13 at 04:15
  • How can you call it "branch-free" when it includes a ternary operator in there? – BoltBait Feb 22 '16 at 20:26
  • It's a Conditional Move. A single assembly language instruction that takes both possible values as parameters, and performs a mov operation based on the evaluation of the conditional. And thus is "Branch Free". There is no jump to another unknown or possibly incorrect address. – Dan Mar 17 '16 at 00:35
  • FWIW gcc generates branches even on `-O3` https://godbolt.org/z/gcsUHd – Qix - MONICA WAS MISTREATED May 17 '20 at 04:27
8

After 11 years we finally have countr_zero!

#include <bit>
#include <bitset>
#include <cstdint>
#include <iostream>
 
int main()
{
    for (const std::uint8_t i : { 0, 0b11111111, 0b00011100, 0b00011101 }) {
        std::cout << "countr_zero( " << std::bitset<8>(i) << " ) = "
                  << std::countr_zero(i) << '\n';
    }
}

Well done C++20

Surt
7

Inspired by this similar post that involves searching for a set bit, I offer the following:

unsigned GetLowestBitPos(unsigned value)
{
   double d = value ^ (value - !!value); 
   return (((int*)&d)[1]>>20)-1023; 
}

Pros:

  • no loops
  • no branching
  • runs in constant time
  • handles value=0 by returning an otherwise-out-of-bounds result
  • only two lines of code

Cons:

  • assumes little endianness as coded (can be fixed by changing the constants)
  • assumes that double is a real*8 IEEE float (IEEE 754)

Update: As pointed out in the comments, a union is a cleaner implementation (for C, at least) and would look like:

unsigned GetLowestBitPos(unsigned value)
{
    union {
        int i[2];
        double d;
    } temp = { .d = value ^ (value - !!value) };
    return (temp.i[1] >> 20) - 1023;
}

This assumes 32-bit ints with little-endian storage for everything (think x86 processors).
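
For C++, where union type-punning is not formally guaranteed, a memcpy-based sketch of the same trick (same little-endian IEEE 754 assumptions) might look like:

#include <cstring>

unsigned GetLowestBitPos(unsigned value)
{
    double d = value ^ (value - !!value);
    int hi;  // high 32 bits of the double: sign, exponent, top of mantissa
    std::memcpy(&hi, reinterpret_cast<const char*>(&d) + 4, sizeof hi);
    return (hi >> 20) - 1023;  // extract the biased exponent
}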

DocMax
  • Interesting - I am still scared to use doubles for bit arithmetics, but I'll keep it in mind – peterchen Apr 20 '09 at 08:02
  • Using frexp() might make it a bit more portable – aka.nice Jun 28 '12 at 23:06
  • Type-punning by pointer-casting is not safe in C or C++. Use memcpy in C++, or a union in C. (Or a union in C++ if your compiler guarantees it's safe. For example, the GNU extensions to C++ (supported by many compilers) do guarantee union type-punning is safe.) – Peter Cordes Aug 03 '17 at 11:57
  • Older gcc also makes better code with a union instead of a pointer-cast: it moves directly from an FP reg (xmm0) to rax (with movq) instead of storing/reloading. Newer gcc and clang use movq for both ways. See https://godbolt.org/g/x7JBiL for a union version. Is it intentional that you're doing an arithmetic shift by 20? Your assumptions should also list that `int` is `int32_t`, and that signed right shift is an arithmetic shift (in C++ it's implementation-defined) – Peter Cordes Aug 03 '17 at 12:11
  • BTW, clang makes some interesting asm for the integer part of that, using `cmp value, 1` / `adc value, -1` to evaluate `value - !!value`. gcc uses test/setcc/sub. – Peter Cordes Aug 03 '17 at 12:14
  • @PeterCordes, thanks for pointing out the union approach. It is not only more sane (to the extent that "sane" belongs in a discussion about using floating-point numbers for integer bit operations), it looks cleaner as well. The shift is by 20 because taking the second item in the int array skips the low 32 bits of the 52-bit mantissa (leaving only 20 more to shift). Because the source is unsigned, the double will be positive and I do not believe I have implementation concerns on the shift. – DocMax Aug 05 '17 at 00:29
  • Also BTW, Visual Studio (2013 at least) also uses the test/setcc/sub approach. I like the cmp/adc better myself. – DocMax Aug 05 '17 at 00:30
  • @DocMax: Ah yes, you're right about the integer always being non-negative, so there's no implementation-defined behaviour. I forgot you needed an unsigned -> `double` conversion. Fortunately x86 can do that by zero-extending to a 64-bit integer and using a signed 64-bit integer->double conversion. (No direct unsigned->double conversions until AVX512). Anyway, I think that means this code is portable to one's complement and sign/magnitude machines, but still dependent on FP endianness. (Apparently FP endianness doesn't have to match integer endianness). – Peter Cordes Aug 05 '17 at 01:49
  • Also I think the C standard has more to say about how the bits are laid out in an `unsigned` than in a `signed`, so that would be a good reason to use an `unsigned` type. – Peter Cordes Aug 05 '17 at 01:50
  • `cmp/adc` is at least as good on current CPUs, and better on many. (AMD, and Intel since Broadwell). Actually, the `test/setcc` has an extra xor-zeroing instruction, so even on Intel pre-Broadwell where `adc` is 2 uops, it's still probably better (unless a multi-uop instruction hits the decoders at an unfortunate spot, on CPUs without a uop cache). – Peter Cordes Aug 05 '17 at 01:53
  • Since this is a C++ question, the canonical way to type-pun is now C++20 `std::bit_cast<uint64_t>(some_double)`. Without C++20, `std::memcpy` is the only way that's guaranteed well-defined. Unions for type-punning in C++ are only supported as a GNU extension, and by MSVC. Pointer casting is only supported by MSVC, or `gcc -fno-strict-aliasing`, so you really don't want that at the first part of your answer. It's not portable for C or C++, only one mainstream C++ compiler. – Peter Cordes Oct 03 '22 at 04:26
  • Of course if you have C++20, you'd normally use `std::countr_zeros` unless you wanted to hand-hold the compiler toward this. – Peter Cordes Oct 04 '22 at 10:09
  • There's no advantage to `double` here; `float` could auto-vectorize better, and be more efficient on 32-bit machines. float32 can exactly represent any power-of-2 in a wider range than `unsigned` (or `uint64_t`). Although if you want auto-vectorization on x86, you'd want to be doing signed-`int` to `float` conversion, and masking away the sign bit (Fortunately unsigned `2^31` has the same magnitude as signed `-2^31`, so you can use packed int -> FP conversion which is only available for signed, until AVX-512. See also [Trying to write a vectorized BSF](https://stackoverflow.com/q/73930875)) – Peter Cordes Oct 04 '22 at 10:14
  • I did not expect this much traction on a 13 year old answer, but I do like that you're still catching improvements in an already unorthodox approach. – DocMax Oct 04 '22 at 17:23
5

It can be done with a worst case of less than 32 operations:

Principle: Checking for 2 or more bits is just as efficient as checking for 1 bit.

So for example there's nothing stopping you from checking which grouping it's in first, then checking each bit from smallest to biggest in that group.

So...
if you check 2 bits at a time you have in the worst case (Nbits/2) + 1 checks total.
if you check 3 bits at a time you have in the worst case (Nbits/3) + 2 checks total.
...

Optimal would be to check in groups of 4. Which would require in the worst case 11 operations instead of your 32.

The best case goes from your algorithm's 1 check to 2 checks if you use this grouping idea. But that extra check in the best case is worth it for the worst-case savings.

Note: I write it out in full instead of using a loop because it's more efficient that way.

int getLowestBitPos(unsigned int value)
{
    //Group 1: Bits 0-3
    if(value&0xf)
    {
        if(value&0x1)
            return 0;
        else if(value&0x2)
            return 1;
        else if(value&0x4)
            return 2;
        else
            return 3;
    }

    //Group 2: Bits 4-7
    if(value&0xf0)
    {
        if(value&0x10)
            return 4;
        else if(value&0x20)
            return 5;
        else if(value&0x40)
            return 6;
        else
            return 7;
    }

    //Group 3: Bits 8-11
    if(value&0xf00)
    {
        if(value&0x100)
            return 8;
        else if(value&0x200)
            return 9;
        else if(value&0x400)
            return 10;
        else
            return 11;
    }

    //Group 4: Bits 12-15
    if(value&0xf000)
    {
        if(value&0x1000)
            return 12;
        else if(value&0x2000)
            return 13;
        else if(value&0x4000)
            return 14;
        else
            return 15;
    }

    //Group 5: Bits 16-19
    if(value&0xf0000)
    {
        if(value&0x10000)
            return 16;
        else if(value&0x20000)
            return 17;
        else if(value&0x40000)
            return 18;
        else
            return 19;
    }

    //Group 6: Bits 20-23
    if(value&0xf00000)
    {
        if(value&0x100000)
            return 20;
        else if(value&0x200000)
            return 21;
        else if(value&0x400000)
            return 22;
        else
            return 23;
    }

    //Group 7: Bits 24-27
    if(value&0xf000000)
    {
        if(value&0x1000000)
            return 24;
        else if(value&0x2000000)
            return 25;
        else if(value&0x4000000)
            return 26;
        else
            return 27;
    }

    //Group 8: Bits 28-31
    if(value&0xf0000000)
    {
        if(value&0x10000000)
            return 28;
        else if(value&0x20000000)
            return 29;
        else if(value&0x40000000)
            return 30;
        else
            return 31;
    }

    return -1;
}
Brian R. Bondy
  • +1 from me. It's not the fastest but it's faster than the original, which was the point... – Andrew Grant Apr 16 '09 at 18:02
  • @onebyone.livejournal.com: Even if there was a bug in the code, the concept of grouping is the point I was trying to get across. The actual code sample doesn't matter much, and it could be made more compact but less efficient. – Brian R. Bondy Apr 16 '09 at 18:10
  • I'm just wondering if there is a real bad part of my answer, or if people didn't just like that I wrote it out in full? – Brian R. Bondy Apr 16 '09 at 18:11
  • @onebyone.livejournal.com: When you compare 2 algorithms, you should compare them as they are, not assuming that one will be magically transformed by an optimization phase. I never claimed my algorithm was "faster" either. Only that it is less operations. – Brian R. Bondy Apr 16 '09 at 19:08
  • @onebyone.livejournal.com: ... I don't need to profile the above code to know it is less operations. I can see that clearly. I never made any claims that require profiling. – Brian R. Bondy Apr 16 '09 at 19:08
  • I'm assuming that people don't like the branching on data that's likely hard to predict, which makes it slow for random inputs. Not all operations cost the same in modern CPUs, with branches being a different sort of cost from ALU work. The large number of 32-bit constants to test each single bit will also take multiple instructions each to construct on some RISCs, although ARM Thumb and AArch64 can use them directly as immediates. On other ISAs like MIPS or RISC-V, hopefully a compiler would turn those into a left shift so it can branch on the bit directly when it's the MSB (sign bit). – Peter Cordes Oct 04 '22 at 10:22
4

Why not use binary search? This will always complete after 5 operations (assuming int size of 4 bytes):

if (0x0000FFFF & value) {
    if (0x000000FF & value) {
        if (0x0000000F & value) {
            if (0x00000003 & value) {
                if (0x00000001 & value) {
                    return 1;
                } else {
                    return 2;
                }
            } else {
                if (0x00000004 & value) {
                    return 3;
                } else {
                    return 4;
                }
            }
        } else { ...
    } else { ...
} else { ...
soulmerge
  • +1 This is very similar to my answer. The best case run time is worse than my suggestion, but the worst case run time is better. – Brian R. Bondy Apr 16 '09 at 18:24
2

According to the Chess Programming BitScan page and my own measurements, subtract and xor is faster than negate and mask.

(Note that if you are going to count the trailing zeros in 0, the method as I have it returns 63 whereas the negate and mask returns 0.)

Here is a 64-bit subtract and xor:

uint64_t v;       // find the number of trailing zeros in 64-bit v
int r;            // result goes here
static const int MultiplyDeBruijnBitPosition[64] = 
{
  0, 47, 1, 56, 48, 27, 2, 60, 57, 49, 41, 37, 28, 16, 3, 61,
  54, 58, 35, 52, 50, 42, 21, 44, 38, 32, 29, 23, 17, 11, 4, 62,
  46, 55, 26, 59, 40, 36, 15, 53, 34, 51, 20, 43, 31, 22, 10, 45,
  25, 39, 14, 33, 19, 30, 9, 24, 13, 18, 8, 12, 7, 6, 5, 63
};
r = MultiplyDeBruijnBitPosition[(uint64_t)((v ^ (v-1)) * 0x03F79D71B4CB0A89U) >> 58];

For reference, here is a 64-bit version of the negate and mask method:

uint64_t v;       // find the number of trailing zeros in 64-bit v
int r;            // result goes here
static const int MultiplyDeBruijnBitPosition[64] = 
{
  0, 1, 48, 2, 57, 49, 28, 3, 61, 58, 50, 42, 38, 29, 17, 4,
  62, 55, 59, 36, 53, 51, 43, 22, 45, 39, 33, 30, 24, 18, 12, 5,
  63, 47, 56, 27, 60, 41, 37, 16, 54, 35, 52, 21, 44, 32, 23, 11,
  46, 26, 40, 15, 34, 20, 31, 10, 25, 14, 19, 9, 13, 8, 7, 6
};
r = MultiplyDeBruijnBitPosition[(uint64_t)((v & -v) * 0x03F79D71B4CB0A89U) >> 58];
jnm2
  • This `(v ^ (v-1))` works provided `v != 0`. In case of `v == 0` it returns 0xFF....FF whilst `(v & -v)` gives zero (which by the way is wrong, too, but at least it leads to a reasonable result). – CiaPan Jun 12 '14 at 11:37
  • @CiaPan: That's a good point, I'll mention it. I am guessing there is a different De Bruijn number that would resolve this by putting 0 in the 63rd index. – jnm2 Jun 12 '14 at 11:54
  • Duh, that's not where the issue is. 0 and 0x8000000000000000 both result in 0xFFFFFFFFFFFFFFFF after `v ^ (v-1)`, so there is no telling them apart. In my scenario, zero will never be input. – jnm2 Jun 12 '14 at 17:13
  • `v ^ (v-1)` can compile to fewer instructions on x86, using `lea` to copy-and-add a constant -1. vs. `v & -v` having to copy a register, negate the copy, then `and`, because x86 doesn't have a way to copy-and-negate. (Except in SIMD registers with AVX, then `vpsubq dst, src1, src2` works. But you can only efficiently vectorize the DeBruijn strategy for `uint16_t`, unless you have AVX-512VBMI for `vpermb` lookup tables with 32 or 64 entries, not just 16 [where it's useful](//stackoverflow.com/q/73930875/224132). But then you'd also have AVX512 popcnt for `vpopcntq( v ^ (v-1) )` of the lsmsk – Peter Cordes Oct 04 '22 at 10:31
2

Found this clever trick using 'magic masks' in "The Art of Computer Programming, part 4", which does it in O(log(n)) time for an n-bit number [with log(n) extra space]. Typical solutions checking for the set bit are either O(n) or need O(n) extra space for a lookup table, so this is a good compromise.

Magic masks:

m0 = (...............01010101)  
m1 = (...............00110011)
m2 = (...............00001111)  
m3 = (.......0000000011111111)
....

Key idea: number of trailing zeros in x = 1 * [(x & m0) = 0] + 2 * [(x & m1) = 0] + 4 * [(x & m2) = 0] + ...

#include <assert.h>
#include <math.h>
#include <stdint.h>

int lastSetBitPos(const uint64_t x) {
    if (x == 0)  return -1;

    //For a 64-bit number, log2(64) = 6 masks are needed
    int steps = log2(sizeof(x) * 8); assert(steps == 6);
    //magic masks
    uint64_t m[] = { 0x5555555555555555, //     .... 010101
                     0x3333333333333333, //     .....110011
                     0x0f0f0f0f0f0f0f0f, //     ...00001111
                     0x00ff00ff00ff00ff, //0000000011111111 
                     0x0000ffff0000ffff, 
                     0x00000000ffffffff };

    //Firstly extract only the last set bit
    uint64_t y = x & -x;

    int trailZeros = 0, i = 0 , factor = 0;
    while (i < steps) {
        factor = ((y & m[i]) == 0 ) ? 1 : 0;
        trailZeros += factor * pow(2,i);
        ++i;
    }
    return (trailZeros+1);
}
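
As the comments below note, the pow(2, i) is floating-point overkill; a tightened sketch of the same loop using an integer shift (with steps hard-coded to 6 for uint64_t; the _shift suffix is mine) would be:

#include <stdint.h>

int lastSetBitPos_shift(const uint64_t x) {
    if (x == 0)  return -1;

    static const uint64_t m[6] = {
        0x5555555555555555, 0x3333333333333333, 0x0f0f0f0f0f0f0f0f,
        0x00ff00ff00ff00ff, 0x0000ffff0000ffff, 0x00000000ffffffff };

    uint64_t y = x & -x;  // extract only the last set bit
    int trailZeros = 0;
    for (int i = 0; i < 6; i++)
        trailZeros += ((y & m[i]) == 0) << i;  // adds 2^i when the bit lies in the upper half
    return trailZeros + 1;  // 1-based, like the original
}
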
jayadev
  • The one issue I see with this one is that log2() is an expensive operation. Using a lookup table for that calculation would almost definitely be worth it. – mczarnek Oct 06 '20 at 15:58
  • @mczarnek: `log2` is only called on a compile-time constant. (Which should be `sizeof(x) * CHAR_BIT`, not hard-coding 8, but that still assumes no padding). Anyway, most compilers will do constant-propagation through `log2` so `steps` is effectively a compile-time constant. But it would be better to just hard-code `steps = 6` and `static_assert(numeric_limits<uint64_t>::digits == 64)` or something, or perhaps round `digits` up to the nearest power of 2 if you want to support odd-sized types. (`uint64_t` is guaranteed to be a 64-bit type with no padding, so no check needed for it.) – Peter Cordes Oct 03 '22 at 04:20
  • @mczarnek: The part that's ridiculous here is `factor * pow(2,i)` instead of `factor << i`!! I wouldn't count on compilers optimizing that for you, although some *might*. – Peter Cordes Oct 03 '22 at 04:21
2

Another method (modulus division and lookup) deserves a special mention here, from the same link provided by @anton-tykhyy. This method is very similar in performance to the DeBruijn multiply and lookup method, with a slight but important difference.

modulus division and lookup

unsigned int v;  // find the number of trailing zeros in v
int r;           // put the result in r
static const int Mod37BitPosition[] = // map a bit value mod 37 to its position
{
  32, 0, 1, 26, 2, 23, 27, 0, 3, 16, 24, 30, 28, 11, 0, 13, 4,
  7, 17, 0, 25, 22, 31, 15, 29, 10, 12, 6, 0, 21, 14, 9, 5,
  20, 8, 19, 18
};
r = Mod37BitPosition[(-v & v) % 37];

The modulus division and lookup method returns different values for v=0x00000000 and v=0xFFFFFFFF, whereas the DeBruijn multiply and lookup method returns zero for both inputs.

Test:

unsigned int n1 = 0x00000000, n2 = 0xFFFFFFFF;

MultiplyDeBruijnBitPosition[((unsigned int)((n1 & -n1) * 0x077CB531U)) >> 27]; /* returns 0 */
MultiplyDeBruijnBitPosition[((unsigned int)((n2 & -n2) * 0x077CB531U)) >> 27]; /* returns 0 */
Mod37BitPosition[(-n1 & n1) % 37]; /* returns 32 */
Mod37BitPosition[(-n2 & n2) % 37]; /* returns 0 */
RaviSharma
  • `mod` is slow. Instead, you can use the original multiply-and-lookup method and subtract `!v` from `r` to handle the edge cases. – Eitan T Jul 11 '13 at 22:41
  • @EitanT an optimizer may well transform that mod into a fast multiplication like in Hacker's Delight – phuclv Jun 12 '14 at 06:05
  • Modulo using a multiplicative inverse will involve a multiply & shift to get the quotient, then multiply and subtract to get `remainder = x - (x/y)*y`. Using the DeBruijn multiply/shift directly is faster; if you want to return a special value for `0` (like `32` or `-1`), you can do that with a ternary after the lookup. (e.g. an x86 `cmov`, still cheaper than an extra multiply and subtract.) – Peter Cordes Oct 04 '22 at 02:06
1

Yet another solution, possibly not the fastest, but it seems quite good.
At least it has no branches. ;)

uint32_t x = ...;  // 0x00000001  0x0405a0c0  0x00602000
x |= x <<  1;    // 0x00000003  0x0c0fe1c0  0x00e06000
x |= x <<  2;    // 0x0000000f  0x3c3fe7c0  0x03e1e000
x |= x <<  4;    // 0x000000ff  0xffffffc0  0x3fffe000
x |= x <<  8;    // 0x0000ffff  0xffffffc0  0xffffe000
x |= x << 16;    // 0xffffffff  0xffffffc0  0xffffe000

// now x is filled with '1' from the least significant '1' to bit 31

x = ~x;          // 0x00000000  0x0000003f  0x00001fff

// now we have 1's below the original least significant 1
// let's count them

x = (x & 0x55555555) + ((x >>  1) & 0x55555555);
                 // 0x00000000  0x0000002a  0x00001aaa

x = (x & 0x33333333) + ((x >>  2) & 0x33333333);
                 // 0x00000000  0x00000024  0x00001444

x = (x & 0x0f0f0f0f) + ((x >>  4) & 0x0f0f0f0f);
                 // 0x00000000  0x00000006  0x00000508

x = (x & 0x00ff00ff) + ((x >>  8) & 0x00ff00ff);
                 // 0x00000000  0x00000006  0x0000000d

x = (x & 0x0000ffff) + ((x >> 16) & 0x0000ffff);
                 // 0x00000000  0x00000006  0x0000000d
// least sign.bit pos. was:  0           6          13
CiaPan
1

If C++11 is available to you, the compiler can sometimes do the task for you :)

constexpr std::uint64_t lssb(const std::uint64_t value)
{
    return !value ? 0 : (value % 2 ? 1 : lssb(value >> 1) + 1);
}

Result is 1-based index.

Ruslan Garipov
  • Clever, but it compiles to catastrophically bad assembly when the input is not a compile-time constant. https://godbolt.org/g/7ajMyT. (A dumb loop over the bits with gcc, or an actual recursive function call with clang.) gcc/clang can evaluate `ffs()` at compile time, so you don't need to use this for constant-propagation to work. (You do have to avoid inline-asm, of course.) If you really do need something that works as a C++11 `constexpr`, you can still use GNU C `__builtin_ffs`. – Peter Cordes Aug 03 '17 at 12:26
1

You could check if any of the lower-order bits are set. If so, then look at the lower order of the remaining bits. E.g.:

32-bit int: check if any of the first 16 are set. If so, check if any of the first 8 are set. If so, ...

If not, check if any of the upper 16 are set.

Essentially it's binary search.
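
A minimal sketch of that idea (0-based result; value assumed non-zero):

unsigned GetLowestBitPos(unsigned value)
{
    unsigned pos = 0;
    if (!(value & 0xFFFF)) { pos += 16; value >>= 16; }  // low 16 all clear?
    if (!(value & 0xFF))   { pos += 8;  value >>= 8;  }
    if (!(value & 0xF))    { pos += 4;  value >>= 4;  }
    if (!(value & 0x3))    { pos += 2;  value >>= 2;  }
    if (!(value & 0x1))    { pos += 1; }
    return pos;
}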

Shea
1

See my answer here for how to do it with a single x86 instruction, except that to find the least significant set bit you'll want the BSF ("bit scan forward") instruction instead of BSR described there.

timday
0

This is in regard to @Anton Tykhyy's answer.

Here is my C++11 constexpr implementation doing away with casts and removing a warning on VC++17 by truncating a 64-bit result to 32 bits:

constexpr uint32_t DeBruijnSequence[32] =
{
    0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
    31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};
constexpr uint32_t ffs ( uint32_t value )
{
    return  DeBruijnSequence[ 
        (( ( value & ( -static_cast<int32_t>(value) ) ) * 0x077CB531ULL ) & 0xFFFFFFFF)
            >> 27];
}

To get around the issue of 0x1 and 0x0 both returning 0 you can do:

constexpr uint32_t ffs ( uint32_t value )
{
    return (!value) ? 32 : DeBruijnSequence[ 
        (( ( value & ( -static_cast<int32_t>(value) ) ) * 0x077CB531ULL ) & 0xFFFFFFFF)
            >> 27];
}

but if the compiler can't or won't preprocess the call it will add a couple of cycles to the calculation.

Finally, if interested, here's a list of static asserts to check that the code does what is intended to:

static_assert (ffs(0x1) == 0, "Find First Bit Set Failure.");
static_assert (ffs(0x2) == 1, "Find First Bit Set Failure.");
static_assert (ffs(0x4) == 2, "Find First Bit Set Failure.");
static_assert (ffs(0x8) == 3, "Find First Bit Set Failure.");
static_assert (ffs(0x10) == 4, "Find First Bit Set Failure.");
static_assert (ffs(0x20) == 5, "Find First Bit Set Failure.");
static_assert (ffs(0x40) == 6, "Find First Bit Set Failure.");
static_assert (ffs(0x80) == 7, "Find First Bit Set Failure.");
static_assert (ffs(0x100) == 8, "Find First Bit Set Failure.");
static_assert (ffs(0x200) == 9, "Find First Bit Set Failure.");
static_assert (ffs(0x400) == 10, "Find First Bit Set Failure.");
static_assert (ffs(0x800) == 11, "Find First Bit Set Failure.");
static_assert (ffs(0x1000) == 12, "Find First Bit Set Failure.");
static_assert (ffs(0x2000) == 13, "Find First Bit Set Failure.");
static_assert (ffs(0x4000) == 14, "Find First Bit Set Failure.");
static_assert (ffs(0x8000) == 15, "Find First Bit Set Failure.");
static_assert (ffs(0x10000) == 16, "Find First Bit Set Failure.");
static_assert (ffs(0x20000) == 17, "Find First Bit Set Failure.");
static_assert (ffs(0x40000) == 18, "Find First Bit Set Failure.");
static_assert (ffs(0x80000) == 19, "Find First Bit Set Failure.");
static_assert (ffs(0x100000) == 20, "Find First Bit Set Failure.");
static_assert (ffs(0x200000) == 21, "Find First Bit Set Failure.");
static_assert (ffs(0x400000) == 22, "Find First Bit Set Failure.");
static_assert (ffs(0x800000) == 23, "Find First Bit Set Failure.");
static_assert (ffs(0x1000000) == 24, "Find First Bit Set Failure.");
static_assert (ffs(0x2000000) == 25, "Find First Bit Set Failure.");
static_assert (ffs(0x4000000) == 26, "Find First Bit Set Failure.");
static_assert (ffs(0x8000000) == 27, "Find First Bit Set Failure.");
static_assert (ffs(0x10000000) == 28, "Find First Bit Set Failure.");
static_assert (ffs(0x20000000) == 29, "Find First Bit Set Failure.");
static_assert (ffs(0x40000000) == 30, "Find First Bit Set Failure.");
static_assert (ffs(0x80000000) == 31, "Find First Bit Set Failure.");
  • Your cast `-static_cast<int32_t>(value)` actually introduces undefined behaviour where `-value` would be safe. For `value = 0x80000000`, the cast gives you `INT32_MIN` as a *signed* integer, and then `-` on that has signed overflow UB before implicit conversion back to unsigned. Unary `-` works on unsigned values with well-defined binary wrapping semantics. – Peter Cordes Oct 04 '22 at 01:39
-1

Here is one simple alternative, even though finding logs is a bit costly.

#include <math.h>
unsigned GetLowestBitPos(unsigned n)
{
    if (n == 0) return 0;
    return log2(n & -n) + 1;   // assuming the bit index starts from 1
}
Siva Prakash
-1
unsigned GetLowestBitPos(unsigned value)
{
    if (value & 1) return 1;
    if (value & 2) return 2;
    if (value & 4) return 3;
    if (value & 8) return 4;
    if (value & 16) return 5;
    if (value & 32) return 6;
    if (value & 64) return 7;
    if (value & 128) return 8;
    if (value & 256) return 9;
    if (value & 512) return 10;
    if (value & 1024) return 11;
    if (value & 2048) return 12;
    if (value & 4096) return 13;
    if (value & 8192) return 14;
    if (value & 16384) return 15;
    if (value & 32768) return 16;
    if (value & 65536) return 17;
    if (value & 131072) return 18;
    if (value & 262144) return 19;
    if (value & 524288) return 20;
    if (value & 1048576) return 21;
    if (value & 2097152) return 22;
    if (value & 4194304) return 23;
    if (value & 8388608) return 24;
    if (value & 16777216) return 25;
    if (value & 33554432) return 26;
    if (value & 67108864) return 27;
    if (value & 134217728) return 28;
    if (value & 268435456) return 29;
    if (value & 536870912) return 30;
    if (value & 1073741824) return 31;
    if (value & 2147483648u) return 32;
    return 0; // no bits set
}

50% of all numbers will return on the first line of code.

75% of all numbers will return on the first 2 lines of code.

87% of all numbers will return in the first 3 lines of code.

94% of all numbers will return in the first 4 lines of code.

97% of all numbers will return in the first 5 lines of code.

etc.

Think about how the compiler will translate this into ASM!

This unrolled "loop" will be quicker for 97% of the test cases than most of the algorithms posted in this thread!

I think people who are complaining about how inefficient the worst-case scenario for this code is don't understand how rarely that condition will happen.

BoltBait
  • And a worst-case of 32 branch mispredictions :) –  Jan 14 '11 at 18:32
  • Couldn't this *at least* be made into a switch...? – Steven Lu Dec 09 '12 at 00:45
  • "Couldn't this at least be made into a switch...?" Did you try to do that before implying it's possible? Since when you can do calculations right on the cases of a switch? It's a lookup table, not a class. – j riv Jul 05 '18 at 11:21
  • fails on 0: returns 31 instead of 0 – johan d Feb 25 '22 at 20:49
  • @johan-d According to the specifications (question), zero is a special case that will be handled elsewhere. – BoltBait Feb 28 '22 at 18:12
  • Fair enough. I think that assert would be a nice addition to your code, as people will come here to copy-paste an answer and could get bitten. As-is, the function doesn't do what its name implies. – johan d Mar 29 '22 at 21:01
  • @johand I fixed the code to address your concerns. Enjoy! – BoltBait Mar 29 '22 at 21:46
  • @jriv: You could turn it into a `switch( value & -value )` with ... `case 32:` / `case 64:` ... Not that that would be better; in fact probably worse by the time the compiler is making asm comparing for exact equality to large constants instead of just testing single bits. – Peter Cordes Oct 04 '22 at 10:34
  • This answer might be ok if you expect your inputs to be uniformly distributed integers, but **often bit-scan is useful in cases where the *result* is somewhat close to uniformly distributed.** (There are some use-cases where very small results are the norm). But even one or two unpredictable branches is pretty costly (the best case for this strategy other than constant results). Fortunately it's a moot point now that std::countr_zero exists, giving portable access to an efficient primitive on machines that have one (which is most current mainstream CPUs except RISC-V without extension B). – Peter Cordes Oct 04 '22 at 10:45
-4

Recently I saw that Singapore's premier posted a program he wrote on Facebook; there is one line worth mentioning.

The logic is simply "value & -value". Suppose you have 0x0FF0; then 0x0FF0 & (0xF00F + 1) equals 0x0010, which means the lowest 1 is in the 4th bit. :)
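
A minimal sketch of that one-liner in context (the loop at the end merely converts the isolated bit back to an index, to keep the example self-contained):

unsigned GetLowestBitPos(unsigned value)
{
    unsigned lowest = value & -value;  // e.g. 0x0FF0 & 0xFFFFF010 == 0x0010
    unsigned pos = 0;
    while (lowest >>= 1)               // 0x0010 == 1 << 4, so pos ends at 4
        ++pos;
    return pos;
}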

sean
-9

If you have the resources, you can sacrifice memory in order to improve the speed:

static const unsigned bitPositions[MAX_INT] = { 0, 0, 1, 0, 2, /* ... */ };

unsigned GetLowestBitPos(unsigned value)
{
    assert(value != 0); // handled separately
    return bitPositions[value];
}

Note: This table would consume at least 4 GB (16 GB if we leave the return type as unsigned). This is an example of trading one limited resource (RAM) for another (execution speed).

If your function needs to remain portable and run as fast as possible at any cost, this would be the way to go. In most real-world applications, a 4GB table is unrealistic.

e.James
  • The bitpositions array need only be the same size as the max possible value. So as long as you restrict the input, you're fine :) – Stefan Kendall Apr 16 '09 at 17:23
  • The range of the input is already specified by the parameter type - 'unsigned' is a 32-bit value so no, you are not fine. – Brian Apr 16 '09 at 17:33
  • @Andrew Grant: Yes, or even more, if the size of unsigned uses more bits on a system. The OP asked how to reduce cycles, and this will do it. Like any design decision, it involves some tradeoffs. In some systems, a few cycles saved is worth 4GB of RAM. – e.James Apr 16 '09 at 17:47
  • umm... does your mythical system and OS have a concept of paged memory? How much time is that going to cost? – Mikeage Apr 16 '09 at 17:56
  • @Mikeage: Good point. Memory paging is definitely a valid concern. – e.James Apr 16 '09 at 18:00
  • This is a non-answer. Your solution is completely unrealistic in ALL real-world applications and calling it a "tradeoff" is disingenuous. Your mythical system that has 16GB of ram to devote to a single function just does not exist. You'd have been as well answering "use a quantum computer". – Brian Apr 16 '09 at 18:08
  • Sacrifice memory for speed? A 4GB+ lookup table will never ever fit in cache on any currently existing machine, so I'd imagine this is probably slower than almost all other answers here. –  Jan 14 '11 at 18:34
  • Argh. This horrible answer keeps on haunting me `:)` @Dan: You are correct about memory caching. See Mikeage's comment above. – e.James Jan 14 '11 at 23:28
  • Heh, sorry for making it continue to haunt you :) Anyway, I meant the processor cache (L1, L2, L3), not paging (unless I'm not seeing the comment you are referring to?) –  Jan 17 '11 at 03:57
  • @Dan: No worries at all. I should have realized that you were talking about processor cache. Please disregard! – e.James Jan 17 '11 at 04:35
  • Why assert(value != 0)? This answer is already absurd enough; and it handles value == 0. Indeed, GetLowestBitPos should probably be a macro. I would love to see someone benchmark this. The .o file is probably enormous. – Jeffrey Aguilera Jun 16 '14 at 22:39
  • Sorry to haunt you more, but I am amused by all the performance discussion, while ignoring that you just specified a source file that will take around ten gigs. I would love to see how the compiler reacts. That 4GB table is also going to be included in the generated executable. I'm tempted to suggest an edit to use char for the table data type with the explanation that it only needs one gig that way. :-P – Katie Jan 21 '15 at 15:42
  • Leaving aside that this is not useful for anything, the description isn't even right. The function return type should be `int` or `unsigned`, but the *table* should be `unsigned char`, so the function involves a zero-extending load. Unless you're trying to trade even more space for "speed" by allowing it to be a simpler load, so on x86 it can maybe fold that load into a memory source operand, e.g. for `add esi, [ctz_lut + rdi*4]`, instead of a separate `movzx eax, byte [ctz_lut + rdi]` / whatever with the extended value. – Peter Cordes Oct 03 '22 at 04:36
  • Also, what's `MAX_INT`? If you mean `INT_MAX` from `limits.h`, that's 0x7fffffff on typical 32-bit systems, the largest non-negative `int`. As an array size, that means `GetLowestBitPos(MAX_INT)` will access out of bounds. And so will all negative integers. You actually need `static uint_least8_t lut[UINT_MAX + (size_t)1]`, or in C++ make it a `static const std::array = init_function()` (to not put the storage in `.rodata` in the executable, and not need an initializer in the source file, as earlier comments point out.) – Peter Cordes Oct 03 '22 at 04:41
  • Since you've realized this is a bad idea, you could rewrite this answer to say something like "you might think a lookup table would be fast, but it's actually an example of what *not* to do" Then summarize the problems from comments. That or delete the answer; the only value here is for novices that don't realize how bad this. It's not a good idea even for 16-bit integers. – Peter Cordes Oct 03 '22 at 04:44