39

Just say I have a value of type uint64_t seen as sequence of octets (1 octet = 8-bit). The uint64_t value is known containing only one set bit at a MSB position. Thus, the uint64_t value can be in one of the following binary representations:

00000000 00000000 00000000 00000000 00000000 00000000 00000000 10000000  pos = 7
00000000 00000000 00000000 00000000 00000000 00000000 10000000 00000000  pos = 15
00000000 00000000 00000000 00000000 00000000 10000000 00000000 00000000  pos = 23
00000000 00000000 00000000 00000000 10000000 00000000 00000000 00000000  pos = 31
00000000 00000000 00000000 10000000 00000000 00000000 00000000 00000000  pos = 39
00000000 00000000 10000000 00000000 00000000 00000000 00000000 00000000  pos = 47
00000000 10000000 00000000 00000000 00000000 00000000 00000000 00000000  pos = 55
10000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000  pos = 63

I need a fast function that returns the set bit position, but returns 0 if there is no bit that is set.

If possible, I want it without neither looping nor branching.

Aeoliyan
  • 835
  • 8
  • 13
  • Are you willing to use compiler-intrinsics? – Mysticial Sep 01 '15 at 19:04
  • @Mysticial - I prefer to not use the compiler-intrinsics so that I can port the solution to another programming language if needed later. – Aeoliyan Sep 01 '15 at 19:07
  • 2
    A couple of algorithms are described here: https://graphics.stanford.edu/~seander/bithacks.html#IntegerLog – khachik Sep 01 '15 at 19:10
  • You are looking for a perfect hash function that maps your input to 8 buckets. – Captain Giraffe Sep 01 '15 at 19:10
  • @Mysticial - However, just to show it can solved using compiler-intrinsics is OK for me. – Aeoliyan Sep 01 '15 at 19:10
  • @Captain - Actually, the solution is intended to be used in a function that searches for a character. The function reads 8 bytes at a time in x64. – Aeoliyan Sep 01 '15 at 19:14
  • 5
    On x86, there are instructions `BSF` (Bit Scan Forward) and `BSR` (Bit Scan Reverse) that will give you the position of the first set bit from either end. (In your case, it doesn't matter.) Most major compilers have intrinsics for it. So it will depend on which compiler you're using. – Mysticial Sep 01 '15 at 19:15
  • @Mysticial - I program in C, Delphi, Java, and PHP. For example, Delphi does not have _built-in functions_ such as in C. – Aeoliyan Sep 01 '15 at 19:19
  • 2
    Is this a C or a C++ question? Please choose one. You might also want to specify an operating system. – fuz Sep 01 '15 at 19:28
  • See also [Bit twiddling hacks](https://graphics.stanford.edu/~seander/bithacks.html) – Thomas Matthews Sep 01 '15 at 19:32
  • @FUZxxl - I think the question will fit both C and C++. Is it prohibited to do so? I'm expecting a portable solution. – Aeoliyan Sep 01 '15 at 19:36
  • @Thomas Matthews - I have already read the _Bit Twiddling Hacks_ page. I realize the log2 solution can be used, but for a value that is known containing only one _set bit_ at a MSB position, I guess there is should be a simpler and faster way to solve my problem. I just guess :-). – Aeoliyan Sep 01 '15 at 19:40
  • @Aeoliyan C and C++ are different languages. Please pick the one you want an answer for. Unless your question is specifically about the differences between two languages, you should *not* ask for answers in multiple languages. – fuz Sep 01 '15 at 19:41
  • 1
    @Aeoliyan A call to `ffs()` compiles to four instructions on amd64 Linux, up to three of which are not emitted when you use `ffs()` in certain ways. I don't think there is a faster approach. – fuz Sep 01 '15 at 19:42
  • @FUZxxl - OK, the tag is updated as your suggestion. – Aeoliyan Sep 01 '15 at 19:42
  • @Aeoliyan Thank you for your cooperation. – fuz Sep 01 '15 at 19:43
  • 4
    @FUZxxl Yes C and C++ are different languages. But there are things where it is possible to produce an answer that works for both languages. If that's what the OP wants then leave it be. – Mysticial Sep 01 '15 at 19:52
  • 5
    @Mysticial More often than not OP *does not* want this and instead just hopes to attract a larger crowd by specifying both C and C++ in the hope that the resulting answer can be twisted into something useful for whatever language OP uses. This is known as *tag spamming.* Quite a few times I prepared a detailed answer to a “C or C++” question just to be told that OP uses C++ and my answer doesn't work because it's C specific. I'm sick of that shit. – fuz Sep 01 '15 at 19:55
  • Noting that the position of a single bit in an unsigned integer is directly related to log in base 2, you might want to look at a few of the options presented at http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogLookup. The techniques are mainly presented for 32 bit values, but can be readily extended for 64 bit. – Peter Sep 08 '15 at 12:04
  • As you described, `f(0) = f(1) = 0`. Are you sure this is the behavior you want? – KevinZ Sep 08 '15 at 18:55

9 Answers9

45

Multiply the value by a carefully designed 64-bit constant, then mask off the upper 4 bits. For any CPU with fast 64-bit multiplication, this is probably as optimal as you can get.

int field_set(uint64_t input) {
    uint64_t field = input * 0x20406080a0c0e1ULL;
    return (field >> 60) & 15;
}

// field_set(0x0000000000000000ULL) = 0
// field_set(0x0000000000000080ULL) = 1
// field_set(0x0000000000008000ULL) = 2
// field_set(0x0000000000800000ULL) = 3
// field_set(0x0000000080000000ULL) = 4
// field_set(0x0000008000000000ULL) = 5
// field_set(0x0000800000000000ULL) = 6
// field_set(0x0080000000000000ULL) = 7
// field_set(0x8000000000000000ULL) = 8

clang implements this in three x86_64 instructions, not counting the frame setup and cleanup:

_field_set:
    push   %rbp
    mov    %rsp,%rbp
    movabs $0x20406080a0c0e1,%rax
    imul   %rdi,%rax
    shr    $0x3c,%rax
    pop    %rbp
    retq

Note that the results for any other input will be pretty much random. (So don't do that.)

I don't think there's any feasible way to extend this method to return values in the 7..63 range directly (the structure of the constant doesn't permit it), but you can convert the results to that range by multiplying the result by 7.


With regard to how this constant was designed: I started with the following observations:

  • Unsigned multiplication is a fast operation on most CPUs, and can have useful effects. We should use it. :)
  • Multiplying anything by zero results in zero. Since this matches with the desired result for a no-bits-set input, we're doing well so far.
  • Multiplying anything by 1ULL<<63 (i.e, your "pos=63" value) can only possibly result in the same value, or zero. (It cannot possibly have any lower bits set, and there are no higher bits to change.) Therefore, we must find some way for this value to be treated as the correct result.
  • A convenient way of making this value be its own correct result is by right-shifting it by 60 bits. This shifts it down to "8", which is a convenient enough representation. We can proceed to encode the other outputs as 1 through 7.
  • Multiplying our constant by each of the other bit fields is equivalent to left-shifting it by a number of bits equal to its "position". The right-shift by 60 bits causes only the 4 bits to the left of a given position to appear in the result. Thus, we can create all of the cases except for one as follows:

     uint64_t constant = (
          1ULL << (60 - 7)
        | 2ULL << (60 - 15)
        | 3ULL << (60 - 23)
        | 4ULL << (60 - 31)
        | 5ULL << (60 - 39)
        | 6ULL << (60 - 47)
        | 7ULL << (60 - 55)
     );
    

So far, the constant is 0x20406080a0c0e0ULL. However, this doesn't give the right result for pos=63; this constant is even, so multiplying it by that input gives zero. We must set the lowest bit (i.e, constant |= 1ULL) to get that case to work, giving us the final value of 0x20406080a0c0e1ULL.

Note that the construction above can be modified to encode the results differently. However, the output of 8 is fixed as described above, and all other output must fit into 4 bits (i.e, 0 to 15).

  • 3
    ExcellentI but why it works? How do you get the `0x20406080a0c0e1ULL`? – Aeoliyan Sep 02 '15 at 15:15
  • 1
    I'm amazed. Now I know that your solution come from a deep thought. I hope I can adapt the technique to my other problems. Thank you very much. – Aeoliyan Sep 03 '15 at 04:23
  • 1
    @duskwuff Your approach can be extended in a straightforward fashion to directly return the bit positions of 0, 7, ..., 63. See the late update in my answer. – njuffa Sep 03 '15 at 06:53
  • 1) The last `1` seems derivable from `| 8 << (60 - 63)` or `| 8 >> 3`. 2) The `& 15` is not needed. – chux - Reinstate Monica Sep 09 '15 at 23:10
  • 2
    @chux Correct. (And indeed, the `&` isn't present in the compiled version.) I've primarily left it in for clarity. –  Sep 09 '15 at 23:47
  • Both this answer and @njuffa 's are outstanding. While njuffa's answer gave the final answer that OP was expecting, it was based on duskwuff's idea of constructing the proper magic number to reduce the operation to just a multiplication and a shift (or two). +200 Well done! – dbush Sep 11 '15 at 12:15
  • On Intel CPUs, `bsf %rdi,%rdi` / `sub $7, %edi` will be faster, avoiding the `movabs` of the 64-bit [De Bruijn constant](https://www.chessprogramming.org/index.php?title=BitScan). ( `bsf` and `imul` have the same performance on Intel, both being 1 uop for port 0, with 3 cycle latency. https://uops.info/). BSF is slower on AMD, but encoding it as `rep bsf` lets it run as `tzcnt` on CPUs that know that instruction, which is fast on AMD. Anyway, if you use `return __builtin_ctzll(x) - 7` , the compiler should take care of those micro-optimization details for you. – Peter Cordes Oct 20 '22 at 23:48
  • Or wait, this wants `7` for `1<<7`, so it is just `__builtin_ctzll` for a few values not including zero. So `bsf` / `tzcnt` will be faster. So use that or C++20 `std::countr_zeros` – Peter Cordes Oct 20 '22 at 23:50
  • Re: the magic constant: [Position of least significant bit that is set](https://stackoverflow.com/q/757059) and other SO Q&As about De Bruijn sequences. – Peter Cordes Oct 21 '22 at 00:02
17

Here is a portable solution, that will, however, be slower than solutions taking advantage of specialized instructions such as clz (count leading zeros). I added comments at each step of the algorithm that explain how it works.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* return position of set bit, if exactly one of bits n*8-1 is set; n in [1,8]
   return 0 if no bit is set
*/
int bit_pos (uint64_t a)
{
    uint64_t t, c;
    t = a - 1; // create mask
    c = t >> 63; // correction for zero inputs
    t = t + c; // apply zero correction if necessary
    t = t & 0x0101010101010101ULL; // mark each byte covered by mask
    t = t * 0x0101010101010101ULL; // sum the byte markers in uppermost byte
    t = (t >> 53) - 1; // retrieve count and diminish by 1 for bit position
    t = t + c; // apply zero correction if necessary
    return (int)t;
}

int main (void)
{
    int i;
    uint64_t a;
    a = 0;
    printf ("a=%016llx   bit_pos=%2d   reference_pos=%2d\n", a, bit_pos(a), 0);
    for (i = 7; i < 64; i += 8) {
        a = (1ULL << i);
        printf ("a=%016llx   bit_pos=%2d   reference_pos=%2d\n", 
                a, bit_pos(a), i);
    }
    return EXIT_SUCCESS;
}

The output of this code should look like this:

a=0000000000000000   bit_pos= 0   reference_pos= 0
a=0000000000000080   bit_pos= 7   reference_pos= 7
a=0000000000008000   bit_pos=15   reference_pos=15
a=0000000000800000   bit_pos=23   reference_pos=23
a=0000000080000000   bit_pos=31   reference_pos=31
a=0000008000000000   bit_pos=39   reference_pos=39
a=0000800000000000   bit_pos=47   reference_pos=47
a=0080000000000000   bit_pos=55   reference_pos=55
a=8000000000000000   bit_pos=63   reference_pos=63

On an x86_64 platform, my compiler translates bit_pos() into this machine code:

bit_pos PROC 
        lea       r8, QWORD PTR [-1+rcx]
        shr       r8, 63
        mov       r9, 0101010101010101H
        lea       rdx, QWORD PTR [-1+r8+rcx]
        and       rdx, r9
        imul      r9, rdx
        shr       r9, 53
        lea       rax, QWORD PTR [-1+r8+r9]
        ret

[Later update]

The answer by duskwuff made it clear to me that my original thinking was unnecessarily convoluted. In fact, using duskwuff's approach, the desired functionality can be expressed much more concisely as follows:

/* return position of set bit, if exactly one of bits n*8-1 is set; n in [1,8]
   return 0 if no bit is set
*/
int bit_pos (uint64_t a)
{
    const uint64_t magic_multiplier = 
         (( 7ULL << 56) | (15ULL << 48) | (23ULL << 40) | (31ULL << 32) |
          (39ULL << 24) | (47ULL << 16) | (55ULL <<  8) | (63ULL <<  0));
    return (int)(((a >> 7) * magic_multiplier) >> 56);
}

Any reasonable compiler will precompute the magic multiplier, which is 0x070f171f272f373fULL. The code emitted for an x86_64 target shrinks to

bit_pos PROC 
        mov       rax, 070f171f272f373fH
        shr       rcx, 7
        imul      rax, rcx
        shr       rax, 56
        ret
Community
  • 1
  • 1
njuffa
  • 23,970
  • 4
  • 78
  • 130
  • 1
    Damn, yours is excellent too :-). If I can accept two answers, I'll do it. Many many thanks. – Aeoliyan Sep 03 '15 at 07:08
  • 1
    The credit for this specific approach certainly belongs to duskwuff, so I would suggest leaving things as they are. In more general terms, this is an example of an approach called "in-register table lookup". – njuffa Sep 03 '15 at 07:25
  • Very nice. I like your extension, that's very clever! –  Sep 03 '15 at 21:19
15

If you can use POSIX, use the ffs() function from strings.h (not string.h!). It returns the position of the least significant bit set (one indexed) or a zero if the argument is zero. On most implementations, a call to ffs() is inlined and compiled into the corresponding machine instruction, like bsf on x86. The glibc also has ffsll() for long long arguments which should be even more suitable for your problem if available.

fuz
  • 88,405
  • 25
  • 200
  • 352
9

The value mod 0x8C yields a unique value for each of the cases.

This value mod 0x11 is still unique.

The second value in the table is the resulting mod 0x11.

128 9
32768   5
8388608 10
2147483648  0
549755813888    14
140737488355328 2
36028797018963968   4
9223372036854775808     15

So a simple lookup table will suffice.

int find_bit(uint64_t bit){ 
  int lookup[] = { the seventeen values };
  return lookup[ (bit % 0x8C) % 0x11];
}

No branching, no compiler tricks.

For completeness, the array is

{ 31, 0, 47, 15, 55, 0, 0, 7, 23, 0, 0, 0, 39, 63, 0, 0}
Captain Giraffe
  • 14,407
  • 6
  • 39
  • 67
  • 3
    Cool idea! Yet I believe taking modulo is slower than calling `ffs()` as modulo is an expensive operation. – fuz Sep 01 '15 at 19:46
  • @FUZxxl Thanks. ffs might well be fastes.Testing will tell=) OP was asking for a portable solution though, so i twiddled with the values a bit and found this. – Captain Giraffe Sep 01 '15 at 19:48
  • 1
    @FUZxxl I just looked at the bfs x86 instruction for a 486 (archeological data) and it is significantly slower than DIV (about 3-4 times). Testing will tell what approach is fastest. – Captain Giraffe Sep 01 '15 at 19:53
  • True. I believe `bsf` runs in a single cycle nowadays whereas a modulo still takes around 10 cycles in latency, but I need to benchmark to be sure. – fuz Sep 01 '15 at 19:56
  • There was a huge thread that highly optimised code using bsf on some 64 bit Intel systems can be bizarrely slow. – gnasher729 Sep 01 '15 at 22:06
  • @gnasher729 I remember that being about a false dependency on the destination register of `bsf`. I believe the gcc people work around this in their register allocator somehow. – fuz Sep 02 '15 at 12:20
7

If you want an algorithm for the job rather than a built-in, this will do it. It yields the bit number of the most significant 1 bit, even if more than one bit is set. It narrows down the position by iteratively dividing the bit range under consideration into halves, testing whether there are any bits set in the upper half, taking that half as the new bit range if so, and otherwise taking the lower half as the new bit range.

#define TRY_WINDOW(bits, n, msb) do { \
    uint64_t t = n >> bits;           \
    if (t) {                          \
        msb += bits;                  \
        n = t;                        \
    }                                 \
} while (0)

int msb(uint64_t n) {
    int msb = 0;

    TRY_WINDOW(32, n, msb);
    TRY_WINDOW(16, n, msb);
    TRY_WINDOW( 8, n, msb);
    TRY_WINDOW( 4, n, msb);
    TRY_WINDOW( 2, n, msb);
    TRY_WINDOW( 1, n, msb);

    return msb;
}
John Bollinger
  • 160,171
  • 8
  • 81
  • 157
4

C++ tag was removed, but here is a portable C++ answer nonetheless since you can compile it with C++ and use an extern C interface:

If you have a power of 2 and you subtract one you end up with a binary number with the number of set bits equal to the position

A way to count the number of set bits (binary 1s) is wrapped, presumably most efficiently by each implementation of the stl, in std::bitset member function count

Note that your specification has 0 returned for both 0 or 1, so I added as_specified_pos to meet this requirement. Personally I would just leave it return the natural value of 64 when passed 0 to be able to differentiate, and for the speed.

The following code should be extremely portable and most likely optimized per platform by compiler vendors:

#include <bitset>

uint64_t pos(uint64_t val)
{
   return std::bitset<64>(val-1).count();
}

uint64_t as_specified_pos(uint64_t val)
{
    return (val) ? pos(val) : 0;
}

On Linux with g++ I get the following disassembled code:

0000000000000000 <pos(unsigned long)>:
   0:   48 8d 47 ff             lea    -0x1(%rdi),%rax
   4:   f3 48 0f b8 c0          popcnt %rax,%rax
   9:   c3                      retq
   a:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)

0000000000000010 <as_specified_pos(unsigned long)>:
  10:   31 c0                   xor    %eax,%eax
  12:   48 85 ff                test   %rdi,%rdi
  15:   74 09                   je     20 <as_specified_pos(unsigned long)+0x10>
  17:   48 8d 47 ff             lea    -0x1(%rdi),%rax
  1b:   f3 48 0f b8 c0          popcnt %rax,%rax
  20:   f3 c3                   repz retq
Glenn Teitelbaum
  • 10,108
  • 3
  • 36
  • 80
3

Modern hardware has specialized instructions for that (LZCNT, TZCNT on Intel processors).

Most compilers have intrinsics to easily generate them. See the following wikipedia page.

BrunoLevy
  • 2,495
  • 17
  • 30
3
00000000 00000000 00000000 00000000 00000000 00000000 00000000 10000000  pos = 7

..., but returns 0 if there is no bit that is set.

This will return the same if the first bit or no bit is set; however, on x86_64, that is exactly what bsrq does:

int bsrq_x86_64(uint64_t x){
  int ret;
  asm("bsrq %0, %1":"=r"(ret):"r"(x));
  return ret;
}

However; if the first bit is set it will also return 0; here is a method that will run in constant time (no looping or branching) and returns -1 when no bits are set (to distinguish from when the first bit is set).

int find_bit(unsigned long long x){
  int ret=0,
  cmp = (x>(1LL<<31))<<5; //32 if true else 0
  ret += cmp;
  x  >>= cmp;
  cmp = (x>(1<<15))<<4; //16 if true else 0
  ret += cmp;
  x  >>= cmp;
  cmp = (x>(1<<7))<<3; //8
  ret += cmp;
  x  >>= cmp;
  cmp = (x>(1<<3))<<2; //4
  ret += cmp;
  x  >>= cmp;
  cmp = (x>(1<<1))<<1; //2
  ret += cmp;
  x  >>= cmp;
  cmp = (x>1);
  ret += cmp;
  x  >>= cmp;
  ret += x;
  return ret-1;
}

Technically this just returns the position of the most significant set bit. Depending on the type of float used, this can be done in fewer operations using the fast inverse square or other bit twiddling hacks

BTW,If don't mind using compiler builtins, you can just do:

__builtin_popcountll(n-1) or __builtin_ctzll(n) or __builtin_ffsll(n)-1

technosaurus
  • 7,676
  • 1
  • 30
  • 52
  • `bsr` is Bit Scan Reverse, scanning from the top of the register. But unlike `lzcnt`, it produces the bit-index of the bit it finds, so it doesn't matter whether you BSR or BSF when you know there's only one set bit. If GNU C inline asm works in a project, so will the better choice of `__builtin_ctzll`. GCC will compile that to `rep bsf` so it can run efficiently on AMD CPUs (where `tzcnt` is fast, `bsf` is a bit slow.) Since `0` is not one of the inputs this problem needs, it's fine to use `__builtin_ctzll` even though it produces an undefined result for input 0. – Peter Cordes Oct 20 '22 at 23:55
  • Or wait, it *does* want `0` for an input value of `0`. So `BSR(x|1)` will work, to find the bit index of the highest set bit. But plain `__builtin_ctzll` won't. `_tzcnt_u64` would give you a `64` in that case. – Peter Cordes Oct 20 '22 at 23:59
-1

A simple lookup solution. m=67 is the smallest integer for which the values (1<<k)%m are all distinct, for k<m. With (python transposable code):

lut = [-1]*67
for i in range(0,64) : lut[(1<<i)%67] = i

Then lut[a%67] gives k if a = 1<<k. -1 values are unused.

B. M.
  • 18,243
  • 2
  • 35
  • 54
  • The question is about a C-specific implementation optimization, not about the algorithm. – Bogdan Alexandru Feb 23 '18 at 14:30
  • The OP wants a fast algorithm. this way you can conclude with a % operation, and a lut access, better an less tricky in my mind than *, >>, and & . – B. M. Feb 23 '18 at 14:58