2

Suppose I have a 32- or 64-bit unsigned integer.

What is the fastest way to find the index i of the leftmost bit such that the number of 0s in the leftmost i bits equals the number of 1s in the leftmost i bits? I was thinking of some bit tricks like the ones mentioned here.

I am interested in recent x86_64 processors. This might be relevant as some processors support instructions such as POPCNT (counts the number of 1s) or LZCNT (counts the number of leading 0s).

If it helps, it is possible to assume that the first bit always has a certain value.

Example (with 16 bits): If the integer is

1110010100110110b 
         ^ 
         i

then i=10 and it corresponds to the marked position.

A possible (slow) implementation for 16-bit integers could be:

mask = 1000000000000000b
pos = 0
count = 0
do {
    if (x & mask)
        count++;
    else
        count--;

    pos++;
    x <<= 1;
} while (count);

return pos;

Edit: fixed a bug in the code as per @njuffa's comment.

Steven
  • 209
  • 2
  • 10
  • And `i = 0` would be banned, right? Otherwise it's a bit boring – harold Dec 02 '16 at 13:30
  • 1
    Yes indeed :) i must be in the range 1,...,size where size is the number of bits of the integer. – Steven Dec 02 '16 at 13:31
  • I am not clear on the specification. Could you provide a simple (slow) reference implementation of what you have in mind? It seems to me that such a leftmost bit position can't always be found, a simple 32-bit example being 0xFFFFFFFE (or is the result to be 32 in that case?) – njuffa Dec 02 '16 at 14:12
  • 1
    You can assume that such a position always exists. Or, in other words, any result is fine if such a position does not exist. Reference implementation in the original post. – Steven Dec 02 '16 at 14:16
  • 1
    @Steven Based on your textual description, the reference code should not return `count`, but rather the number of bits traversed until `count == 0`. – njuffa Dec 02 '16 at 14:50
  • @njuffa You are right! (I wrote the code without testing it) – Steven Dec 02 '16 at 15:02
  • Are the input values uniformly distributed? If so, almost 3/4 of inputs will balance the bits in the first byte, half after the first 2 bits. Perhaps use a small lookup table for the first byte and just fall through and use a calculation for the rest? – samgak Dec 02 '16 at 20:38
  • Unfortunately they are not. – Steven Dec 02 '16 at 21:07
  • So, what is the distribution? That might make a big difference to what is the most efficient algorithm. – samgak Dec 04 '16 at 07:24
  • @samgak I don't know what the input distribution is, as the bits encode some other structure. However, I would expect the result to be in the first 2 bytes. – Steven Dec 05 '16 at 10:52

3 Answers

3

I don't have any bit tricks for this, but I do have a SIMD trick.

First a few observations,

  • Interpreting 0 as -1, this problem becomes "find the first i so that the first i bits sum to 0".
  • 0 is even, but all the bits have odd values (±1) under this interpretation, which gives the insight that i must be even and that this problem can be analyzed in blocks of 2 bits.
  • 01 and 10 don't change the balance (a scalar sketch of these observations follows below).
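
For illustration, a minimal scalar restatement of these observations (my sketch, not the SIMD trick itself): walk the 2-bit groups from the top, adding +1 for 11, -1 for 00, and 0 for 01/10, and stop at the first zero running sum.

#include <stdint.h>

// scalar sketch: the running sum over 2-bit groups is half the bit balance,
// so the first zero sum marks the (always even) answer i
int balance_point_scalar (uint32_t x)
{
    int sum = 0;
    for (int g = 0; g < 16; g++) {
        unsigned duo = (x >> (30 - 2 * g)) & 3;  // g-th 2-bit group from the top
        sum += (duo == 3) - (duo == 0);
        if (sum == 0)
            return 2 * (g + 1);
    }
    return 0;  // no balance point exists
}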

After spreading the groups of 2 out to bytes (none of the following is tested),

// optionally use AVX2 _mm_srlv_epi32 instead of ugly variable set
__m128i spread = _mm_shuffle_epi8(_mm_setr_epi32(x, x >> 2, x >> 4, x >> 6),
                   _mm_setr_epi8(0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15));
spread = _mm_and_si128(spread, _mm_set1_epi8(3));

Replace 00 by -1, 11 by 1, and 01 and 10 by 0:

__m128i r = _mm_shuffle_epi8(_mm_setr_epi8(-1, 0, 0, 1,  0,0,0,0,0,0,0,0,0,0,0,0),
                             spread);

Calculate the prefix sum:

__m128i pfs = _mm_add_epi8(r, _mm_bsrli_si128(r, 1));
pfs = _mm_add_epi8(pfs, _mm_bsrli_si128(pfs, 2));
pfs = _mm_add_epi8(pfs, _mm_bsrli_si128(pfs, 4));
pfs = _mm_add_epi8(pfs, _mm_bsrli_si128(pfs, 8));

Find the highest 0:

__m128i iszero = _mm_cmpeq_epi8(pfs, _mm_setzero_si128());
return __builtin_clz(_mm_movemask_epi8(iszero) << 15) * 2;

The << 15 and * 2 appear because the resulting mask is 16 bits but the clz operates on 32 bits. The shift is one less than 16 because a zero in the top byte indicates that 1 group of 2 bits is taken, not zero.
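
Putting the snippets together, the whole routine might look like this (equally untested; the function name, the header, and the GCC-style __builtin_clz are assumptions on my part):

#include <stdint.h>
#include <tmmintrin.h>  // SSSE3 intrinsics; compile with e.g. -mssse3

int balance_point (uint32_t x)
{
    // spread the 2-bit groups of x over the 16 bytes of a vector
    __m128i spread = _mm_shuffle_epi8(_mm_setr_epi32(x, x >> 2, x >> 4, x >> 6),
                       _mm_setr_epi8(0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15));
    spread = _mm_and_si128(spread, _mm_set1_epi8(3));

    // replace 00 by -1, 11 by 1, and 01 and 10 by 0
    __m128i r = _mm_shuffle_epi8(_mm_setr_epi8(-1, 0, 0, 1,  0,0,0,0,0,0,0,0,0,0,0,0),
                                 spread);

    // byte-wise prefix sums
    __m128i pfs = _mm_add_epi8(r, _mm_bsrli_si128(r, 1));
    pfs = _mm_add_epi8(pfs, _mm_bsrli_si128(pfs, 2));
    pfs = _mm_add_epi8(pfs, _mm_bsrli_si128(pfs, 4));
    pfs = _mm_add_epi8(pfs, _mm_bsrli_si128(pfs, 8));

    // position of the highest zero prefix sum, scaled back to a bit count
    __m128i iszero = _mm_cmpeq_epi8(pfs, _mm_setzero_si128());
    return __builtin_clz(_mm_movemask_epi8(iszero) << 15) * 2;
}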

harold
  • 61,398
  • 6
  • 86
  • 164
  • Thank you for the answer. I understood the idea of the solution perfectly, but I'm still not sure what the instructions are doing (and the documentation I found is a bit obscure). Would you mind explaining or pointing me to a good reference? – Steven Dec 02 '16 at 16:01
  • @Steven most of them seem fairly innocent except `pshufb` (`_mm_shuffle_epi8`), is that what the problem is? – harold Dec 02 '16 at 17:40
  • Yes, and I also don't understand why you are masking with the arguments of the `_mm_setr_epi8` call. – Steven Dec 02 '16 at 18:24
2

This is a solution for 32-bit data using classical bit-twiddling techniques. The intermediate computation requires 64-bit arithmetic and logic operations. I have tried to stick to portable operations as far as possible. Required is an implementation of the POSIX function ffsll to find the least-significant 1-bit in a 64-bit long long, and a custom function rev_bit_duos that reverses the bit-duos in a 32-bit integer. The latter could be replaced with a platform-specific bit-reversal intrinsic, such as the __rbit intrinsic on ARM platforms.

The basic observation is that if a bit-group with an equal number of 0-bits and 1-bits can be extracted, it must contain an even number of bits. This means we can examine the operand in 2-bit groups. We can further restrict ourselves to tracking whether each 2-bit group increases (0b11), decreases (0b00), or leaves unchanged (0b01, 0b10) a running balance of bits. If we count positive and negative changes with separate counters, 4-bit counters will suffice unless the input is 0 or 0xffffffff, which can be handled separately. Based on comments to the question, these cases shouldn't occur. By subtracting the negative change count from the positive change count for each 2-bit group we can find at which group the balance becomes zero. There may be multiple such bit groups; we need to find the first one.

The processing can be parallelized by expanding each 2-bit group into a nibble that can then serve as a change counter. The prefix sum can be computed via an integer multiply with an appropriate constant, which provides the necessary shift & add operations at each nibble position. Efficient ways for parallel nibble-wise subtraction are well known; likewise, there is a well-known technique due to Alan Mycroft for detecting zero-bytes that is trivially adaptable to zero-nibble detection. The POSIX function ffsll is then applied to find the bit position of that nibble.

Slightly problematic is the requirement to extract a left-most bit group, rather than a right-most one, since Alan Mycroft's trick only works for finding the first zero-nibble from the right. Also, handling the prefix sum for a left-most bit group requires the use of a mulhi operation, which may not be easily available and may be less efficient than standard integer multiplication. I have addressed both of these issues by simply bit-reversing the original operand up front.
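
As a toy illustration of these two ingredients (my example, separate from the program below): multiplying per-nibble counts by 0x1111111111111111 yields nibble-wise prefix sums, and Mycroft's expression marks zero nibbles, but exactly only the rightmost one:

#include <stdint.h>
#include <stdio.h>

int main (void)
{
    const uint64_t lsb = 0x1111111111111111ULL;
    const uint64_t msb = 0x8888888888888888ULL;

    /* nibble k of counts * lsb holds counts[0] + ... + counts[k],
       valid as long as no prefix sum exceeds 15 */
    uint64_t counts = 0x1101ULL;  /* nibble values 1,0,1,1 (low to high) */
    printf ("%016llx\n", (unsigned long long)(counts * lsb)); /* 3333333333333211 */

    /* Mycroft zero-nibble detection: flags the zero nibble (position 7),
       but borrow propagation also flags all higher nibbles; only the
       rightmost flag is exact, hence the up-front bit reversal below */
    uint64_t v = 0x1111111101111111ULL;
    printf ("%016llx\n", (unsigned long long)((v - lsb) & ~v & msb)); /* 8888888880000000 */
    return 0;
}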

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>  /* ffsll; may require _GNU_SOURCE on glibc */
#include <limits.h>  /* CHAR_BIT */

/* Reverse bit-duos using classic binary partitioning algorithm */
inline uint32_t rev_bit_duos (uint32_t a)
{
    uint32_t m;
    a = (a >> 16) | (a << 16);                            // swap halfwords
    m = 0x00ff00ff; a = ((a >> 8) & m) | ((a << 8) & ~m); // swap bytes
    m = (m << 4)^m; a = ((a >> 4) & m) | ((a << 4) & ~m); // swap nibbles
    m = (m << 2)^m; a = ((a >> 2) & m) | ((a << 2) & ~m); // swap bit-duos
    return a;
}

/* Return the number of most significant (leftmost) bits that must be extracted
   to achieve an equal count of 1-bits and 0-bits in the extracted bit group.
   Return 0 if no such bit group exists.
*/   
int solution (uint32_t x)
{
    const uint64_t mask16 = 0x0000ffff0000ffffULL; // alternate half-words
    const uint64_t mask8  = 0x00ff00ff00ff00ffULL; // alternate bytes
    const uint64_t mask4h = 0x0c0c0c0c0c0c0c0cULL; // alternate nibbles, high bit-duo
    const uint64_t mask4l = 0x0303030303030303ULL; // alternate nibbles, low bit-duo
    const uint64_t nibble_lsb = 0x1111111111111111ULL;
    const uint64_t nibble_msb = 0x8888888888888888ULL; 
    uint64_t a, b, r, s, t, expx, pc_expx, nc_expx;
    int res;

    /* common path can't handle all 0s and all 1s due to counter overflow */
    if ((x == 0) || (x == ~0)) return 0;

    /* make zero-nibble detection work, and simplify prefix sum computation */
    x = rev_bit_duos (x); // reverse bit-duos

    /* expand each bit-duo into a nibble */
    expx = x;
    expx = ((expx << 16) | expx) & mask16;
    expx = ((expx <<  8) | expx) & mask8;
    expx = ((expx <<  4) | expx);
    expx = ((expx & mask4h) * 4) + (expx & mask4l);

    /* compute positive and negative change counts for each nibble */
    pc_expx =  expx & ( expx >> 1) & nibble_lsb;
    nc_expx = ~expx & (~expx >> 1) & nibble_lsb;

    /* produce prefix sums for positive and negative change counters */
    a = pc_expx * nibble_lsb;
    b = nc_expx * nibble_lsb;

    /* subtract positive and negative prefix sums, nibble-wise */
    s = a ^ ~b;
    r = a | nibble_msb;
    t = b & ~nibble_msb;
    s = s & nibble_msb;
    r = r - t;
    r = r ^ s;

    /* find first nibble that is zero using Alan Mycroft's magic */
    r = (r - nibble_lsb) & (~r & nibble_msb);
    res = ffsll (r) / 2;  // account for bit-duo to nibble expansion

    return res;
}

/* Return the number of most significant (leftmost) bits that must be extracted
   to achieve an equal count of 1-bits and 0-bits in the extracted bit group.
   Return 0 if no such bit group exists.
*/   
int reference (uint32_t x)
{
    int count = 0;
    int bits = 0;
    uint32_t mask = 0x80000000;
    do {
        bits++;
        if (x & mask) {
            count++;
        } else {
            count--;
        }
        x = x << 1;
    } while ((count) && (bits <= (int)(sizeof(x) * CHAR_BIT)));
    return (count) ? 0 : bits;
}

int main (void)
{
    uint32_t x = 0;
    do {
        uint32_t ref = reference (x);
        uint32_t res = solution (x);
        if (res != ref) {
            printf ("x=%08x  res=%u ref=%u\n\n", x, res, ref);
        }
        x++;
    } while (x);
    return EXIT_SUCCESS;
}
njuffa
  • 23,970
  • 4
  • 78
  • 130
  • I haven't yet grokked your whole algo. Could you avoid the bit-reverse if you had a trailing-zero count, i.e. a `ffsll` that counted from the other end? (Many platforms including x86 have this in asm, but I forget if there's as portable a C function for it.) Or do you need bits in that order for carry-propagation in the multiplies? – Peter Cordes Dec 07 '16 at 17:46
  • The Mycroft technique for zero-nibble detection *can* mark multiple zero-nibbles, but due to possible carry propagation it can only mark the rightmost (least significant) zero-nibble with 100% accuracy. I originally forgot about this, used the input straight (without initial reversal), using `clz` to search for the zero-nibble from the left (most significant bits), only to find that my code failed for many arguments > `0xc0000000`. Also, non-reversed processing requires a 64-bit `mulhi` to generate the prefix sums, which typically would have to be coded with intrinsics or inline assembly. – njuffa Dec 07 '16 at 17:59
1

A possible solution (for 32-bit integers). I'm not sure whether it can be improved or whether the use of lookup tables can be avoided. Here x is the input integer.

//Look-up table of 2^16 elements.
//The y-th element is associated with the first 2 bytes of x, whose value is y.
//If the wanted bit is in y, LUT1[y] is minus the position of the bit.
//If the wanted bit is not in y, LUT1[y] is the number of excess ones in y, minus 1 (between 0 and 15).
LUT1 = ....

//Look-up table of 16 * 2^16 elements.
//The y-th element is associated with two integers y' and y'' of 4 and 16 bits, respectively.
//y' is the number of excess ones in the first 2 bytes of x, minus 1.
//y'' is the last 2 bytes of x. The table contains the answer to return.
LUT2 = ....

if (LUT1[x>>16] < 0)
    return -LUT1[x>>16];

return LUT2[(LUT1[x>>16]<<16) | (x & 0xFFFF)];

This requires ~1MB for the lookup tables (17 · 2^16 entries). The same idea also works using 4 lookup tables (one per byte of x). This requires more operations but brings the memory down to about 12KB (49 · 2^8 = 12,544 entries).

LUT1 = ... //2^8 elements
LUT2 = ... //8 * 2^8 elements
LUT3 = ... //16 * 2^8 elements
LUT4 = ... //24 * 2^8 elements

y = x>>24
if(LUT1[y] < 0)
    return -LUT1[y];

y = (LUT1[y]<<8) | ((x>>16) & 0xFF);
if(LUT2[y] < 0)
    return -LUT2[y];

y = (LUT2[y]<<8) | ((x>>8) & 0xFF);
if(LUT3[y] < 0)
    return -LUT3[y];

return LUT4[(LUT3[y]<<8) | (x & 0xFF)];
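
For concreteness, here is one way the first table of the 4-table variant could be built (my reading of the comments above, untested; the later tables would continue the scan with the running balance seeded from the previous table's excess value):

#include <stdint.h>

/* For each byte value y, scan its bits from the most significant end; if the
   running 1s-minus-0s balance reaches zero at (1-based) position p, store -p;
   otherwise store the final excess of 1s over 0s, minus 1. Assumes the first
   bit of x is 1 (per the question), so unbalanced prefixes of x have a
   positive excess; entries for other bytes are don't-cares. */
static int8_t LUT1[256];

static void build_lut1 (void)
{
    for (int y = 0; y < 256; y++) {
        int balance = 0, p;
        for (p = 1; p <= 8; p++) {
            balance += (y & (0x80 >> (p - 1))) ? 1 : -1;
            if (balance == 0)
                break;
        }
        LUT1[y] = (int8_t)((balance == 0) ? -p : (balance - 1));
    }
}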
Steven
  • 209
  • 2
  • 10
  • I benchmarked the solution with 4 lookup tables vs @harold's solution by measuring the total time needed to compute the results for all 32 bit integers. On my machine this solution takes ~5600000 clock ticks while harold's takes ~25000000 clock ticks (per C's clock() function). – Steven Dec 02 '16 at 18:33
  • You can use `perf stat` to easily measure whole programs in core clock cycles, rather than real time. If you don't have memory bottlenecks, that removes CPU frequency scaling (speedstep / turbo) from the equation. (But private per-core L2 cache on Intel CPUs is only 256kiB, so maybe not the case). And BTW, it's a lot easier to read 5.6M vs. 25M. Also, it potentially matters a lot what hardware you tested on, and whether you used AVX. Compiler options in general can matter. – Peter Cordes Dec 07 '16 at 17:24
  • But still, microbenchmarking makes a 2^16 entry LUT look *way* better than it will be in a real program. In a microbenchmark, the whole thing stays hot in L2 (and the parts you use may stay hot in L1, depending on the input). In a real program, other code will evict the LUT between calls to this function. Also, you didn't say whether your microbenchmark tests latency or throughput. – Peter Cordes Dec 07 '16 at 17:28
  • Anyway, useful idea and worth testing, but microbenchmarking is hard, and doesn't always tell you what will be better in a real program. – Peter Cordes Dec 07 '16 at 17:28