
Is there an efficient algorithm (or processor instruction) to split a number (32-bit or 64-bit) into several numbers that each have only a single bit set?

I want to isolate each set bit in a number. For example,

input:
01100100

output:

01000000 
00100000
00000100

All that comes to mind is number & mask. Assembly or C++.

3 Answers


Yes, in a similar way to Brian Kernighan's algorithm for counting set bits, except that instead of counting the bits we extract and use the lowest set bit of every intermediate result:

while (number) {
    // extract lowest set bit in number
    uint64_t m = number & -number;
    // use m
    ...
    // remove lowest set bit from number
    number &= number - 1;
}

In modern x64 assembly, number & -number may be compiled to blsi, and number &= number - 1 may be compiled to blsr which are both fast, so this would only take a couple of efficient instructions to implement.

Since m is available, the lowest set bit could also be reset with number ^= m, but that may make it harder for the compiler to see that it can use blsr. blsr is the better choice because it depends directly only on number, so it keeps the loop-carried dependency chain short.
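
Putting it together, a minimal self-contained sketch that collects every single-bit mask into an array (the isolate_bits name and the array-filling wrapper are mine, not part of the answer above):

#include <stdint.h>

// Write each single-bit mask of x into out[] (at most 64 entries),
// return how many were written (= popcount of x).
static inline unsigned isolate_bits(uint64_t x, uint64_t out[64])
{
    unsigned n = 0;
    while (x) {
        out[n++] = x & -x;   // lowest set bit (BMI1: blsi)
        x &= x - 1;          // clear lowest set bit (BMI1: blsr)
    }
    return n;
}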

harold
  • Just for fun, I wrote an AVX512 version that produces the set of masks in memory using two `vpcompressd` stores. Probably not useful; I expect the normal use-case is looping over masks and doing something with each one. – Peter Cordes Oct 12 '19 at 20:27

The standard way is

while (num) {
    unsigned mask = num ^ (num & (num-1)); // This will have just one bit set
    ...
    num ^= mask;
}

For example, starting with num = 2019 you will get, in order:

1
2
32
64
128
256
512
1024
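
A minimal complete program reproducing that output (the main wrapper is just for illustration):

#include <stdio.h>

int main()
{
    unsigned num = 2019;
    while (num) {
        unsigned mask = num ^ (num & (num - 1)); // isolate lowest set bit
        printf("%u\n", mask);
        num ^= mask;                             // clear that bit
    }
}
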
6502
  • Why not `num & -num` though? – harold Oct 12 '19 at 17:13
  • @harold: Unary minus with an unsigned feels weird to me (and for example is forbidden in MISRA) – 6502 Oct 12 '19 at 17:18
  • 1
    Oh ok, that's odd, unary minus on unsigned is safe but it's unsafe on signed integers – harold Oct 12 '19 at 17:20
  • @harold: I know... still I think that when doing bit fiddling unsigned should be used and the unary minus is a weird operation in that context. Note by the way that `x & -x` and `x ^ (x & (x - 1))` generates (with g++) the same exact machine code. – 6502 Oct 12 '19 at 17:25
  • 1
    `num & -num` is the standard way that I've heard of to isolate the lowest set bit. And then you can separately clear the lowest bit. Your way has much less instruction-level parallelism if it compiles similar to the way it's written: the loop-carried dependency chain is 4 operations long: `-1`, `&`, `^`, then another `^`. The normal way is only 2, with isolating the lowest set bit at each step being an independent chain. – Peter Cordes Oct 12 '19 at 17:26
  • Since 2's complement is well known (and 2's complement operations are the same as binary), subtracting from `0` in unsigned is a cheap/efficient way to take advantage of the 2's complement identities like `-x = ~x + 1` [How to prove that the C statement -x, ~x+1, and ~(x-1) yield the same results?](//stackoverflow.com/q/2278518). `0 - x` is not particularly weird; we use unsigned precisely because it has well-defined wraparound behaviour. – Peter Cordes Oct 12 '19 at 17:29
  • Oh, fortunately compilers can see through your double `^` and just use `num & (num-1)` on the critical path, and only one XOR off the critical path to create the single-bit-set temporary in each loop iteration. https://godbolt.org/z/nomk-i. For x86-64 with BMI1, Harold's way compiles to blsi + blsr, yours to blsi + xor and some extra `mov` instructions. But without BMI1, your way has fewer `mov` instructions in a loop that isn't unrolled. And GCC is using `neg` when compiling yours without BMI1, so it is able to optimize the bit-isolate. – Peter Cordes Oct 12 '19 at 17:39

If you are going to iterate over the single-bit-isolated masks one at a time, generating them one at a time is efficient; see @harold's answer.


But if you truly just want all the masks, x86 with AVX512F can usefully parallelize this. (At least potentially useful depending on surrounding code. More likely this is just a fun exercise in applying AVX512 and not useful for most use-cases).

The key building block is AVX512F vpcompressd: given a mask (e.g. from a SIMD compare) it will shuffle the selected dword elements to contiguous elements at the bottom of a vector.

An AVX512 ZMM / __m512i vector holds 16x 32-bit integers, so we only need 2 vectors to hold every possible single-bit mask. Our input number is a mask that selects which of those elements should be part of the output. (No need to broadcast it into a vector and vptestmd or anything like that; we can just kmov it into a mask register and use it directly.)

See also my AVX512 answer on "AVX2 what is the most efficient way to pack left based on a mask?"

#include <stdint.h>
#include <immintrin.h>

// suggest 64-byte alignment for out_array
// returns count of set bits = length stored
unsigned bit_isolate_avx512(uint32_t out_array[32], uint32_t x)
{
    const __m512i bitmasks_lo = _mm512_set_epi32(
           1UL << 15,  1UL << 14,  1UL << 13,  1UL << 12,
           1UL << 11,  1UL << 10,  1UL << 9,   1UL << 8,
           1UL << 7,   1UL << 6,   1UL << 5,   1UL << 4,
           1UL << 3,   1UL << 2,   1UL << 1,   1UL << 0
    );
    const __m512i bitmasks_hi = _mm512_slli_epi32(bitmasks_lo, 16);    // compilers actually do constprop and load another 64-byte constant, but this is more readable in the source.

    __mmask16 set_lo = x;
    __mmask16 set_hi = x>>16;

    int count_lo = _mm_popcnt_u32(set_lo);  // doesn't actually cost a kmov, __mmask16 is really just uint16_t
    _mm512_mask_compressstoreu_epi32(out_array, set_lo, bitmasks_lo);
    _mm512_mask_compressstoreu_epi32(out_array+count_lo, set_hi, bitmasks_hi);

    return _mm_popcnt_u32(x);
}

Compiles nicely with clang on Godbolt, and with gcc apart from a couple of minor sub-optimal choices with mov, movzx, and popcnt, plus making a frame pointer for no reason. (It can also compile with -march=knl; it doesn't depend on AVX512BW or DQ.)

# clang9.0 -O3 -march=skylake-avx512
bit_isolate_avx512(unsigned int*, unsigned int):
        movzx   ecx, si
        popcnt  eax, esi
        shr     esi, 16
        popcnt  edx, ecx
        kmovd   k1, ecx
        vmovdqa64       zmm0, zmmword ptr [rip + .LCPI0_0] # zmm0 = [1,2,4,8,16,32,64,128,256,512,1024,2048,4096,8192,16384,32768]
        vpcompressd     zmmword ptr [rdi] {k1}, zmm0
        kmovd   k1, esi
        vmovdqa64       zmm0, zmmword ptr [rip + .LCPI0_1] # zmm0 = [65536,131072,262144,524288,1048576,2097152,4194304,8388608,16777216,33554432,67108864,134217728,268435456,536870912,1073741824,2147483648]
        vpcompressd     zmmword ptr [rdi + 4*rdx] {k1}, zmm0
        vzeroupper
        ret

On Skylake-AVX512, vpcompressd zmm{k1}, zmm is 2 uops for port 5. Latency from input vector -> output is 3 cycles, but latency from input mask -> output is 6 cycles. (https://www.uops.info/table.html / https://www.uops.info/html-instr/VPCOMPRESSD_ZMM_K_ZMM.html). The memory destination version is 4 uops: 2p5 + the usual store-address and store-data uops which can't micro-fuse when part of a larger instruction.

It might be better to compress into a ZMM reg and then store, at least for the first compress, to save total uops. The 2nd should probably still take advantage of the masked-store feature of vpcompressd [mem]{k1} so the output array doesn't need padding for it to step on. IDK if that helps with cache-line splits, i.e. whether masking can avoid replaying the store uop for the part with an all-zero mask in the 2nd cache line.
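
A rough sketch of that variant (the store_isolated name is mine; it assumes bitmasks_lo/bitmasks_hi are the same constants as in the function above, and relies on out_array being declared with 32 elements so an unmasked 16-element store at the start is always in bounds):

#include <stdint.h>
#include <immintrin.h>

static inline unsigned store_isolated(uint32_t *out_array, uint32_t x,
                                      __m512i bitmasks_lo, __m512i bitmasks_hi)
{
    __mmask16 set_lo = x, set_hi = x >> 16;
    int count_lo = _mm_popcnt_u32(set_lo);

    // compress into a register, then one plain full-width store
    __m512i lo_packed = _mm512_maskz_compress_epi32(set_lo, bitmasks_lo);
    _mm512_storeu_si512(out_array, lo_packed);

    // keep the masked compress-store for the high half so we don't write past the real end
    _mm512_mask_compressstoreu_epi32(out_array + count_lo, set_hi, bitmasks_hi);
    return _mm_popcnt_u32(x);
}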

On KNL, vpcompressd zmm{k1} is only a single uop. Agner Fog didn't test it with a memory destination (https://agner.org/optimize/).


This is 14 fused-domain uops for the front-end on Skylake-X for the real work (e.g. after inlining into a loop over multiple x values, so we could hoist the vmovdqa64 loads out of the loop; otherwise that's another 2 uops). So the front-end bottleneck is 14 / 4 = 3.5 cycles.

Back-end port pressure: 6 uops for port 5 (2x kmov(1) + 2x vpcompressd(2)): 1 iteration per 6 cycles. (Even on IceLake (instlatx64), vpcompressd is still 2c throughput, unfortunately, so apparently ICL's extra shuffle port doesn't handle either of those uops. And kmovw k, r32 is still 1/clock, so presumably still port 5 as well.)

(Other ports are fine: popcnt runs on port 1, and that port's vector ALU is shut down when 512-bit uops are in flight. But not its scalar ALU, the only one that handles 3-cycle latency integer instructions. movzx dword, word can't be eliminated, only movzx dword, byte can do that, but it runs on any port.)

Latency: integer result is just one popcnt (3 cycles). First part of the memory result is stored about 7 cycles after the mask is ready. (kmov -> vpcompressd). The vector source for vpcompressd is a constant so OoO exec can get it ready plenty early unless it misses in cache.


Compacting the 1<<0..15 constant by building it with shifts would be possible: e.g. loading 16-byte _mm_setr_epi8(0..15) with vpmovzxbd, then using that with vpsllvd on a vector of set1(1) (which you can get from a broadcast or generate on the fly with vpternlogd+shift). But that's probably not worth it even if you're writing by hand in asm (so it's your choice instead of the compiler), since this already uses a lot of shuffles, and constant-generation would take at least 3 or 4 instructions (each of which is at least 6 bytes long; EVEX prefixes alone are 4 bytes each).
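
For reference, a sketch of that constant-generation sequence in intrinsics (the make_bitmasks_lo helper name is mine):

#include <immintrin.h>

// Build the 1<<0 .. 1<<15 vector at runtime instead of loading a 64-byte constant.
static inline __m512i make_bitmasks_lo(void)
{
    __m128i counts8 = _mm_setr_epi8(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);
    __m512i counts  = _mm512_cvtepu8_epi32(counts8);             // vpmovzxbd zmm, xmm
    return _mm512_sllv_epi32(_mm512_set1_epi32(1), counts);      // vpsllvd: 1 << i per element
}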

I would generate the hi part with a shift from lo, instead of loading it separately, though. Unless the surrounding code bottlenecks hard on port 0, an ALU uop isn't worse than a load uop. One 64-byte constant fills a whole cache line.

You could compress the lo constant with a vpmovzxwd load: each element fits in 16 bits. Worth considering if you can hoist that outside of a loop so it doesn't cost an extra shuffle per operation.
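
A sketch of that idea (the bitmask16 table and load_bitmasks_lo name are mine; the vpmovzxwd widening should fold into the 32-byte load):

#include <stdint.h>
#include <immintrin.h>

alignas(32) static const uint16_t bitmask16[16] = {
    1u<<0, 1u<<1, 1u<<2,  1u<<3,  1u<<4,  1u<<5,  1u<<6,  1u<<7,
    1u<<8, 1u<<9, 1u<<10, 1u<<11, 1u<<12, 1u<<13, 1u<<14, 1u<<15 };

static inline __m512i load_bitmasks_lo(void)
{
    __m256i w = _mm256_load_si256((const __m256i *)bitmask16);   // 32-byte load
    return _mm512_cvtepu16_epi32(w);                             // vpmovzxwd zmm, ymm
}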


If you wanted the result in a SIMD vector instead of stored to memory, you could 2x vpcompressd into registers and maybe use count_lo to look up a shuffle control vector for vpermt2d. Possibly from a sliding-window on an array instead of 16x 64-byte vectors? But the result isn't guaranteed to fit in one vector unless you know your input had 16 or fewer bits set.
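
One way to sketch that idea: instead of vpermt2d, rotate the second compressed vector with vpermd (sliding-window index load keyed by count_lo) and merge-mask it on top of the first. This only makes sense when the input has at most 16 bits set, and all names here are mine:

#include <stdint.h>
#include <immintrin.h>

// 0..15 twice, so a 16-dword window starting at (16 - count_lo)
// is a rotate-by-count_lo shuffle control.
static const uint32_t rot_table[32] = {
    0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,
    0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 };

static inline __m512i bit_isolate_to_vector(uint32_t x, __m512i bitmasks_lo, __m512i bitmasks_hi)
{
    __mmask16 set_lo = x, set_hi = x >> 16;
    unsigned count_lo = _mm_popcnt_u32(set_lo);

    __m512i lo_packed = _mm512_maskz_compress_epi32(set_lo, bitmasks_lo);
    __m512i hi_packed = _mm512_maskz_compress_epi32(set_hi, bitmasks_hi);

    // rotate hi_packed up by count_lo elements, merging into lo_packed above position count_lo
    __m512i idx = _mm512_loadu_si512(rot_table + (16 - count_lo));
    __mmask16 upper = (__mmask16)(0xFFFFu << count_lo);
    return _mm512_mask_permutexvar_epi32(lo_packed, upper, idx, hi_packed);
}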


Things are much worse for 64-bit integers: with 8x 64-bit elements per vector we would need 8 vectors. So it's maybe not worth it vs. scalar, unless your inputs have lots of bits set.

You can do it in a loop, though, using vpsllq by 8 to move the bits within the qword vector elements. You'd think kshiftrq would be good, but with 4-cycle latency that's a long loop-carried dep chain. And you need a scalar popcnt of each 8-bit chunk anyway to adjust the pointer. So your loop should use shr / kmov and movzx / popcnt. (Using a counter += 8 and bzhi to feed popcnt would cost more uops.)

The loop-carried dependencies are all short (and the loop only runs 8 iterations to cover the 64 mask bits), so out-of-order exec should be able to nicely overlap work for multiple iterations. Especially if we unroll by 2 so the vector and mask dependencies can get ahead of the pointer update.

  • vector: vpslld immediate, starting from the vector constant
  • mask: shr r64, 8 starting with x. (Could stop looping when this becomes 0 after shifting out all the bits. This 1-cycle dep chain is short enough for OoO exec to zip through it and hide most of the mispredict penalty, when it happens.)
  • pointer: lea rdi, [rdi + rax*4] where RAX holds a popcnt result.

The rest of the work is all independent across iterations. Depending on surrounding code, we probably bottleneck on port 5 with the vpcompressd shuffles and kmov.
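
A rough intrinsics sketch of that loop (fixed 8 iterations, no early exit or unroll-by-2; the function name is mine, and it uses _mm512_slli_epi64 / vpsllq since the masks here are qword elements):

#include <stdint.h>
#include <immintrin.h>

// Store every single-bit mask of a 64-bit x, handling 8 bits of input per iteration.
unsigned bit_isolate64_avx512(uint64_t *out, uint64_t x)
{
    __m512i masks = _mm512_set_epi64(1ULL<<7, 1ULL<<6, 1ULL<<5, 1ULL<<4,
                                     1ULL<<3, 1ULL<<2, 1ULL<<1, 1ULL<<0);
    unsigned total = 0;
    for (int chunk = 0; chunk < 8; chunk++) {
        __mmask8 k = (uint8_t)x;                      // low 8 bits select which masks to store
        _mm512_mask_compressstoreu_epi64(out, k, masks);
        unsigned c = _mm_popcnt_u32((uint8_t)x);      // popcnt of this 8-bit chunk
        out += c;
        total += c;
        x >>= 8;                                      // shr: next 8 bits of the input
        masks = _mm512_slli_epi64(masks, 8);          // move the qword masks up by 8 bit positions
    }
    return total;
}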

Peter Cordes