
I want to sum the indexes of all set bits in a 64-bit integer. For example, for n = 0b10110 (bits 1, 2, and 4 set), the result is 1 + 2 + 4 = 7.

http://bitmath.blogspot.com/2023/01/weighted-popcnt.html?m=1 has an interesting implementation:

// sum of indexes of set bits
int A073642(uint64_t n)
{
    return __popcnt64(n & 0xAAAAAAAAAAAAAAAA) +
          (__popcnt64(n & 0xCCCCCCCCCCCCCCCC) << 1) +
          (__popcnt64(n & 0xF0F0F0F0F0F0F0F0) << 2) +
          (__popcnt64(n & 0xFF00FF00FF00FF00) << 3) +
          (__popcnt64(n & 0xFFFF0000FFFF0000) << 4) +
          (__popcnt64(n & 0xFFFFFFFF00000000) << 5);
}

(Godbolt: compiler-generated asm for x86-64-v3 (AVX2 like Haswell) for MSVC, GCC, and clang, which interestingly auto-vectorizes four of the popcounts.)

However, I am looking for a way to do it without using popcount so many times.

I'm trying to do this in assembly. The popcount operation is quite fast, but the whole thing should be doable in a smaller number of instructions, because every popcount repeats the same phases (especially if hardware popcount isn't available, like on some RISC-V cores, or on x86 before Nehalem).

This is something like a "bit puzzle" and I should probably use some smart masks and only elementary assembly instructions (arithmetic/logical operations, conditional move/set/jump), but I don't know how.

I can’t use AVX-512.

  • Could you clarify the concern about "using popcount so many times"? Population count is for the most part a low-latency, high-throughput instruction on the most commonly used processor (CPU and GPU) architectures, including modern x86-64 CPUs. See Agner Fog's instruction tables for details. – njuffa Apr 30 '23 at 07:12
  • If you only need the sum over a small bit field, you can use a lookup table of all possible bit patterns but that really depends on your use case – Homer512 Apr 30 '23 at 08:10
  • Hello and welcome to Stack Overflow! Do you have access to any SIMD instruction set extensions? – fuz Apr 30 '23 at 11:01
  • You had tagged this [intel][64-bit]. Did you specifically mean Intel CPUs, as opposed to AMD? (Intel CPUs have only 1/clock scalar popcnt, vs. 4/clock popcnt on Zen). – Peter Cordes Apr 30 '23 at 11:02
  • I'm thinking of doing this with AVX-512: load a vector (0..63) into a register masked with `n`, then sum horizontally. However, I need to know if OP can use AVX-512 to decide if this is applicable. – fuz Apr 30 '23 at 11:06
  • Do you need this for multiple `n` values? If you're summing the results, a "positional popcount" could let you apply the weighting once at the end, to the set of sums for each bit-position. https://github.com/lemire/FastFlagStats / https://github.com/mklarqvist/positional-popcount . Or partially apply some of those techniques? – Peter Cordes Apr 30 '23 at 11:07
  • @PeterCordes Great idea! See also my own [positional population count routines](https://github.com/clausecker/pospop). In Plan 9 syntax unfortunately, but it's easy to translate to Intel/AT&T. – fuz Apr 30 '23 at 11:08
  • @fuz: I think if we have `vpopcntb` (Ice Lake, AVX512BITALG), we wouldn't actually want to break up separate counts for each bit, maybe just 3 `vpandd` masks and `vpopcntq` + accumulate for the sub-byte fields (into separate accumulators which we can shift and add later). And one `vpopcntb` which we accumulate without overflow to get sums for each byte. (A few `vpaddb` in an unrolled loop, but not so many that overflow is possible, then widen and split to odd/even a couple times). Then outside the loop we can scale and combine the counts for each byte appropriately. – Peter Cordes Apr 30 '23 at 11:16
  • @PeterCordes I'm not sure how your approach works honestly. – fuz Apr 30 '23 at 11:17
  • @fuz: If we're summing `A073642(arr[i])` over a large array: I'm thinking if we have the sum of popcounts for each of the first 3 masks, and the sum of popcounts for each byte-position separately, we can shuffle and sum outside the loop to get counts for each of the last 3 masks. (e.g. the `FFFF0000` mask sums the counts for 2 of every 4 bytes). Maybe it would be cheaper to do some zero-masking `vpaddb` inside the loop on the `vpopcntb` results to get just the 3 accumulators we want instead of widening each byte separately to 4 counters per dword of input. – Peter Cordes Apr 30 '23 at 11:23
  • This function A073642() appears to disregard the value in the lowest order bit position, taking the index there as zero. Is this really what you want? – Simon Goater Apr 30 '23 at 11:31
  • Now we know you cannot use AVX-512. Could you tell us which instruction set extensions we can use? There are lots of useful instruction set extensions available for x86-64. Without knowing which one to use, this question is hard to answer. – fuz Apr 30 '23 at 15:35
  • Random observation: if `n = sum(a_i*2^i, i=0..63)`, where `a_i` is 0 or 1, then let `fn(x) = sum(a_i*x^i, i=0..63)` so that `n=fn(2)`. Then the result is `A073642(n) = 2*fn'(2)`. I wonder if there is some way to use that? – Nate Eldredge Apr 30 '23 at 18:02
  • @fuz: I don't think full positional popcount would be the best bet for `sum += A073642(arr[i])`. Instead I think Harley-Seal would be extremely good, since it preserves positional counts in its temporaries, so you just need a SIMD way to do weighted counts every 16 vectors, like I showed in my answer. But I haven't looked at positional popcount recently, I forget how expensive it is. – Peter Cordes Apr 30 '23 at 20:42
  • @Man in white: please don't vandalize your questions by removing all the content and code. In the form you left it, there was no code to show that indices were counting from 0 (so the low bit was ignored), and no link to the blog article. – Peter Cordes Apr 30 '23 at 23:42
  • Is there any sort of context within which you're solving this problem that may inform the solutions? For example working only with small `n`, or computing this for multiple `n`, or summing the results, that sort of thing. If this is just in a vacuum, I would just say that the code you have is probably a decent approach; at least it's not obvious to me that there should be a better solution as far as drop-in replacements go. – harold Apr 30 '23 at 23:49
  • @PeterCordes The Harley-Seal algorithm effectively computes the positional population count. For population counts, the place-value vectors are summed horizontally (i.e. their population is counted). For positional population counts, we treat them as the row vectors of a matrix and transpose the matrix, giving the desired counts. It's just a slight difference in post-processing. – fuz May 01 '23 at 10:18

3 Answers


A naive but portable way to do this is a look-up table. I compared its performance against the function above: when compiled with the -mpopcnt flag, the popcount version is understandably faster, about 2-3x at around 30 cycles per call. Without that flag, this look-up-table function is something like twice as fast as the popcount version.

static inline uint32_t A073642_nopopcnt(uint64_t n) {
  static const uint8_t nibbles[16][16] = {
{0, 0, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 5, 6, 6},
{0, 4, 5, 9, 6, 10, 11, 15, 7, 11, 12, 16, 13, 17, 18, 22},
{0, 8, 9, 17, 10, 18, 19, 27, 11, 19, 20, 28, 21, 29, 30, 38},
{0, 12, 13, 25, 14, 26, 27, 39, 15, 27, 28, 40, 29, 41, 42, 54},
{0, 16, 17, 33, 18, 34, 35, 51, 19, 35, 36, 52, 37, 53, 54, 70},
{0, 20, 21, 41, 22, 42, 43, 63, 23, 43, 44, 64, 45, 65, 66, 86},
{0, 24, 25, 49, 26, 50, 51, 75, 27, 51, 52, 76, 53, 77, 78, 102},
{0, 28, 29, 57, 30, 58, 59, 87, 31, 59, 60, 88, 61, 89, 90, 118},
{0, 32, 33, 65, 34, 66, 67, 99, 35, 67, 68, 100, 69, 101, 102, 134},
{0, 36, 37, 73, 38, 74, 75, 111, 39, 75, 76, 112, 77, 113, 114, 150},
{0, 40, 41, 81, 42, 82, 83, 123, 43, 83, 84, 124, 85, 125, 126, 166},
{0, 44, 45, 89, 46, 90, 91, 135, 47, 91, 92, 136, 93, 137, 138, 182},
{0, 48, 49, 97, 50, 98, 99, 147, 51, 99, 100, 148, 101, 149, 150, 198},
{0, 52, 53, 105, 54, 106, 107, 159, 55, 107, 108, 160, 109, 161, 162, 214},
{0, 56, 57, 113, 58, 114, 115, 171, 59, 115, 116, 172, 117, 173, 174, 230},
{0, 60, 61, 121, 62, 122, 123, 183, 63, 123, 124, 184, 125, 185, 186, 246}};
  uint32_t i, sumsetbitpos = 0;
  for (i=0;i<16;i++) {
    sumsetbitpos += nibbles[i][(n >> (i << 2)) & 0xf];
  }
  return sumsetbitpos;
}
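
For reference, here is a sketch of how such a table could be generated (a hypothetical helper, not part of the benchmarked code): entry nibbles[i][j] is the sum of the bit indices of nibble value j, with each set bit offset by the nibble's base index 4*i.

#include <stdio.h>

// Prints the 16x16 table used above: nibbles[i][j] = sum of (4*i + b) over the set bits b of j.
int main(void) {
  for (int i = 0; i < 16; i++) {
    printf("{");
    for (int j = 0; j < 16; j++) {
      int sum = 0;
      for (int b = 0; b < 4; b++)
        if ((j >> b) & 1) sum += 4 * i + b;
      printf("%d%s", sum, j < 15 ? ", " : "");
    }
    printf("}%s\n", i < 15 ? "," : "};");
  }
  return 0;
}
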
Simon Goater
  • How is 16 shift/mask and table lookup operations going to be faster than 6 mask / popcnt instructions? Maybe if you were considering a CPU without hardware popcnt. – Peter Cordes Apr 30 '23 at 12:29
  • @PeterCordes I didn't say it would be faster. It's probably slower, but doesn't assume there's a 64bit popcount instruction available. The OP asked if there is ' a way to do it without using popcount so many times?'. So, do they mean using popcount fewer times, or not at all? Who knows. I thought there was some value in posting a naive option. – Simon Goater Apr 30 '23 at 12:38
  • For 64-bit popcount, `__builtin_popcountll`. Don't forget to compile with `-march=native` or x86 `-mpopcnt` to let GCC use hardware popcnt instructions. The OP's function should have maybe 1 call per 6 cycle throughput on a modern Intel x86-64, best case, if it can inline into a loop and reuse those mask constants from registers. (Especially if it can use `andn` to copy-and-and with inverted masks, saving front-end bandwidth.) Possibly even faster on Zen, with better than 1/clock throughput for `popcnt r64` (https://uops.info). But 500 cycles seems way too long for this, too. – Peter Cordes Apr 30 '23 at 16:24
  • https://godbolt.org/z/q6q88MKMn shows GCC's attempt at the OP's function with `-O3 -march=x86-64-v2` (Nehalem instruction-set). Looks like they typoed `__popcnt64`; the x86 intrinsic in `immintrin.h` is `_popcnt64` (one underscore). Anyway, GCC makes decent use of LEA to do some of the shifting and adding. They could do even better. And of course setting up the 64-bit masks into regs takes large instructions. But yeah if it has to call a helper function for each popcount, that would be slower, but still not 500 cycles I'd expect. – Peter Cordes Apr 30 '23 at 16:30
  • How did you time it, on what hardware? Did you do warm-up runs? Or was that RDTSC cycles while the CPU was still at idle frequency? – Peter Cordes Apr 30 '23 at 16:32
  • I'm now using RDTSC cycles over 1000 incremented values starting from a command line input value, after compiling with -mpopcnt; as expected the popcount version is faster at about 30 cycles vs 50 to 90. I don't know why there's so much variation in my function's performance from one run to another. I only measured one call at first when I wrote 500 to 600 cycles, and that was without any -march or -mpopcnt flags. – Simon Goater Apr 30 '23 at 16:38
  • @PeterCordes the MSVC intrinsic (intrin.h) has two underscores – harold Apr 30 '23 at 16:58
  • Re: microbenchmarking in general: [Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987) has some info on pitfalls to avoid. – Peter Cordes May 01 '23 at 01:10

If you're doing total += A073642(arr[i]) over a whole array (like 256 bytes or preferably larger), AVX2 with Harley Seal would be very very good. The bitwise boolean work stays the same cost, only the popcount part gets more expensive (to do the weighting).

If you want separate weighted-popcount results for multiple input numbers (not summing them), you can't use Harley Seal, but SIMD is still effective. This answer shows a way to use SIMD for just a single number (which may be about equal speed to 6x scalar popcnt), but it would be not much more expensive to get 2 results for 2 inputs, or with AVX2, 4 results for 4 inputs. You might need 2 different vector constants for the weights of odd vs. even nibbles.


The SSSE3 / AVX2 pshufb parallel-lookup popcount strategy can be adapted. It might not actually be better than scalar for a single input. (Especially not if you have hardware popcnt support, which has 1/clock throughput on Intel, 3 or 4 per clock on AMD Zen 2 or Zen 3 respectively: https://uops.info/ )

On a CPU with fast SSSE3 pshufb but without hardware popcnt (e.g. 2nd-gen Core 2, aka Penryn), this might even be good. All other CPUs either have fast scalar popcnt, or don't have SSSE3, or have it but with slow 128-bit shuffles like pshufb (like on first-gen Core 2, Conroe / Merom).

Instead of doing 6 separate popcounts on 6 AND results, you can do two sets of lookups: one for the sub-nibble masks, giving a result from 0 to 6 within each nibble for the cnt(x & 0xa) + 2*cnt(x & 0xc), and another that just counts normally within each nibble. But instead of just adding all nibbles into one popcount for the whole integer, you keep those nibble popcounts separate and combine them in different ways, so you don't have to redo the splitting and lookup part.

#include <stdint.h>
#include <immintrin.h>

// https://github.com/WojciechMula/sse-popcount/blob/master/popcnt-sse-lookup.cpp  was a starting point for this code.
uint64_t weighted_popcnt_SSE_lookup(uint64_t n) {


    const __m128i flat_lookup = _mm_setr_epi8(
        /* 0 */ 0, /* 1 */ 1, /* 2 */ 1, /* 3 */ 2,
        /* 4 */ 1, /* 5 */ 2, /* 6 */ 2, /* 7 */ 3,
        /* 8 */ 1, /* 9 */ 2, /* a */ 2, /* b */ 3,
        /* c */ 2, /* d */ 3, /* e */ 3, /* f */ 4
    );
    const __m128i weighted_lookup = _mm_setr_epi8(
        0, 0, 1, 1, 2, 2, 3, 3,
        3, 3, 4, 4, 5, 5, 6, 6
    );

    const __m128i low_mask = _mm_set1_epi8(0x0f);

    // normally we'd split odd/even nibbles from 128 bits of data.
    // but we're using only a single 64-bit scalar input.
//    __m128i v = _mm_set_epi64x(n>>4, n);  // high nibbles in the high half, low nibbles in the low half of a vector.
  //                                   // movq xmm0, rdi ; movdqa ; psrlw ; punpcklqdq

    __m128i v = _mm_cvtsi64_si128(n);                 // movq
    v = _mm_unpacklo_epi8( v, _mm_srli_epi16(v,4) );  // movdqa ; psrlw ; punpcklbw
               // nibbles unpacked in order: byte 3 of v is n>>(3*4) & 0xf

    __m128i vmasked = _mm_and_si128(v, low_mask);  // pand
    __m128i counts01 = _mm_shuffle_epi8(weighted_lookup, vmasked);  // 0 to 6 according to popcnt(v & 0xa) + 2*popcnt(v & 0xc) in each nibble
    __m128i nibble_counts = _mm_shuffle_epi8(flat_lookup, vmasked);  // SSSE3 pshufb

/*
Each nibble has a different weight, so scale each byte differently when or before summing
          (__popcnt64(n & 0xF0F0F0F0F0F0F0F0) << 2) +
          (__popcnt64(n & 0xFF00FF00FF00FF00) << 3) +
          (__popcnt64(n & 0xFFFF0000FFFF0000) << 4) +
          (__popcnt64(n & 0xFFFFFFFF00000000) << 5);
                                     ^    ^
                                     |    \ 2nd nibble counted once
                                  this nibble scaled by <<3 and <<4.
                  Or <<1 and <<2 if we factor out the <<2 common to all nibble-granularity chunks
*/

   // oh duh, these are just the integers from 15 down to 0.
   __m128i scale_factors = _mm_set_epi8(0b1111<<2, 0b1110<<2, 0b1101<<2, 0b1100<<2, 0b1011<<2, 0b1010<<2, 0b1001<<2, 0b1000<<2,
                                        0b0111<<2, 0b0110<<2, 0b0101<<2, 0b0100<<2, 0b0011<<2, 0b0010<<2, 0b0001<<2, 0b0000<<2);

   // neither input has the MSB set in any byte, so it doesn't matter which one is the unsigned or signed input.
   __m128i weighted_nibblecounts = _mm_maddubs_epi16(nibble_counts, scale_factors);  // SSSE3 pmaddubsw
   // Even if we didn't scale by the <<2 here, max element value is 15*15 + 14*15 = 435 so wouldn't fit in a byte.
   // Unfortunately byte multiply is only available with horizontal summing of pairs.


    counts01 = _mm_sad_epu8(counts01, _mm_setzero_si128()); // psadbw  hsum unsigned bytes into qword halves

    weighted_nibblecounts = _mm_madd_epi16(weighted_nibblecounts, _mm_set1_epi32(0x00010001));  // pmaddwd sum pairs of words to 32-bit dwords.  1 uop but worse latency than shift/add

    // sum the dword elements and counts01
    __m128i high = _mm_shuffle_epi32(weighted_nibblecounts, _MM_SHUFFLE(3,3, 1,1));  // pshufd or movhlps

    weighted_nibblecounts = _mm_add_epi32(weighted_nibblecounts, counts01);  // add to the bottom dword of each qword, in parallel with pshufd latency
    weighted_nibblecounts = _mm_add_epi32(weighted_nibblecounts, high);

    high = _mm_shuffle_epi32(weighted_nibblecounts, _MM_SHUFFLE(3,2, 3,2));  // pshufd, movhlps, or punpckhqdq
    weighted_nibblecounts = _mm_add_epi32(weighted_nibblecounts, high);

    unsigned result = _mm_cvtsi128_si32(weighted_nibblecounts);  // movd  extract the low dword
    return result;


   // A different nibble interleave order, like 15,0, 14,1 etc. would make the max sum 225 (15*15), allowing psadbw for hsumming if we factor out the <<2
   // weighted_nibblecounts = _mm_sad_epu8(weighted_nibblecounts, _mm_setzero_si128());  // psadbw
}

Compiled (Godbolt) with clang -O3 -march=penryn (where it might be the best option):

weighted_popcnt_SSE_lookup:             # @weighted_popcnt_SSE_lookup
        movq    xmm0, rdi
        movdqa  xmm1, xmm0
        psrlw   xmm1, 4
        punpcklbw       xmm0, xmm1              # xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3],xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]
        pand    xmm0, xmmword ptr [rip + .LCPI0_0]
        movdqa  xmm1, xmmword ptr [rip + .LCPI0_1] # xmm1 = [0,0,1,1,2,2,3,3,3,3,4,4,5,5,6,6]
        movdqa  xmm2, xmmword ptr [rip + .LCPI0_2] # xmm2 = [0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4]
        pshufb  xmm2, xmm0
        pmaddubsw       xmm2, xmmword ptr [rip + .LCPI0_3]
        pshufb  xmm1, xmm0
        pxor    xmm0, xmm0
        pmaddwd xmm2, xmmword ptr [rip + .LCPI0_4]
        psadbw  xmm0, xmm1
        pshufd  xmm1, xmm2, 245                 # xmm1 = xmm2[1,1,3,3]
        paddd   xmm2, xmm0
        paddd   xmm2, xmm1
        pshufd  xmm0, xmm2, 238                 # xmm0 = xmm2[2,3,2,3]
        paddd   xmm0, xmm2
        movd    eax, xmm0
           # 19 single-uop instructions, 17 not counting loading constants
           # Or 16 not counting pxor-zeroing, but that gets overwritten without AVX non-destructive destination encodings.
        ret

For Conroe, the other CPU with SSSE3 but not popcnt, you'd choose different shuffles. 64-bit granularity punpcklqdq is more efficient than punpcklbw although pshufb is still multiple uops. e.g. you'd _mm_set_epi64x(n>>4, n) with movq / movdqa / psrlw / punpcklqdq so the high nibbles of each pair would be in the high half of the vector, not alternating high/low in place-value order. And in the horizontal sum, you'd use movhlps into an old XMM register instead of pshufd to copy-and-shuffle, because 128-bit-wide shuffles with granularity narrower than 64-bit are slower. (See Fastest way to do horizontal SSE vector sum (or other reduction) for details). The Conroe-friendly version would probably not be any worse on Penryn.
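
A minimal sketch of that Conroe-friendly horizontal sum (adapting the SSE2 reduction from the linked answer; it assumes the partial sums end up as 32-bit elements in one __m128i, as in the Penryn version above):

#include <stdint.h>
#include <immintrin.h>

// 64-bit-granularity copy-and-shuffles (punpckhqdq, then pshuflw on the low qword)
// instead of pshufd, since shuffles narrower than 64-bit are slow on Conroe/Merom.
static inline uint32_t hsum_epi32_conroe(__m128i v)
{
    __m128i hi64  = _mm_unpackhi_epi64(v, v);                            // punpckhqdq
    __m128i sum64 = _mm_add_epi32(v, hi64);
    __m128i hi32  = _mm_shufflelo_epi16(sum64, _MM_SHUFFLE(1, 0, 3, 2)); // move the high dword of the low qword down
    __m128i sum32 = _mm_add_epi32(sum64, hi32);
    return (uint32_t)_mm_cvtsi128_si32(sum32);                           // movd
}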

TODO: Try an extra pshufb to put bytes into 15,0,14,1,13,2,... order so we can psadbw after pmaddubsw instead of needing to successively widen elements from word to dword to qword. Probably still have to psadbw the counts01 vector separately, since there's no earlier point where we can scale the whole-nibble counts by <<2 before their psadbw without overflowing a byte and losing the ability to use psadbw.

If you were doing this over an array, you'd want to work on 128 bits of input data at once, splitting an input vector into odd/even nibbles rather than loading 64 bits at a time and unpacking. And keep the per-nibble counts separate until they're close to overflowing a byte, like Mula's popcnt_AVX2_lookup which uses vpaddb over a few iterations, deferring vpsadbw until later. (Same technique as in How to count character occurrences using SIMD, except you're adding values from 0..4 instead of 0 or 1.) And don't do the full hsum down to one scalar until after the loop: you want paddd or paddq in the outer loop to accumulate vectors of counts. So again, the same techniques as Mula's popcnt_AVX2_lookup or popcnt_SSE_lookup; a skeleton of that accumulation pattern is sketched below.
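
A rough skeleton of the accumulation pattern (AVX2 intrinsics; counts_per_vec is a hypothetical stand-in for the per-vector lookup work, returning 32 byte counts each in the 0..6 range):

#include <stdint.h>
#include <stddef.h>
#include <immintrin.h>

// Hypothetical helper: the per-nibble lookup work for one 32-byte input vector.
static __m256i counts_per_vec(__m256i v);

// Deferred-widening accumulation, same idea as Mula's popcnt_AVX2_lookup:
// vpaddb inside a short inner loop, vpsadbw once per chunk, vpaddq across the array.
static uint64_t accumulate_byte_counts(const __m256i *data, size_t nvecs)
{
    __m256i qword_totals = _mm256_setzero_si256();
    size_t i = 0;
    while (i < nvecs) {
        __m256i byte_acc = _mm256_setzero_si256();
        size_t chunk_end = (i + 32 < nvecs) ? i + 32 : nvecs;   // 32 iterations * 6 max = 192, no byte overflow
        for (; i < chunk_end; i++)
            byte_acc = _mm256_add_epi8(byte_acc, counts_per_vec(_mm256_loadu_si256(data + i)));
        // For the weighted version you'd apply per-position weights here (e.g. pmaddubsw
        // against a scale-factor vector) before widening; this skeleton just sums flat.
        qword_totals = _mm256_add_epi64(qword_totals,
                                        _mm256_sad_epu8(byte_acc, _mm256_setzero_si256()));
    }
    __m128i lo = _mm256_castsi256_si128(qword_totals);           // final horizontal sum, outside the loop
    __m128i hi = _mm256_extracti128_si256(qword_totals, 1);
    __m128i s  = _mm_add_epi64(lo, hi);
    s = _mm_add_epi64(s, _mm_unpackhi_epi64(s, s));
    return (uint64_t)_mm_cvtsi128_si64(s);
}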

This needed 5 vector constants, vs. the 6x popcnt version needing 6 scalar integer constants in registers. (Actually 5 scalar constants, since the final mask can instead be shr rax, 32 / popcnt eax, eax.) It should be possible to chain LEAs to shift-and-add popcnt results, like lea eax, [rdx + rax*2], like Horner's Rule for polynomials: the <<2 and <<3 counts can combine that way, so can the <<4 and <<5 counts, and then another lea like lea eax, [rdx + rax*4] combines those pairs.

With masks already in registers (already inverted for use with andn to copy-and-AND, if BMI1 is available), a scalar popcnt version really doesn't take many instructions, assuming you have hardware popcnt. Probably just 5x andn, 1x shr, 6x popcnt, and 5x lea. So that's 17 instructions, the same number of front-end uops as the SIMD version.
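
A sketch of what that scalar sequence could look like (assumptions: input in rdi, the five inverted masks already hoisted into r8..r12; register allocation here is arbitrary):

# r8..r12 = ~0xAAAA... , ~0xCCCC... , ~0xF0F0... , ~0xFF00FF00... , ~0xFFFF0000FFFF0000
andn    rax, r8, rdi          # n & 0xAAAA...
popcnt  rax, rax
andn    rcx, r9, rdi          # n & 0xCCCC...
popcnt  rcx, rcx
lea     rax, [rax + rcx*2]    # cnt(AA) + 2*cnt(CC)
andn    rcx, r10, rdi         # n & 0xF0F0...
popcnt  rcx, rcx
andn    rdx, r11, rdi         # n & 0xFF00FF00...
popcnt  rdx, rdx
lea     rcx, [rcx + rdx*2]    # cnt(F0) + 2*cnt(FF00), still needs a *4
andn    rdx, r12, rdi         # n & 0xFFFF0000FFFF0000
popcnt  rdx, rdx
shr     rdi, 32               # high dword replaces the 0xFFFFFFFF00000000 mask
popcnt  rdi, rdi
lea     rdx, [rdx + rdi*2]    # cnt(FFFF0000) + 2*cnt(hi32), still needs a *16
lea     rcx, [rcx + rdx*4]    # the <<2..<<5 terms, relative to a common *4
lea     eax, [rax + rcx*4]    # total = (<<0,<<1 part) + 4*(the rest)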


Footnote 1: SIMD popcount and Harley-Seal

See http://0x80.pl/articles/sse-popcount.html / https://github.com/WojciechMula/sse-popcount for details of the strategy, which they call sse-lookup or avx2-lookup. Along with others such as harley-seal which accumulate weighted counters across vectors of input data to amortize the lookup costs.

If you're summing this weighted-popcount across multiple inputs, perhaps Harley-Seal could be even more useful than for normal popcnt. It should be a nearly drop-in replacement since Harley-Seal already works by keeping counts for each bit-position separate in its temporaries. The vertical bitwise operations work like a 4-bit binary carry-save adder, with 1 bit in each vector.

So actually 128 or 256 of these counters in parallel, a separate one for each bit position. Inside the loop you do total += 16 * popcount(sixteens); after processing 16 vectors of input, keeping the ones, twos, fours, and eights vectors rolling across iterations. See https://github.com/WojciechMula/sse-popcount/blob/master/popcnt-sse-harley-seal.cpp
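
The core carry-save-adder step from that code looks like this (AVX2 sketch; the Harley-Seal loop chains these to maintain the ones/twos/fours/eights vectors):

#include <immintrin.h>

// Full-adder on three input vectors, one independent adder per bit position:
// *lo gets the sum bit (a ^ b ^ c), *hi gets the carry bit.
static inline void csa256(__m256i *hi, __m256i *lo, __m256i a, __m256i b, __m256i c)
{
    __m256i u = _mm256_xor_si256(a, b);
    *hi = _mm256_or_si256(_mm256_and_si256(a, b), _mm256_and_si256(u, c));
    *lo = _mm256_xor_si256(u, c);
}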

A weighted sum only has to change the actual popcount step to weighted_popcount; the bitwise operations stay the same cost. With AVX2 and normal popcount on Skylake, Harley-Seal is only a bit faster than vpshufb lookup over decent-sized arrays (like 2048 bytes). On Intel CPUs, AVX2 lookup and Harley-Seal are almost twice as fast as scalar popcnt. That might not be the case for AMD Zen where scalar popcnt has 3 to 4 per clock throughput (and loads have 3/clock throughput on Zen 3 and later). (https://uops.info/)

Making the popcount step more complicated (and slower) only affects a small part of the cost of Harley Seal, but almost all the work in a pure lookup version that doesn't use Harley-Seal, so it's much more attractive.


Other possible ideas:

If you were summing over many input n values, positional popcount could set you up to apply the weights at the end. https://github.com/lemire/FastFlagStats and https://github.com/mklarqvist/positional-popcount

But it's probably more work than necessary to keep all 64 counts fully separate until the end; it's better to only partially apply those techniques and combine partial counts according to your weights at intermediate steps. I expect Harley-Seal is the best bet.

If AVX-512 were available (especially Ice Lake's AVX512VPOPCNTDQ or AVX512BITALG), you could use vpopcntb instead of vpshufb, using similar ideas to what I did for SSSE3.


Clang can optimize the 6x popcnt version into asm that does two scalar popcounts, and one AVX2 popcount of the input broadcasted and masked 4 different ways, in different elements of a YMM vector. In general it can vectorize sums of popcounts, including this case of the sum of 6 popcounts.

Peter Cordes
  • Have you tried to benchmark your function? When I compare it the way I described below to my function and the popcount version, yours is generally a bit faster than the popcount version. I get quite a lot of variation in the performance from one run to another, and sometimes yours is slower than the popcount version. I'm interested to know how you benchmark a function like this, and do you get similar results to what I just described? Good job making use of SSE by the way. If you're going to assume SSSE3 support though, it isn't a big step to assume popcnt support. – Simon Goater Apr 30 '23 at 22:33
  • @SimonGoater: I didn't try benchmarking it. I don't have a Penryn CPU (although I do have a Conroe), and like I said in the answer, that's the one x86 microarchitecture that has `pshufb` but not hardware `popcnt`. As you say, it wasn't a big gap between when those were new. – Peter Cordes Apr 30 '23 at 23:19
  • @SimonGoater: To benchmark it, I'd normally put it in a loop calling it 100 million times, and run the whole process under `perf stat ./a.out`. (Aiming for a total benchmark time between about 100 ms and 1 sec; long enough that startup overhead is negligible, and still short enough to not wait long.) Preferably in a static executable with no libc to minimize startup time, like a hand-written asm `_start` if I'm benchmarking an asm function. (This question is tagged [assembly] not [c].) Like in [Can x86's MOV really be "free"? Why can't I reproduce this?](//stackoverflow.com/q/44169342) – Peter Cordes Apr 30 '23 at 23:20
  • @SimonGoater: Before spending any time benchmarking, I was wondering if the OP would say anything about whether they were interested in summing this function over an array of uint64_t elements. If so, that's much more amenable to SIMD optimization. But I guess probably not, they probably just saw that blog article and were curious about implementation possibilities. – Peter Cordes Apr 30 '23 at 23:47
  • Thanks for the response. Isn't using a static executable cheating/unrealistic? Would you expect it to make any difference with something like this when there are no library calls in the function anyway? I started with 1000 calls and upped it to 100000 calls but still saw significant variations. Maybe I'll try again with 100M calls to see if I get stable performance numbers. Is it context switching that causes the variations do you think? – Simon Goater May 01 '23 at 07:47
  • @SimonGoater: We want to benchmark the function under test, not process startup overhead. A static executable (especially with no CRT startup code, just your own asm `_start`) minimizes that. It's "cheating" only if the original goal had been to optimize a normal *program* that computed this from a command line arg, rather than *function*. Re: static being faster: try `strace ./a.out` and see how many system calls get made in a program built from `int main(){}` vs. `__attribute__((force_align_arg_pointer)) _start(){ asm("mov $231,%eax; syscall" : "memory"); }` (with `gcc -nostdlib`) – Peter Cordes May 01 '23 at 12:48
  • @SimonGoater: Variability can come from startup overhead like TLB misses and page faults, and CPU frequency ramping up from idle to max-turbo at some unpredictable point after starting the process. (RDTSC doesn't count core clock cycles, it's wall-clock equivalent, unlike the `cycles` event from `perf stat`.) Running it a few times back-to-back (from one command line) can let one run work as warm-up for another, or stuff like `perf stat -r10 ./test` to run it multiple times, but that doesn't discard outliers. – Peter Cordes May 01 '23 at 12:52
  • I wasn't timing the whole program. I have a function which has the RDTSC in assembly to return the cycle number, I called it before and after each batch of calls, so 1000 and later on 100000 iterations and no system calls. In my program the popcount version is called first and those numbers were quite stable so I don't think it is a warm-up issue. Maybe the look-up table is being kicked out of the L1 cache too frequently or something? It is an old laptop processor, 1st gen Core i5! I have a couple of other machines I could try it on but they're all quite old now. – Simon Goater May 01 '23 at 13:48
  • @SimonGoater: IDK then, that's odd. The reason I use `perf stat` is to get counts of core clock cycles, not just RDTSC. Also other events like `taskset -c 1 perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,idq.mite_uops,branch-misses,branches ./a.out` (idq.mite_uops is especially useful on Skylake because the JCC erratum can make it fall back to legacy decode instead of the uop cache. Your Nehalem doesn't have a uop cache, just a loop buffer for decoded uops.) – Peter Cordes May 01 '23 at 14:23
  • @SimonGoater: BTW, I finished a pure "bithack" version that might be what the OP was expecting / hoping for. I posted it as another answer, if you're interested in having a look. (Or benchmarking it if you already have a test loop set up. Although that's probably benchmarking throughput, but if you're doing nothing but this in a loop, SIMD would be better. So actual use-cases with this in a loop would probably be overlapping this work with other surrounding code in a loop body, not itself. Or would care about latency.) – Peter Cordes May 03 '23 at 02:40
  • Okay, the results are in! Firstly, I'd like to point out that you were right about the CPU not getting up to full clock speed with only 100000 or fewer iterations, so I did like you suggested and increased the number of iterations to 100M and started to get stable results. Any real implementation would probably do some other work between calls, which I think would favour the bithacks over those that use look-up tables, but there are now six functions in my test program which gave the following approximate average number of 'cycles' per call. – Simon Goater May 03 '23 at 10:15
  • In the following order, HWpopcount64, MyLUT, SSE, bithack0, bithack1, bithack2, the results on my laptop CPU (1st gen Core i5) are 12, 21, 10, 26, 25, 28. – Simon Goater May 03 '23 at 10:20
  • Actually, making the bithack static inline functions changed the final three results to 12, 20, 12. – Simon Goater May 03 '23 at 10:27
  • @SimonGoater: can you post your test framework somewhere, e.g. a https://godbolt.org/ shortlink? I'm curious to try on Skylake if I don't have to bother writing my own caller and test loops (and also to have comparable numbers to your CPU). Nehalem, without a uop cache, is highly susceptible to front-end bottlenecks, especially with long instructions like 10-byte `mov r64, imm64` which these functions need for 64-bit constants. (This may be why inlining helped so much, as masks could be hoisted out of a loop while still keeping the real work.) – Peter Cordes May 03 '23 at 12:57
  • Here's the code I was using. https://github.com/FastAsChuff/Weighted-Popcount-Benchmarks/blob/main/wpopcnt.c I'd be interested to see your Skylake results. – Simon Goater May 03 '23 at 13:14
  • @SimonGoater: I sent a pull request with my Skylake test results (after improvements to the microbenchmarks). Without inlining to allow hoisting the constants or auto-vectorization, the SSSE3 version (compiled with AVX enabled) is actually fastest! The scalar popcount version benefits a lot from not having to generate its constants every time. (Keeping them in memory for `and reg, [mem]` would probably be faster.) For non-AVX code-gen where the SSSE3 version might need a couple `movdqa` register-copy instructions, the `-march=core2 -mtune=skylake` version is still about as fast. – Peter Cordes May 03 '23 at 15:46

TL;DR: a classic bithack, using two multiplier constants: one for a flat sum of the weights-within-bytes, and one for a weighted sum of the flat counts-within-bytes, applying the right weight to each 8-bit group.

This is still slower on a machine with fast hardware popcount. It appears my SSSE3 version for a single number is fastest, especially without inlining and hoisting the constants out of the loop. (And without auto-vectorizing a bithack over multiple numbers, which is quite effective with AVX2.) Still, this is vastly better than naively calling a bithack popcount 6 times: only about twice the throughput cost of one flat popcount done with bithacks; see the instruction-count comparison below.


If you have hardware support for popcount, like the Intel / MSVC __popcnt64 intrinsic you show in your example, one x86 popcnt instruction costs the same as a scalar integer multiply on Intel (1/clock throughput, 3 cycle latency). Or the same as an integer add on Zen (4/clock throughput with 1 cycle latency on Zen 3 and later, before that 3/clock - https://uops.info). So most of the cost is in the masks (10-byte mov r64, imm64), and applying them and adding the results (with lea to shift-and-add in a single cheap instruction.) So probably 5x andn + 1x shr to feed 6x popcnt, and 5x lea to accumulate the results.

Anything you do with only traditional bit-manipulation operations (shifts, bitwise booleans, add/sub) is going to be sub-optimal on modern x86, or any ISA with hardware popcount or SIMD that's usable without big stalls to get data between it and GP integer domains. So maybe modern RISC-V without extension B's cpop instruction, or backwards-compatible x86-64 code without SSSE3 or popcnt.

Using pure bithacks instead of hardware popcount, you're right that there's some scope for overlapping work between the popcounts, instead of starting each one from scratch with a different input. I'm posting this answer partly since you seemed to be curious about doing this with purely basic bit-manipulation stuff. Maybe there's room for more optimization, but it doesn't look good compared to hardware popcnt.


Consider the standard SWAR bithack as described in Count the number of set bits in a 32-bit integer and https://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel

// standard SWAR popcount for a uint64_t n
static inline unsigned popcount64_swar(uint64_t n)
{
     uint64_t ones = (n >> 1) & 0x5555555555555555;   // 0xaa>>1
        // ones is actually just the high half of each pair, the <<0 weighted bits
     uint64_t pairs = n - ones;   // optimization over (n&0x555...) + ones
     uint64_t quads = (pairs & 0x3333333333333333) + ((pairs >> 2) & 0x3333333333333333);
     uint64_t octs = (quads + (quads >> 4)) & 0x0F0F0F0F0F0F0F0F;  // fields are wide enough relative to their values that we can defer masking until after adding.
     return (octs * 0x0101010101010101uLL) >> (64-8);     // horizontal sum of bytes
}

 // weighted might need another shift/add step to widen to 16-bit before multiply,
 // since the weighted-sum result won't fit in a byte.

There are obvious similarities to the masking you're doing to feed each separate popcount. With the temporaries involving popcount within 2-bit, 4-bit, and 8-bit chunks of the input, we can accumulate weighted sums alongside the flat sums. At each step, we can add pairs of weighted elements like we're doing for the flat elements, and add in the next higher weight.

Since the narrowest elements have the smallest weight, it turns out the elements are always wide enough at every step to hold the weighted sum without overflow. For example, a 4-bit group with count(n & 0xa) + 2*count(n & 0xc) can have value 0..6, which fits in 3 bits so the high bit of each group is always zero. (This lets later steps defer masking until after adding, without carry-out corrupting the next group.)

At each step, we can shift and add the previous flat sums to get weighted sums within 4-bit groups, for example. The chain of flat sums is also needed: the 0xaa and 0xcc masks discard the low bit of each pair, but those bits still get counted by the wider masks (0xF0F0..., 0xFF00..., and so on), except for bit 0, which has weight zero and isn't part of any mask at all.

Fortunately the shifting/masking to reduce the weighted sums to wider elements can use the same masks as the flat steps, saving instructions (especially for the case where this can't inline inside a loop, although you'd want to use SIMD there anyway.) It's the same shift/mask/add (or shift/add/mask) as for the flat popcount.

Unlike flat sums, applying weights along the way leads to the values getting quite large, so we do have to keep doing some masking at each step. And it limits the options for summing with a multiplier constant like (x * 0x01010101...)>>(64-8), since that only works when the sum fits in the top chunk the size of 1 element. It also has to continue past that point of 8-bit chunks because we need weights for 16-bit and 32-bit chunks.
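
A quick worked example of why that multiply-and-shift horizontal sum works (my illustration, not from the original bithack sources): multiplying by 0x0101...01 adds the value to itself shifted by every multiple of 8 bits, so the top byte of the product collects the sum of all eight bytes, provided that sum fits in 8 bits.

#include <stdint.h>
#include <assert.h>

int main(void)
{
    uint64_t x = 0x0102030405060708ULL;                     // bytes sum to 1+2+...+8 = 36
    uint64_t hsum = (x * 0x0101010101010101ULL) >> (64-8);  // top byte of the product
    assert(hsum == 36);   // exact as long as the byte sum (and every partial sum) is <= 255
    return 0;
}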

Working version using just plain C integer bit-manipulation

#include <stdint.h>
// value-range 0 to 2016 fits in a 16-bit integer, so masking wider than that can be deferred
//inline
unsigned A073642_bithack(uint64_t n)
{
    uint64_t ones = (n >> 1) & 0x5555555555555555;   // 0xaa>>1
    uint64_t pairs = n - ones;  // standard popcount within pairs, not weighted

    uint64_t weighted_pairs = ((ones + (ones>>2)) & 0x3333333333333333)
        + ((pairs >> 2) & 0x3333333333333333) * 2;   // 0..6 range within 4-bit nibbles
             // aka  (pairs>>1) & (0x3333333333333333<<1).  But x86-64 can efficiently use LEA to shift-and-add in one insn so it's better to shift to match the same mask.
    uint64_t quads = (pairs & 0x3333333333333333) + ((pairs >> 2) & 0x3333333333333333);   // standard popcount within each nibble, 0..4
     
    // reduce weighted pairs (0xaa and 0xcc masks) and add __popcnt64(n & 0xF0F0F0F0F0F0F0F0) << 2)
    // resulting in 8 buckets of weighted sums, each a byte wide
    uint64_t weighted_quads = ((weighted_pairs + (weighted_pairs >> 4)) & 0x0F0F0F0F0F0F0F0F) 
              + (4*((quads >> 4) & 0x0F0F0F0F0F0F0F0F));  // 0 to 2*6 + 4*4 = 28 value-range
                // need some masking before adding.  4*quads can be up to 16 so can't use (quads >> (4-2)) & 0x0F0F0F0F0F0F0F0F
    uint64_t octs = (quads + (quads >> 4)) & 0x0F0F0F0F0F0F0F0F;  //  0..8   quad fields don't have their high bit set, so we can defer masking until after adding without overflow into next field

    // two separate sums of 8-bit groups, one of the first 3 weights
    unsigned sum_weighted_quads = (weighted_quads * 0x0101010101010101uLL)>>(64-8); // barely fits in 1 byte: max val 28 * 8 = 224
    // the second applying the last 3 relative weights to the flat sums
    unsigned sum_octweights = (octs * 0x0001020304050607uLL) >> (64-8);  // separate weight for each byte group, with <<3 factored out so the max sum is also 224 = 32*(1<<0 + 1<<1 + 1<<2)
    return sum_weighted_quads + 8 * sum_octweights; // apply the <<3 factored out of byte weights
}
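
A quick correctness check against a straightforward reference loop (my test scaffolding, not the Godbolt test harness linked below):

#include <stdint.h>
#include <stdio.h>

// Reference: sum the indices of set bits the obvious way (bit 0 contributes 0).
static unsigned A073642_ref(uint64_t n)
{
    unsigned sum = 0;
    for (int i = 0; i < 64; i++)
        if ((n >> i) & 1)
            sum += i;
    return sum;
}

int main(void)
{
    uint64_t n = 0;
    for (int i = 0; i < 1000000; i++) {
        if (A073642_bithack(n) != A073642_ref(n)) {
            printf("mismatch for 0x%016llx\n", (unsigned long long)n);
            return 1;
        }
        n = n * 6364136223846793005ULL + 1442695040888963407ULL;  // LCG to generate test inputs
    }
    puts("all tests passed");
    return 0;
}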

I finally came up with a good cleanup strategy for octs and weighted_quads, with just 2 multipliers. An earlier version of that which widened to 16-bit chunks is in the revision history. The Godbolt link below has two #if alternate cleanups that keep widening with shifts / masks, perhaps good on a machine with quite slow multiply.

AArch64 can do this efficiently (27 instructions vs. 36 for x86-64) with its repeated-bit-pattern immediates for bitwise boolean operations allowing and-immediate for those 64-bit repeating constants. And a shifted source operand for some instructions like add x1, x1, x1, lsr 4. (On AArch64, the versions with more shift/add and fewer multiplies didn't cost many more instructions. But the change that introduced it saved 8 x86-64 instructions vs. saving 4 on AArch64.)
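
For example, the two idioms that help (a two-instruction illustration, not the full 27-instruction routine):

// AArch64: repeating bit-patterns are valid logical immediates, and one source operand can be shifted for free
and   x1, x0, #0x5555555555555555    // no separate constant load needed
add   x2, x2, x2, lsr #4             // x2 += x2 >> 4 in a single instruction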

BMI1 andn will help on x86-64, allowing a copy-and-AND instead of two separate instructions (mov / and). Put ~0x3333... in a register and use that constant as the first (inverted) operand, so you're actually ANDing with 0x3333... like we need to.

BMI2 rorx is a handy copy-and-shift if you're going to mask the result anyway. Normally you'd never use this SWAR bithack for popcount if BMI1/BMI2 were available, because that would imply popcnt was available. But here it's not a straight popcount so we can take full advantage. e.g. the first few steps would look like this:
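
A sketch of those first steps (reconstructed from the description here, with arbitrary register choices; the bits that rorx wraps around to the top land in positions the masks clear anyway):

movabs  rcx, 0x5555555555555555
movabs  rsi, 0xCCCCCCCCCCCCCCCC    # ~0x3333... , inverted for use with andn
rorx    rdx, rdi, 1                # copy-and-shift; bit 0 wraps to bit 63...
and     rdx, rcx                   #   ...which the mask clears, same as (n >> 1) & 0x5555...
sub     rdi, rdx                   # pairs = n - ones   (ones stays in rdx for the weighted steps)
rorx    rcx, rdi, 2                # copy-and-shift pairs (the 0x5555 mask isn't needed again)
andn    rax, rsi, rdi              # pairs & 0x3333...
andn    rcx, rsi, rcx              # (pairs >> 2) & 0x3333...  (the 2 wrapped bits are masked off)
add     rax, rcx                   # quads
# ... remaining steps as in the C version above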

Unfortunately neither GCC nor clang use these tricks, so their asm output looks like: (Godbolt with test harness)

# GCC
A073642_bithack:
        mov     rcx, rdi
        movabs  rdx, 6148914691236517205   # 0x5555555555555555
        mov     rax, rdi
        movabs  rdi, 3689348814741910323   # 0x3333333333333333
        shr     rcx               # better: rorx rcx, rdi, 1 since we're masking the result
        and     rcx, rdx
        sub     rax, rcx
        mov     rdx, rcx
        mov     rsi, rax
        shr     rdx, 2
        and     rax, rdi          # copy-and-AND with `andn` with an inverted mask could avoid the mov rsi, rax to preserve a copy.
        add     rdx, rcx
        shr     rsi, 2
        and     rsi, rdi
        and     rdx, rdi
        lea     r8, [rdx+rsi*2]
        add     rax, rsi
        movabs  rsi, 1085102592571150095  # 0x0F0F0F0F0F0F0F0F
        mov     rdx, rax
        mov     rcx, r8
        shr     rdx, 4
        shr     rcx, 4
        add     rcx, r8
        mov     rdi, rdx
        add     rdx, rax
        movabs  rax, 283686952306183      # 0x0001020304050607
        and     rdi, rsi
        and     rcx, rsi
        and     rdx, rsi
        lea     rcx, [rcx+rdi*4]
        imul    rdx, rax
        movabs  rdi, 72340172838076673    # 0x0101010101010101
        imul    rcx, rdi
        shr     rdx, 56
        shr     rcx, 56
        lea     eax, [rcx+rdx*8]
        ret

This is 36 instructions not counting the ret (GCC for x86-64), vs. 19 for a classic flat popcount like you might use without HW popcount (or like GCC uses for __builtin_popcountll without -mpopcnt or -march=native). 5 of the 36 instructions are movabs r64, imm64, which can hoist out of a loop if we're doing this repeatedly. (But ideally use SIMD for that.)

Using (octs * 0x0001020304050607uLL) >> (64-8) to do a weighted sum of eight 8-bit chunks saved many shifting and masking instructions.

The alternate cleanup versions (included in the Godbolt link) are about 51 or about 56 instructions, doing more shift / mask / add steps with 0 or 1 multiply instead of 2. I haven't thrown this into uiCA, LLVM-MCA, or IACA to analyze for throughput, but these are all single-uop instructions and a lot of them can run on any port. There are a lot of shifts, though, so we might have p0/p6 bottlenecks on Intel.

By comparison, a classic bithack popcount (-mno-popcnt to defeat optimizing this idiom into the HW instruction) compiles to 19 x86-64 instructions including four mov r64, imm64 (3 masks and a multiplier). A073642_popcnt (your function) using hardware popcnt is 26 instructions (including the mask setup).

So it's not great, probably also worse than my SSSE3 SIMD answer even for a single input.

Clang does worse with some versions, inventing another constant to mask with (0x1C1C1C1C1C1C1C1C from "optimizing" 4*((quads >> 4) & 0x0F0F0F0F0F0F0F0F) into ((quads >> 2) & (0x0F0F0F0F0F0F0F0F<<2)) instead of using the 0x0F mask we already needed, and using LEA to shift-and-add for nearly the same cost as add.) This is bad on x86-64 where it takes a 10-byte instruction to put it in a register; unlike AArch64 where it can get used as an immediate for an and instruction. Clang also sometimes invented 0x1111... = 0x5555 & 0x3333... when I wrote an early step a different way, masking ones and ones>>2 each separately with 0x3333 before adding.

SIMD shifts would make some of the masking unnecessary, and make hsumming easier. e.g. vpsrlw xmm1, xmm0, 8 would implicitly mask with 0x00ff00ff00ff00ff since bits get discarded at the bottom of each 16-bit element. It would also make horizontal sum of bytes without overflow cheap ([v]psadbw against zero), letting us add two things before one hsum, instead of needing to keep them separate to avoid the sums overflowing a byte. [v]pmaddubsw could apply separate weights to each byte before that, without overflowing the total if we factor out *8 from it like we do here.


PS: weighted sum within a 4-bit nibble looks like this:

0000 : 0 + 0
0001 : 0 + 0
0010 : 1 + 0    = 1
0011 : 1 + 0

0100 : 0 + 1*2  = 2
0101 : 0 + 1*2
0110 : 1 + 1*2  = 3
0111 : 1 + 1*2

1000 : 1 + 1*2  = 3
1001 : 1 + 1*2
1010 : 2 + 1*2  = 4
1011 : 2 + 1*2

1100 : 1 + 2*2  = 5
1101 : 1 + 2*2
1110 : 2 + 2*2  = 6
1111 : 2 + 2*2

Initial design notes:

If we can generate those 0..6 values instead of the usual 0..4 integer values in each 4-bit SWAR accumulator, we can carry on with the rest of the popcount pattern just once to get popcnt(n & 0xAA...AA) + (popcnt(n & 0xCC...CC) << 1).

If we can add in the popcnt(n & 0xF0F0F0F0F0F0F0F0)<<2 weights at some earlier stage, as soon as they'll fit without overflowing, then the later stages of that don't need to be done separately either. 4<<2 is 16, so it won't fit in a nibble but will fit in a byte. (Even when added to 2 * 6 = 12 from the sum of weights from bit-within-nibble place values for two nibbles of all-ones.)

To some degree we actually need to keep things separate because the higher weights need combinations of just the flat popcounts for nibbles, bytes, words, etc. without contributions from the small-granularity weighted sums. Still, it looks like we can use the temporaries from the flat popcnt to feed the weighted sums along the way.

Peter Cordes