If you're doing total += A073642(arr[i])
over a whole array (like 256 bytes, or preferably larger), AVX2 with Harley-Seal would be very good. The bitwise boolean work stays the same cost; only the popcount part gets more expensive (to do the weighting).
If you want separate weighted-popcount results for multiple input numbers (not summing them), you can't use Harley-Seal, but SIMD is still effective. This answer shows a way to use SIMD for just a single number (which may be about equal speed to 6x scalar popcnt), but it wouldn't cost much more to get 2 results for 2 inputs, or with AVX2, 4 results for 4 inputs. You might need 2 different vector constants for the weights of odd vs. even nibbles.
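For reference, the "6x scalar popcnt" version being compared against could look something like the following sketch (using GCC/clang's __builtin_popcountll; the <<2 .. <<5 masks are the ones shown in the code comments further down, and the <<0 / <<1 masks are the 0xAAAA... / 0xCCCC... ones implied by the per-nibble weighting):

#include <stdint.h>

// Scalar reference: weight each set bit by its bit-index,
// computed as 6 masked popcounts with weights 1,2,4,8,16,32.
unsigned weighted_popcnt_scalar(uint64_t n) {
    return  __builtin_popcountll(n & 0xAAAAAAAAAAAAAAAAULL)
         + (__builtin_popcountll(n & 0xCCCCCCCCCCCCCCCCULL) << 1)
         + (__builtin_popcountll(n & 0xF0F0F0F0F0F0F0F0ULL) << 2)
         + (__builtin_popcountll(n & 0xFF00FF00FF00FF00ULL) << 3)
         + (__builtin_popcountll(n & 0xFFFF0000FFFF0000ULL) << 4)
         + (__builtin_popcountll(n & 0xFFFFFFFF00000000ULL) << 5);
}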
The SSSE3 / AVX2 pshufb
parallel-lookup popcount strategy can be adapted. It might not actually be better than scalar for a single input. (Especially not if you have hardware popcnt
support, which has 1/clock throughput on Intel, 3 or 4 per clock on AMD Zen 2 or Zen 3 respectively: https://uops.info/ )
On a CPU with fast SSSE3 pshufb
but without hardware popcnt
(e.g. 2nd-gen Core 2, aka Penryn), this might even be good. All other CPUs either have fast scalar popcnt
or don't have SSSE3, or have it but 128-bit shuffles like pshufb
are slower (like on 1st-gen Core 2, Conroe / Merom).
Instead of doing 6 separate popcounts on 6 AND results, you can do two sets of lookups: one for the sub-nibble masks, giving a result from 0 to 6 within each nibble for the cnt(x & 0xa) + 2*cnt(x & 0xc)
, and another that just counts normally within each nibble. But instead of just adding all nibbles into one popcount for the whole integer, you keep those nibble popcounts separate and combine them in different ways, so you don't have to redo the splitting and lookup part.
#include <stdint.h>
#include <immintrin.h>
// https://github.com/WojciechMula/sse-popcount/blob/master/popcnt-sse-lookup.cpp was a starting point for this code.
uint64_t weighted_popcnt_SSE_lookup(uint64_t n) {
const __m128i flat_lookup = _mm_setr_epi8(
/* 0 */ 0, /* 1 */ 1, /* 2 */ 1, /* 3 */ 2,
/* 4 */ 1, /* 5 */ 2, /* 6 */ 2, /* 7 */ 3,
/* 8 */ 1, /* 9 */ 2, /* a */ 2, /* b */ 3,
/* c */ 2, /* d */ 3, /* e */ 3, /* f */ 4
);
const __m128i weighted_lookup = _mm_setr_epi8(
0, 0, 1, 1, 2, 2, 3, 3,
3, 3, 4, 4, 5, 5, 6, 6
);
const __m128i low_mask = _mm_set1_epi8(0x0f);
// normally we'd split odd/even nibbles from 128 bits of data.
// but we're using only a single 64-bit scalar input.
// __m128i v = _mm_set_epi64x(n>>4, n); // high nibbles in the high half, low nibbles in the low half of a vector.
// // movq xmm0, rdi ; movdqa ; psrlw ; punpcklqdq
__m128i v = _mm_cvtsi64_si128(n); // movq
v = _mm_unpacklo_epi8( v, _mm_srli_epi16(v,4) ); // movdqa ; psrlw ; punpcklbw
// nibbles unpacked in order: byte 3 of v is n>>(3*4) & 0xf
__m128i vmasked = _mm_and_si128(v, low_mask); // pand
__m128i counts01 = _mm_shuffle_epi8(weighted_lookup, vmasked); // 0 to 6 according to popcnt(v & 0xa) + 2*popcnt(v & 0xc) in each nibble
__m128i nibble_counts = _mm_shuffle_epi8(flat_lookup, vmasked); // SSSE3 pshufb
/*
Each nibble has a different weight, so scale each byte differently when or before summing
(__popcnt64(n & 0xF0F0F0F0F0F0F0F0) << 2) +
(__popcnt64(n & 0xFF00FF00FF00FF00) << 3) +
(__popcnt64(n & 0xFFFF0000FFFF0000) << 4) +
(__popcnt64(n & 0xFFFFFFFF00000000) << 5);
^ ^
| \ 2nd nibble counted once
this nibble scaled by <<3 and <<4.
Or <<1 and <<2 if we factor out the <<2 common to all nibble-granularity chunks
*/
// oh duh, these are just the integers from 15 down to 0, each scaled by <<2.
__m128i scale_factors = _mm_set_epi8(0b1111<<2, 0b1110<<2, 0b1101<<2, 0b1100<<2, 0b1011<<2, 0b1010<<2, 0b1001<<2, 0b1000<<2,
0b0111<<2, 0b0110<<2, 0b0101<<2, 0b0100<<2, 0b0011<<2, 0b0010<<2, 0b0001<<2, 0b0000<<2);
// neither input has the MSB set in any byte, so it doesn't matter which one is the unsigned and which is the signed input.
__m128i weighted_nibblecounts = _mm_maddubs_epi16(nibble_counts, scale_factors); // SSSE3 pmaddubsw
// With the <<2 applied here, the max pmaddubsw pair sum is 60*4 + 56*4 = 464, which wouldn't fit in a byte, so we need the 16-bit results. (Without the <<2 it would be 15*4 + 14*4 = 116, which would fit; see the comment after the return.)
// Unfortunately byte multiply is only available with horizontal summing of pairs.
counts01 = _mm_sad_epu8(counts01, _mm_setzero_si128()); // psadbw hsum unsigned bytes into qword halves
weighted_nibblecounts = _mm_madd_epi16(weighted_nibblecounts, _mm_set1_epi32(0x00010001)); // pmaddwd sum pairs of words to 32-bit dwords. 1 uop but worse latency than shift/add
// sum the dword elements and counts01
__m128i high = _mm_shuffle_epi32(weighted_nibblecounts, _MM_SHUFFLE(3,3, 1,1)); // pshufd to copy the odd dwords down
weighted_nibblecounts = _mm_add_epi32(weighted_nibblecounts, counts01); // add to the bottom dword of each qword, in parallel with pshufd latency
weighted_nibblecounts = _mm_add_epi32(weighted_nibblecounts, high);
high = _mm_shuffle_epi32(weighted_nibblecounts, _MM_SHUFFLE(3,2, 3,2)); // pshufd, movhlps, or punpckhqdq
weighted_nibblecounts = _mm_add_epi32(weighted_nibblecounts, high);
unsigned result = _mm_cvtsi128_si32(weighted_nibblecounts); // movd to extract the low dword
return result;
// If we factor out the <<2 (applying it to the scalar total at the end), the max pmaddubsw pair sum is 15*4 + 14*4 = 116, small enough to hsum with psadbw since the high byte of each word stays zero. An interleave order like 15,0, 14,1, etc. would lower the max even further.
// weighted_nibblecounts = _mm_sad_epu8(weighted_nibblecounts, _mm_setzero_si128()); // psadbw
}
Compiled (Godbolt) with clang -O3 -march=penryn
(where it might be the best option):
weighted_popcnt_SSE_lookup: # @weighted_popcnt_SSE_lookup
movq xmm0, rdi
movdqa xmm1, xmm0
psrlw xmm1, 4
punpcklbw xmm0, xmm1 # xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3],xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]
pand xmm0, xmmword ptr [rip + .LCPI0_0]
movdqa xmm1, xmmword ptr [rip + .LCPI0_1] # xmm1 = [0,0,1,1,2,2,3,3,3,3,4,4,5,5,6,6]
movdqa xmm2, xmmword ptr [rip + .LCPI0_2] # xmm2 = [0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4]
pshufb xmm2, xmm0
pmaddubsw xmm2, xmmword ptr [rip + .LCPI0_3]
pshufb xmm1, xmm0
pxor xmm0, xmm0
pmaddwd xmm2, xmmword ptr [rip + .LCPI0_4]
psadbw xmm0, xmm1
pshufd xmm1, xmm2, 245 # xmm1 = xmm2[1,1,3,3]
paddd xmm2, xmm0
paddd xmm2, xmm1
pshufd xmm0, xmm2, 238 # xmm0 = xmm2[2,3,2,3]
paddd xmm0, xmm2
movd eax, xmm0
# 19 single-uop instructions, 17 not counting loading constants
# Or 16 not counting pxor-zeroing, but that gets overwritten without AVX non-destructive destination encodings.
ret
For Conroe, the other CPU with SSSE3 but not popcnt
, you'd choose different shuffles. 64-bit granularity punpcklqdq
is more efficient than punpcklbw
although pshufb
is still multiple uops. e.g. you'd _mm_set_epi64x(n>>4, n)
with movq
/ movdqa
/ psrlw
/ punpcklqdq
so the high nibbles of each pair would be in the high half of the vector, not alternating high/low in place-value order. And in the horizontal sum, you'd use movhlps
into an old XMM register instead of pshufd
to copy-and-shuffle, because 128-bit-wide shuffles with granularity narrower than 64-bit are slower. (See Fastest way to do horizontal SSE vector sum (or other reduction) for details). The Conroe-friendly version would probably not be any worse on Penryn.
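A minimal untested sketch of how that variant's setup could look (same lookup tables as above; low nibbles end up in the low qword and high nibbles in the high qword, so the scale factors are reordered to match; the horizontal sum here is just stores + scalar adds to keep the sketch short, where a real Conroe build would use movhlps / punpckhqdq as described):

#include <stdint.h>
#include <immintrin.h>

// Conroe-friendly nibble split: only 64-bit-granularity shuffles to set up.
uint64_t weighted_popcnt_SSE_lookup_conroe(uint64_t n) {
    const __m128i flat_lookup     = _mm_setr_epi8(0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4);
    const __m128i weighted_lookup = _mm_setr_epi8(0,0,1,1, 2,2,3,3, 3,3,4,4, 5,5,6,6);
    const __m128i low_mask = _mm_set1_epi8(0x0f);
    // byte j   = nibble 2*j   of n (low nibble of byte j):  weight 4*(2*j)
    // byte j+8 = nibble 2*j+1 of n (high nibble of byte j): weight 4*(2*j+1)
    const __m128i scale_factors = _mm_setr_epi8(0,  8, 16, 24, 32, 40, 48, 56,
                                                4, 12, 20, 28, 36, 44, 52, 60);
    __m128i v = _mm_set_epi64x((long long)(n >> 4), (long long)n); // movq / psrlq (or scalar shr) / punpcklqdq
    __m128i vmasked = _mm_and_si128(v, low_mask);
    __m128i counts01      = _mm_shuffle_epi8(weighted_lookup, vmasked); // 0..6 per nibble
    __m128i nibble_counts = _mm_shuffle_epi8(flat_lookup, vmasked);     // 0..4 per nibble
    __m128i weighted = _mm_maddubs_epi16(nibble_counts, scale_factors); // 8 words
    counts01 = _mm_sad_epu8(counts01, _mm_setzero_si128());             // 2 qwords
    uint16_t w[8];  uint64_t c[2];
    _mm_storeu_si128((__m128i*)w, weighted);
    _mm_storeu_si128((__m128i*)c, counts01);
    uint64_t total = c[0] + c[1];
    for (int j = 0; j < 8; j++) total += w[j];
    return total;
}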
TODO: Try psadbw after pmaddubsw, with the <<2 factored out of the scale factors (and applied to the scalar total at the end) so every pair sum stays small enough to treat as bytes, instead of needing to successively widen elements from word to dword. An extra pshufb to put bytes into 15,0, 14,1, 13,2, ... order would shrink the pair sums even further. Probably still have to psadbw the counts01 vector separately, since scaling the whole-nibble counts by <<2 before their psadbw would overflow a byte and lose the ability to use psadbw.
If you were doing this over an array, you'd want to work on 128 bits of input data at once, splitting an input vector into odd/even nibbles rather than loading 64 bits at a time and unpacking. And keep the per-nibble counts separate until they're close to overflowing a byte, like Mula's popcnt_AVX2_lookup which uses vpaddb over a few iterations, deferring vpsadbw until later. (Same technique as in How to count character occurrences using SIMD, except you're adding values from 0..4 or 0..6 instead of 0 or 1.) And don't do the full hsum down to one scalar until after the loop: you want paddd or paddq in the outer loop to accumulate vectors of counts. So again, the same techniques as Mula's popcnt_AVX2_lookup or popcnt_SSE_lookup.
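Here's a rough, untested sketch of what that loop structure could look like with just SSSE3 (the function name and the 16-vector inner block are my own choices; a real AVX2 version would use 256-bit vectors and Mula's unrolling):

#include <stdint.h>
#include <stddef.h>
#include <immintrin.h>

// Sum of the weighted popcount over an array of uint64_t (count assumed even;
// no scalar cleanup loop in this sketch).  Per-nibble counts accumulate in byte
// vectors for up to 16 input vectors before being widened, the same way Mula's
// lookup versions defer vpsadbw.
uint64_t weighted_popcnt_sum_SSE(const uint64_t *arr, size_t count) {
    const __m128i flat_lookup     = _mm_setr_epi8(0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4);
    const __m128i weighted_lookup = _mm_setr_epi8(0,0,1,1, 2,2,3,3, 3,3,4,4, 5,5,6,6);
    const __m128i low_mask = _mm_set1_epi8(0x0f);
    // Low nibble of byte j is nibble 2*j of its qword (weight 4*(2*j) = 8*j);
    // the high nibble is nibble 2*j+1 (weight 8*j + 4).  Same pattern in both qwords.
    const __m128i scales_lo = _mm_setr_epi8(0,8,16,24,32,40,48,56, 0,8,16,24,32,40,48,56);
    const __m128i scales_hi = _mm_setr_epi8(4,12,20,28,36,44,52,60, 4,12,20,28,36,44,52,60);
    const __m128i zero = _mm_setzero_si128();

    __m128i total01 = zero;      // qword accumulator for the within-nibble weights
    __m128i total_scaled = zero; // dword accumulator for the 4*nibble-position weights
                                 // (fine for roughly 10s of MB of input; widen further for huge arrays)
    size_t i = 0;
    while (i < count) {
        // Inner block of up to 16 vectors (32 uint64): byte accumulators can't
        // overflow (acc01 gains at most 12 per iteration, acc_lo/acc_hi at most 4).
        __m128i acc01 = zero, acc_lo = zero, acc_hi = zero;
        size_t block_end = (count - i > 32) ? i + 32 : count;
        for (; i < block_end; i += 2) {
            __m128i v  = _mm_loadu_si128((const __m128i*)&arr[i]);
            __m128i lo = _mm_and_si128(v, low_mask);
            __m128i hi = _mm_and_si128(_mm_srli_epi16(v, 4), low_mask);
            acc01  = _mm_add_epi8(acc01, _mm_shuffle_epi8(weighted_lookup, lo));
            acc01  = _mm_add_epi8(acc01, _mm_shuffle_epi8(weighted_lookup, hi));
            acc_lo = _mm_add_epi8(acc_lo, _mm_shuffle_epi8(flat_lookup, lo));
            acc_hi = _mm_add_epi8(acc_hi, _mm_shuffle_epi8(flat_lookup, hi));
        }
        // Widen once per block instead of once per vector.
        total01 = _mm_add_epi64(total01, _mm_sad_epu8(acc01, zero));
        __m128i w = _mm_add_epi16(_mm_maddubs_epi16(acc_lo, scales_lo),
                                  _mm_maddubs_epi16(acc_hi, scales_hi));
        total_scaled = _mm_add_epi32(total_scaled,
                                     _mm_madd_epi16(w, _mm_set1_epi32(0x00010001)));
    }
    // Final horizontal sums (done the simple way; this runs once).
    uint64_t result = (uint64_t)_mm_cvtsi128_si64(total01)
                    + (uint64_t)_mm_cvtsi128_si64(_mm_unpackhi_epi64(total01, total01));
    uint32_t s[4];
    _mm_storeu_si128((__m128i*)s, total_scaled);
    return result + s[0] + s[1] + s[2] + s[3];
}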
The single-input weighted_popcnt_SSE_lookup above needed 5 vector constants, vs. the 6x popcnt version needing 6 scalar integer mask constants in registers. Actually also just 5, since the final mask can instead be shr rax, 32 / popcnt eax, eax. It should be possible to chain LEAs to shift-and-add popcnt results, like lea eax, [rax*2 + rdx], like Horner's Rule for polynomials. e.g. the <<0/<<1, <<2/<<3, and <<4/<<5 counts can each combine that way; then two more LEAs with a *4 scale factor (like lea eax, [rdx + rax*4]) combine those pairs.
With masks already in registers (already inverted for use with andn
to copy-and-AND, if BMI1 is available), a scalar popcnt version really doesn't take many instructions, assuming you have hardware popcnt
. Probably just 5x andn
, 1x shr
, 6x popcnt
, and 5x lea
. So that's 17 instructions, the same number of front-end uops as the SIMD version.
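In C, that Horner-style pairing could look something like the sketch below (hypothetical function name; whether the compiler actually emits 5x andn / 1x shr / 6x popcnt / 5x lea depends on the compiler and -march):

#include <stdint.h>

// Each cX + 2*cY and the scale-by-4 combines map naturally onto LEAs.
// The top mask is replaced by a plain shift (the shr rax,32 trick above).
unsigned weighted_popcnt_scalar_horner(uint64_t n) {
    unsigned c0 = __builtin_popcountll(n & 0xAAAAAAAAAAAAAAAAULL);  // weight <<0
    unsigned c1 = __builtin_popcountll(n & 0xCCCCCCCCCCCCCCCCULL);  // weight <<1
    unsigned c2 = __builtin_popcountll(n & 0xF0F0F0F0F0F0F0F0ULL);  // weight <<2
    unsigned c3 = __builtin_popcountll(n & 0xFF00FF00FF00FF00ULL);  // weight <<3
    unsigned c4 = __builtin_popcountll(n & 0xFFFF0000FFFF0000ULL);  // weight <<4
    unsigned c5 = __builtin_popcountll(n >> 32);                    // weight <<5, no mask constant
    unsigned pair01 = c0 + 2*c1;   // represents weight <<0
    unsigned pair23 = c2 + 2*c3;   // represents weight <<2
    unsigned pair45 = c4 + 2*c5;   // represents weight <<4
    return pair01 + 4*(pair23 + 4*pair45);   // two more scale-by-4 LEAs
}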
Footnote 1: SIMD popcount and Harley-Seal
See http://0x80.pl/articles/sse-popcount.html / https://github.com/WojciechMula/sse-popcount for details of the strategy, which they call sse-lookup or avx2-lookup, along with others such as harley-seal, which accumulates counts in a carry-save form across vectors of input data to amortize the cost of the popcount step.
If you're summing this weighted-popcount across multiple inputs, perhaps Harley-Seal could be even more useful than for normal popcnt. It should be a nearly drop-in replacement since Harley-Seal already works by keeping counts for each bit-position separate in its temporaries. The vertical bitwise operations work like a 4-bit binary carry-save adder, with 1 bit in each vector.
So actually 128 or 256 of these counters in parallel, a separate one for each bit position. Inside the loop you do total += 16 * popcount(sixteens);
after processing 16 vectors of input, keeping the ones, twos, fours, and eights vectors rolling across iterations. See https://github.com/WojciechMula/sse-popcount/blob/master/popcnt-sse-harley-seal.cpp
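For reference, the carry-save-adder step those vertical bitwise operations are built from looks like this (mirroring the CSA structure used in Mula's Harley-Seal code, written here as a standalone SSE2 helper):

#include <immintrin.h>

// One carry-save adder (CSA) step: for each bit position, sum = low bit and
// carry = high bit of a+b+c.  Harley-Seal chains these to build the ones/twos/
// fours/eights/sixteens vectors; only the popcount applied to "sixteens" would
// need to become a weighted popcount.
static inline void csa_128(__m128i a, __m128i b, __m128i c,
                           __m128i *sum, __m128i *carry) {
    __m128i u = _mm_xor_si128(a, b);
    *sum   = _mm_xor_si128(u, c);                    // a ^ b ^ c
    *carry = _mm_or_si128(_mm_and_si128(a, b),       // majority(a, b, c)
                          _mm_and_si128(u, c));
}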
A weighted sum only has to change the actual popcount
step to weighted_popcount
; the bitwise operations stay the same cost. With AVX2 and normal popcount on Skylake, Harley-Seal is only a bit faster than vpshufb
lookup over decent-sized arrays (like 2048 bytes). On Intel CPUs, AVX2 lookup and Harley-Seal are almost twice as fast as scalar popcnt
. That might not be the case for AMD Zen where scalar popcnt
has 3 to 4 per clock throughput (and loads have 3/clock throughput on Zen 3 and later). (https://uops.info/)
Making the popcount step more complicated (and slower) only affects a small part of the cost of Harley-Seal, but almost all of the work in a pure lookup version that doesn't use Harley-Seal, so Harley-Seal becomes relatively even more attractive here.
Other possible ideas:
If you were summing over many input n
values, positional popcount could set you up to apply the weights at the end. https://github.com/lemire/FastFlagStats and https://github.com/mklarqvist/positional-popcount
But it's probably more work than necessary to keep all 64 counts fully separate until the end; better to only partially apply those techniques and fold in your weights periodically. I expect Harley-Seal is the best bet.
If AVX-512 were available (especially Ice Lake's AVX512VPOPCNTDQ and AVX512BITALG), you could use vpopcntb instead of the vpshufb lookups, using similar ideas to what I did for SSSE3.
Clang can optimize the 6x popcnt
version into asm that does two scalar popcounts, and one AVX2 popcount of the input broadcasted and masked 4 different ways, in different elements of a YMM vector. In general it can vectorize sums of popcounts, including this case of the sum of 6 popcounts.