9

I have a long chunk of memory, say, 256 KiB or longer. I want to count the number of 1 bits in this entire chunk, or in other words: Add up the "population count" values for all bytes.

I know that AVX-512 has a VPOPCNTDQ instruction which counts the number of 1 bits in each consecutive 64 bits within a 512-bit vector, and IIANM it should be possible to issue one of these every cycle (if an appropriate SIMD vector register is available) - but I don't have any experience writing SIMD code (I'm more of a GPU guy). Also, I'm not 100% sure about compiler support for AVX-512 targets.

On most CPUs, AVX-512 is still not (fully) supported, but AVX-2 is widely available. I haven't been able to find a less-than-512-bit vectorized instruction similar to VPOPCNTDQ, so even in theory I'm not sure how to count bits quickly on AVX-2-capable CPUs; maybe something like this exists and I just missed it?

Anyway, I'd appreciate a short C/C++ function - either using some intrinsics-wrapper library or inline assembly - for each of the two instruction sets. The signature is

uint64_t count_bits(void* ptr, size_t size);


einpoklum
  • As far as I know, there is not yet any silicon out there that supports `vpopcntdq`. Neither AVX2 nor SSE has a similar instruction, though a scalar `popcnt` exists. Refer to [this article](http://0x80.pl/articles/sse-popcount.html) for some more ideas. – fuz Apr 28 '18 at 22:06
  • @fuz: Knight's Mill supposedly has it already, and those have been out since last year. – einpoklum Apr 28 '18 at 22:18
  • Oh yeah, I totally forgot about these. Not something you are going to see in the wild though. – fuz Apr 28 '18 at 22:38
  • I can confirm that the `avx2-lookup` method from the [article](http://0x80.pl/articles/sse-popcount.html) is the most efficient for buffers of sizes in the range 64 KB - 512 MB. You can even do much better by partitioning the array across multiple threads and then adding up all the local popcounts. – Hadi Brais Apr 29 '18 at 00:06
  • The `avx2-lookup` method is the best even on a single core. – Hadi Brais Apr 29 '18 at 00:10
  • related: [Bit popcount for large buffer, assembly preferred](https://stackoverflow.com/q/3693981/995714) – phuclv Apr 29 '18 at 12:07
  • @LưuVĩnhPhúc: Thanks for the link to how things looked in 2010/2011... – einpoklum Apr 29 '18 at 12:13

2 Answers

6

AVX-2

@HadiBrais' comment links to an article on fast population-count with SSSE3, by Wojciech Muła; the article links to this GitHub repository; and the repository has the AVX-2 implementation listed below (after a short scalar sketch of the underlying idea). It's based on a vectorized lookup instruction (`_mm256_shuffle_epi8`), using a 16-entry lookup table of nibble bit counts.
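
To see what that table does, here's a plain scalar sketch of the nibble-splitting idea (for exposition only; it's not part of the repository's code). In the vectorized version, `_mm256_shuffle_epi8` performs 32 of these table lookups with a single instruction.

#include <cstdint>

// Bit counts of the 16 possible nibble values: the same constants that the
// `lookup` vector below is loaded with.
static const uint8_t nibble_popcount[16] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
};

// Popcount of one byte = popcount(low nibble) + popcount(high nibble).
// Example: 0xB6 -> nibble_popcount[0x6] + nibble_popcount[0xB] = 2 + 3 = 5.
static inline unsigned byte_popcount(uint8_t b) {
    return nibble_popcount[b & 0x0f] + nibble_popcount[b >> 4];
}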

#   include <cstdint>
#   include <cstddef>
#   include <immintrin.h>
#   include <x86intrin.h>

std::uint64_t popcnt_AVX2_lookup(const uint8_t* data, const size_t n) {

    size_t i = 0;

    const __m256i lookup = _mm256_setr_epi8(
        /* 0 */ 0, /* 1 */ 1, /* 2 */ 1, /* 3 */ 2,
        /* 4 */ 1, /* 5 */ 2, /* 6 */ 2, /* 7 */ 3,
        /* 8 */ 1, /* 9 */ 2, /* a */ 2, /* b */ 3,
        /* c */ 2, /* d */ 3, /* e */ 3, /* f */ 4,

        /* 0 */ 0, /* 1 */ 1, /* 2 */ 1, /* 3 */ 2,
        /* 4 */ 1, /* 5 */ 2, /* 6 */ 2, /* 7 */ 3,
        /* 8 */ 1, /* 9 */ 2, /* a */ 2, /* b */ 3,
        /* c */ 2, /* d */ 3, /* e */ 3, /* f */ 4
    );

    const __m256i low_mask = _mm256_set1_epi8(0x0f);

    __m256i acc = _mm256_setzero_si256();

#define ITER { \
        const __m256i vec = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(data + i)); \
        const __m256i lo  = _mm256_and_si256(vec, low_mask); \
        const __m256i hi  = _mm256_and_si256(_mm256_srli_epi16(vec, 4), low_mask); \
        const __m256i popcnt1 = _mm256_shuffle_epi8(lookup, lo); \
        const __m256i popcnt2 = _mm256_shuffle_epi8(lookup, hi); \
        local = _mm256_add_epi8(local, popcnt1); \
        local = _mm256_add_epi8(local, popcnt2); \
        i += 32; \
    }

    while (i + 8*32 <= n) {
        __m256i local = _mm256_setzero_si256();
        ITER ITER ITER ITER
        ITER ITER ITER ITER
        acc = _mm256_add_epi64(acc, _mm256_sad_epu8(local, _mm256_setzero_si256()));
    }

    __m256i local = _mm256_setzero_si256();

    while (i + 32 <= n) {
        ITER;
    }

    acc = _mm256_add_epi64(acc, _mm256_sad_epu8(local, _mm256_setzero_si256()));

#undef ITER

    uint64_t result = 0;

    result += static_cast<uint64_t>(_mm256_extract_epi64(acc, 0));
    result += static_cast<uint64_t>(_mm256_extract_epi64(acc, 1));
    result += static_cast<uint64_t>(_mm256_extract_epi64(acc, 2));
    result += static_cast<uint64_t>(_mm256_extract_epi64(acc, 3));

    for (/**/; i < n; i++) {
        result += lookup8bit[data[i]]; // lookup8bit: a 256-entry byte-popcount table, defined elsewhere in the repository (see below)
    }

    return result;
}
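
For completeness, wrapping this to match the signature requested in the question is just a cast (a trivial sketch; `count_bits` here simply forwards to the function above):

uint64_t count_bits(void* ptr, size_t size) {
    return popcnt_AVX2_lookup(static_cast<const uint8_t*>(ptr), size);
}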

AVX-512

The same repository also has a VPOPCNT-based AVX-512 implementation. Before listing the code, here's a simplified, more readable pseudocode version:

  • For every consecutive sequence of 64 bytes:

    • Load the sequence into a SIMD register with 64x8 = 512 bits
    • Perform 8 parallel population counts of 64 bits each on that register
    • Add the 8 population-count results in parallel, into an "accumulator" register holding 8 sums
  • Sum up the 8 values in the accumulator

  • If there's a tail of less than 64 bytes, count the bits there in some simpler way

  • Return the main sum plus the tail sum

And now for the real deal:

#   include <cstdint>
#   include <cstddef>
#   include <immintrin.h>
#   include <x86intrin.h>

uint64_t avx512_vpopcnt(const uint8_t* data, const size_t size) {
    
    const size_t chunks = size / 64;

    const uint8_t* ptr = data;
    const uint8_t* end = ptr + size;

    // count using AVX512 registers
    __m512i accumulator = _mm512_setzero_si512();
    for (size_t i=0; i < chunks; i++, ptr += 64) {
        
        // Note: a short chain of dependencies, likely unrolling will be needed.
        const __m512i v = _mm512_loadu_si512((const __m512i*)ptr);
        const __m512i p = _mm512_popcnt_epi64(v);

        accumulator = _mm512_add_epi64(accumulator, p);
    }

    // horizontal sum of a register
    uint64_t tmp[8] __attribute__((aligned(64)));
    _mm512_store_si512((__m512i*)tmp, accumulator);

    uint64_t total = 0;
    for (size_t i=0; i < 8; i++) {
        total += tmp[i];
    }

    // popcount the tail
    while (ptr + 8 < end) {
        total += _mm_popcnt_u64(*reinterpret_cast<const uint64_t*>(ptr));
        ptr += 8;
    }

    while (ptr < end) {
        total += lookup8bit[*ptr++];
    }

    return total;
}

`lookup8bit` is a popcount lookup table indexed by byte value (rather than by bit), and is defined in the repository. Edit: as commenters note, using an 8-bit lookup table for the tail is not a very good idea and can be improved on.
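
If you don't want to copy the repository's hard-coded table, an equivalent can be generated at startup. The sketch below assumes the table is simply a 256-entry array of per-byte popcounts, which is how the code above indexes it; call the init function once before using either popcount routine.

#include <cstdint>

// Assumed-equivalent definition of lookup8bit: the popcount of every possible
// byte value (the repository hard-codes these 256 values instead).
static uint8_t lookup8bit[256];

static void init_lookup8bit() {
    for (int i = 0; i < 256; i++) {
        uint8_t count = 0;
        for (int bits = i; bits != 0; bits >>= 1) {
            count += bits & 1;
        }
        lookup8bit[i] = count;
    }
}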

einpoklum
  • I didn't experiment with `avx512-harley-seal` or `avx512-vpopcnt`, so I don't know whether they are faster. – Hadi Brais Apr 29 '18 at 00:30
  • I think you have provided a link to `avx2-harley-seal` instead of `avx2-lookup` by mistake. – Hadi Brais Apr 29 '18 at 00:30
  • @HadiBrais: You're right; but - you could just edit the link :-P – einpoklum Apr 29 '18 at 00:36
  • I expect `avx512-vpopcnt` to be faster for large buffer sizes, but we need an [Ice Lake](https://en.wikipedia.org/wiki/Ice_Lake_(microarchitecture)) or [Knights Mill](https://en.wikipedia.org/wiki/Xeon_Phi#Knights_Mill) processor to experiment. – Hadi Brais Apr 29 '18 at 00:43
  • @HadiBrais AFAIK, the Harley-Seal version is only good on AVX512 CPUs without `vpopcnt` but with AVX512BW. i.e. Skylake-AVX512 but not KNL. See my answer on [Large (0,1) matrix multiplication using bitwise AND and popcount instead of actual int or float multiplies?](//stackoverflow.com/q/45376006). If IceLake only runs VPOPCNT on one port, then possibly a hybrid approach could keep both vector ALUs busy, but assuming VPOPCNT is a single-uop instruction, it definitely wins. Harley-Seal takes 30x VPTERNLOGD + 1 vector popcnt for 16x ZMM vectors. – Peter Cordes Apr 29 '18 at 01:42
  • @PeterCordes: Wouldn't you use the other port for the addition of the previous counts to the running sums? – einpoklum Apr 29 '18 at 01:47
  • @einpoklum: oh, derp, yes. You need one `vpaddq` for every `vpopcntq`, so that keeps both ports busy, assuming Ice Lake shuts down port 1 when any 512-bit uops are in the scheduler, like SKX does. – Peter Cordes Apr 29 '18 at 01:48
  • I think it's very likely that any Intel CPUs with `vpopcntq` will run it as a single uop, otherwise it's barely worth it. (And it's not part of a big group of other instructions, so CPUs that don't want to spend the transistors can just leave out that feature bit). If AMD builds an AVX512 CPU with 256b execution units, it might be 2 uops but so would vpternlogd. Haswell runs `vdivps ymm` as 3 uops, on a half-width divider + a merge uop I guess. So it's unlikely Intel could run it as 2 uops. Even if it was 2, Harley-Seal would still be a tiny win for lots of vectors, and a bigger win otherwise. – Peter Cordes Apr 29 '18 at 02:03
  • The `while (ptr < end)` single-byte loop doesn't look like a good cleanup strategy. Much better: unaligned load of the last 8 bytes of the buffer, then right-shift by 8 * the number of already-counted bytes, and popcnt that. i.e. `total += _mm_popcnt_u64( reinterpret_cast<const uint64_t*>(end)[-1] >> (8*(8 - (end-ptr))) );`. (Or more sophisticated logic to avoid page-crossing for a very small input at the start of a page). A 256-entry LUT you use only a couple times at the end is likely to cache-miss. – Peter Cordes Apr 29 '18 at 02:11
  • @PeterCordes: Consider making that another answer - or editing this suggestion into mine, as an extra section. – einpoklum Apr 29 '18 at 08:20
3

Wojciech Muła's big-array popcnt functions look optimal except for the scalar cleanup loops. (See @einpoklum's answer for details on the main loops).

A 256-entry LUT that you use only a couple of times at the end is likely to cache-miss, and isn't optimal for more than 1 byte even if the cache were hot. I believe all AVX2 CPUs have hardware `popcnt`, and we can easily isolate the last up-to-8 bytes that haven't been counted yet, so a single `popcnt` finishes the job.

As usual with SIMD algorithms, it often works well to do a full-width load that ends at the last byte of the buffer. But unlike with a vector register, variable-count shifts of the full integer register are cheap (especially with BMI2). Popcnt doesn't care where the bits are, so we can just use a shift instead of needing to construct an AND mask or whatever.

// untested
// ptr points at the first byte that hasn't been counted yet;
// load the last 8 bytes of the buffer and shift out the low bytes that were already counted
uint64_t final_bytes = reinterpret_cast<const uint64_t*>(end)[-1] >> (8*(8 - (end-ptr)));
total += _mm_popcnt_u64( final_bytes );
// Careful, this could read outside a very small buffer (shorter than 8 bytes).

Or even better, use more sophisticated logic that avoids crossing into a page the buffer doesn't occupy; a 6-byte buffer at the start of a page is the problem case, for example.
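
Here's an untested sketch of what that logic could look like (the function name and the `memcpy`-based load are my own choices, not from the linked repository): the wide load is only used when the buffer itself is at least 8 bytes long, so the load can never leave the buffer, and tiny buffers fall back to a per-byte loop.

#include <immintrin.h>
#include <cstdint>
#include <cstddef>
#include <cstring>

// Untested sketch of a page-safe tail handler (little-endian x86 assumed).
// 'data' is the start of the buffer, 'ptr' the first byte not yet counted,
// 'end' one past the last byte; precondition: end - ptr <= 8.
static uint64_t tail_popcount(const uint8_t* data, const uint8_t* ptr, const uint8_t* end) {
    const size_t remaining = static_cast<size_t>(end - ptr);
    if (remaining == 0)
        return 0;

    if (static_cast<size_t>(end - data) >= 8) {
        // The 8-byte load [end-8, end) lies entirely inside the buffer, so it
        // cannot touch another page. The not-yet-counted bytes are the high
        // bytes of the loaded word; shift out the already-counted low bytes.
        uint64_t last8;
        std::memcpy(&last8, end - 8, sizeof(last8));  // unaligned, aliasing-safe load
        last8 >>= 8 * (8 - remaining);
        return static_cast<uint64_t>(_mm_popcnt_u64(last8));
    }

    // Buffer shorter than 8 bytes: a wide load could cross into the previous
    // page, so count byte by byte instead.
    uint64_t total = 0;
    for (; ptr < end; ++ptr)
        total += static_cast<uint64_t>(_mm_popcnt_u64(*ptr));
    return total;
}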

Peter Cordes
  • I get that shifted-load trick, sure. But why would I try to avoid page crossing? I mean, if my data crosses pages, why should I not cross them? – einpoklum Apr 29 '18 at 11:21
  • @einpoklum: you don't want one load to cross a page boundary. On Skylake, it actually only costs about the same as a cache-line split, but on earlier CPUs it costs an extra ~100 cycles. And if your buffer is only 6 bytes long, an 8-byte load lined up with the *end* of the buffer will cross into the previous page where none of your data is, potentially segfaulting. (Working on an update to talk about Harley-Seal and say which version is probably optimal on which CPU. I'll clarify that page-crossing point, too.) – Peter Cordes Apr 29 '18 at 11:24