9

I need to popcount a 128-bit unsigned variable in the most efficient (fastest) way.

  • OS: Linux/Debian 9
  • Compiler: GCC 8
  • CPU: Intel i7-5775C

Although if the solution is more portable, even better.

First of all, there are two names for this type in GCC: __uint128_t and unsigned __int128. I guess they end up being the same, and I see no reason to write the ugly unsigned __int128 thing, so although the latter is supposed to be the newer spelling, I prefer the first one, which is more similar to the standard uint64_t. Also, Intel's compiler has __uint128_t, which is another reason to use it (portability).
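
As a side note, you can let the compiler confirm that the two names refer to one and the same type. Here is a quick compile-time check (a sketch using C11 _Generic; GNU C only, since the type itself is an extension):

/* Compiles only if __uint128_t is exactly unsigned __int128:
 * _Generic selects 1 for a matching type, 0 otherwise. */
_Static_assert(_Generic((__uint128_t)0, unsigned __int128: 1, default: 0),
               "__uint128_t and unsigned __int128 are the same type");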

I have written the following code:

#include <nmmintrin.h>
#include <stdint.h>

static inline   uint_fast8_t    popcnt_u128 (__uint128_t n)
{
    const uint64_t      n_hi    = n >> 64;  // high 64 bits
    const uint64_t      n_lo    = n;        // low 64 bits (implicit truncation)
    const uint_fast8_t  cnt_hi  = _mm_popcnt_u64(n_hi);
    const uint_fast8_t  cnt_lo  = _mm_popcnt_u64(n_lo);
    const uint_fast8_t  cnt     = cnt_hi + cnt_lo;

    return  cnt;
}

Is this the absolute fastest option?

Edit:

Another option came to my mind, which may (or may not) be faster:

#include <nmmintrin.h>
#include <stdint.h>

union   Uint128 {
    __uint128_t uu128;
    uint64_t    uu64[2];
};

static inline   uint_fast8_t    popcnt_u128 (__uint128_t n)
{
    const union Uint128 n_u     = {.uu128   = n};
    const uint_fast8_t  cnt_a   = _mm_popcnt_u64(n_u.uu64[0]);  // low half (little-endian)
    const uint_fast8_t  cnt_b   = _mm_popcnt_u64(n_u.uu64[1]);  // high half (little-endian)
    const uint_fast8_t  cnt     = cnt_a + cnt_b;

    return  cnt;
}

This way, I would avoid the shift, although I don't know if it is legal (is it? Edit: it is: Type punning between integer and array using `union`?).

  • 2
    To get slightly more portable regarding architectures, use `__builtin_popcountll` instead of `_mm_popcnt_u64`. If you have SSE4.2 available, they should generate the same code, even with old versions of gcc/clang: https://godbolt.org/z/3We1ip (but it falls back to something that works and should be reasonably fast on any target architecture). – chtz Mar 05 '19 at 18:39
  • I don't trust `__builtin_popcountll` because it uses `long long` instead of `uint64_t`. I think it is insane to create a function that deals with bits and uses a type that isn't of fixed width. I don't know what GCC people were thinking about. – alx - recommends codidact Mar 05 '19 at 18:43
  • 2
I don't think it's getting substantially faster than this. Using `POPCNT` is the way to go. IMHO, if you want to optimize, concentrate on cache-line alignment. – zx485 Mar 05 '19 at 19:04
The thing is that Intel has other intrinsics which I don't know of, and which could speed this up (for example, some function to extract the high half of the `__uint128_t`). Maybe GCC already optimizes to use them (I don't know). I'll read about cache-line alignment. – alx - recommends codidact Mar 05 '19 at 19:08
  • @zx485 I'm using a very big static array. Do I have to care about cache, or does GCC handle that for me? I thought that was only a problem with `malloc` and that stuff. GCC knows the array at compile time, so it can do that better than me. – alx - recommends codidact Mar 05 '19 at 19:13
  • If GCC really takes care of that stuff, it's probably ok. But this is beyond my experience. – zx485 Mar 05 '19 at 19:16
In gcc's headers, `_mm_popcnt_u64` also takes `unsigned long long` as parameter: https://github.com/gcc-mirror/gcc/blob/master/gcc/config/i386/popcntintrin.h You can first test `sizeof(long long) * CHAR_BIT >= 64` if you want to, and fall back to something else otherwise ... – chtz Mar 05 '19 at 19:45
And using `union` for type-punning is safe in C (not in C++, however), but I doubt you gain anything with that, since gcc will usually store `__uint128_t` variables in two 64-bit registers anyway (perhaps, in some cases, also in SSE registers). – chtz Mar 05 '19 at 19:48
  • It was just to avoid the shift, which I don't know if the compiler will be able to optimize away by detecting that it is just taking one of the registers. – alx - recommends codidact Mar 05 '19 at 19:56
  • @chtz Even when mixing arrays with integers? That has diverged a lot from this question, so I'll ask a new one. – alx - recommends codidact Mar 05 '19 at 20:04
  • Asked that here: https://stackoverflow.com/q/55010795/6872717 – alx - recommends codidact Mar 05 '19 at 20:11
  • 3
    If you care about performance, you need to benchmark and to look at the code your compiler generates (at least in critical loops). Don't just guess what your compiler is capable of, and optimize prematurely ... Also, when benchmarking, make sure to test your function inside a realistic context. How that function alone is compiled does not mean anything. – chtz Mar 05 '19 at 20:37
  • 1
    @CacahueteFrito The x86 tag includes x86-64. There are many more people who watch x86 than x86-64. So using the x86 tag helps your post to reach more people. Anyway, it's your choice. – Hadi Brais Mar 05 '19 at 20:40
Of course I will test both and see which wins; it just came to my mind. Also the punning, now that I know it is legal, shows the intention better than the shift, I think. But the definitive version will of course be the result of the benchmarks. – alx - recommends codidact Mar 05 '19 at 21:12
  • @HadiBrais This code is specific to `x86-64` (32-bit versions of `x86` don't have `__uint128_t`). – alx - recommends codidact Mar 05 '19 at 21:14
  • 2
@chtz `long long` is required to have at least 64 bits by the C and C++ standards, no need to check. \@CacahueteFrito there's nothing wrong with the GCC guys. They're providing support for the normal standard types, and that guarantees it works on more platforms than fixed-width types – phuclv Mar 06 '19 at 01:41

1 Answer

10

With GCC and clang, both your functions compile to identical asm if you remove the static inline, and presumably will inline equivalently.

I'd suggest using unsigned, because sizeof(uint_fast8_t) == 1 on x86-64 Linux. The _fast types raise the question "fast for what purpose?"; fast8 is good for compact storage in arrays, while fast32 is a 64-bit type which may avoid redoing sign or zero extension for pointer math but wastes space in arrays.

clang knows that the sum of two popcnt results fits in an 8-bit integer without overflow, so it can optimize away zero-extension even if you sum the result into an unsigned counter, but gcc doesn't. (e.g. change the return type to unsigned and you'll get an extra movzx eax, dil instruction.) The hardware popcnt instruction produces a result that's correctly zero-extended to 64 bits, but assigning to uint_fast8_t aka uint8_t explicitly asks the compiler to truncate results to 8 bits.
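
To illustrate, here is the same function written with plain unsigned throughout (a minimal sketch; the name popcnt_u128_u is just for illustration), so no narrowing is ever requested:

#include <nmmintrin.h>
#include <stdint.h>

/* Sketch: keep everything at least 32 bits wide, so the compiler never
 * has to truncate or re-zero-extend the popcnt results. */
static inline unsigned popcnt_u128_u(__uint128_t n)
{
    return (unsigned)_mm_popcnt_u64((uint64_t)(n >> 64))
         + (unsigned)_mm_popcnt_u64((uint64_t)n);
}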

The x86-64 System V ABI allows high garbage in args and return values, so when the return type is narrow a stand-alone version of the function can allow carry into the high bits of EAX.

I would avoid the shift.

The shift only exists in the C source. In the asm, the high/low halves will be stored in separate 64-bit registers, or separate memory source operands.

From the Godbolt compiler explorer

# gcc8.3 -O3 -march=haswell  for the union and the shift version
popcnt_u128:
    xor     eax, eax    # break popcnt's false dependency on Intel CPUs
    popcnt  rsi, rsi    # _mm_popcnt_u64(n_hi);
    popcnt  rax, rdi    # popcnt(lo)
    add     eax, esi        # clang uses add al,cl  and doesn't avoid false deps except in a loop
    ret                 # return value in AL (low 8 bits of EAX)

GCC could have avoided the xor-zeroing by doing both popcnts in place, and using lea eax, [rdi + rsi]. But you said something about an array, so if the data is coming from memory then GCC will normally mov-load and then popcnt in place to avoid the false dependency. (Why does breaking the "output dependency" of LZCNT matter?) Or actually, it will xor-zero the destination and then use memory-source popcnt, which might be slightly smaller code-size.


I don't trust __builtin_popcountll because it uses long long instead of uint64_t. I think it is insane to create a function that deals with bits and uses a type that isn't of fixed width. I don't know what GCC people were thinking about.

It actually uses unsigned long long, not signed long long; that would be insane.

unsigned long long is at least 64 bits, and uint64_t is required to be exactly 64 bits. (And in fact only exists on C implementations that have a type that's exactly 64 bits with no padding; support for it is optional). I'm not sure if GNU C supports any targets where unsigned long long isn't 64 bits, or where uint64_t isn't available. Or even int64_t, which is also required to be 2's complement. (IDK if GCC supports any non-2's-complement targets.)
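
If you'd rather have the compiler document and verify that guarantee than rely on memory, a C11 _Static_assert costs nothing (a sketch; ISO C already guarantees the condition, so this can never fire on a conforming implementation):

#include <limits.h>

/* ISO C guarantees unsigned long long has at least 64 bits,
 * so this assert only documents the assumption. */
_Static_assert(sizeof(unsigned long long) * CHAR_BIT >= 64,
               "unsigned long long must be at least 64 bits");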

You can cast the inputs to uint64_t to make sure there are no higher bits set. Implicit conversion from uint64_t to unsigned long long won't set any extra bits, even on a platform where ULL is wider than 64 bits.

e.g. __builtin_popcountll( (uint64_t)n ); will always safely count the low 64 bits of n, regardless of the width of unsigned long long.
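
Putting that together, a more portable variant of your wrapper could look like this (a sketch; popcnt_u128_portable is an illustrative name, and with -mpopcnt or -march=native it should compile to the same popcnt instructions as the intrinsic version):

#include <stdint.h>

/* Portable sketch: __builtin_popcountll exists on every GCC/clang
 * target and falls back to a software popcount where there's no
 * hardware instruction. The casts truncate to the low 64 bits even
 * if unsigned long long is wider than 64 bits. */
static inline unsigned popcnt_u128_portable(__uint128_t n)
{
    return (unsigned)__builtin_popcountll((uint64_t)(n >> 64))
         + (unsigned)__builtin_popcountll((uint64_t)n);
}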

I'm using a very big static array. Do I have to care about cache, or does GCC handle that for me? I thought that was only a problem with malloc and that stuff. GCC knows the array at compile time, so it can do that better than me.

GCC will (almost?) never re-arrange your loops to change memory access patterns. Static arrays are not substantially different from malloced memory; they don't stay hot in cache for free. See What Every Programmer Should Know About Memory? to learn more.

But if you're just looping sequentially through memory and popcounting a whole array, then it doesn't really matter whether you do it with __uint128_t or not.
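
For example, the straightforward scalar loop over an array looks like this (a sketch; arr and n are hypothetical, and whether a given compiler vectorizes this exact form is worth checking on Godbolt):

#include <stddef.h>
#include <stdint.h>

/* Sketch: popcount a whole array of __uint128_t by summing the two
 * 64-bit halves of each element. */
static inline uint64_t popcnt_u128_array(const __uint128_t *arr, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++) {
        total += (unsigned)__builtin_popcountll((uint64_t)(arr[i] >> 64));
        total += (unsigned)__builtin_popcountll((uint64_t)arr[i]);
    }
    return total;
}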

clang will auto-vectorize __builtin_popcountll or _mm_popcnt_u64 over an array with AVX2 vpshufb (as a nibble LUT), which is good on Intel CPUs including your Broadwell. See Counting 1 bits (population count) on large data using AVX-512 or AVX-2

But unfortunately using your wrapper function for an array of __uint128_t defeats that. See the last 2 functions in the Godbolt link.
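
For reference, the nibble-LUT approach clang emits looks roughly like this when written by hand with AVX2 intrinsics (a sketch of the well-known vpshufb method; it counts bits per byte of one __m256i, and summing the bytes, e.g. with _mm256_sad_epu8, is left out):

#include <immintrin.h>

/* Sketch: per-byte popcount of a 256-bit vector via a 16-entry LUT
 * holding the popcount of every possible nibble, looked up twice per
 * byte with vpshufb (the LUT is duplicated in both 128-bit lanes). */
static inline __m256i popcnt_bytes_avx2(__m256i v)
{
    const __m256i lut = _mm256_setr_epi8(
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
    const __m256i low4 = _mm256_set1_epi8(0x0f);
    const __m256i lo = _mm256_and_si256(v, low4);
    const __m256i hi = _mm256_and_si256(_mm256_srli_epi16(v, 4), low4);
    return _mm256_add_epi8(_mm256_shuffle_epi8(lut, lo),
                           _mm256_shuffle_epi8(lut, hi));
}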

Peter Cordes
1) I'm really surprised that `uint_fast8_t` is `uint8_t`. I imagined that it would be `uint64_t`, as the fastest type capable of holding 8 bits. So `unsigned` is **faster** than `uint_fast8_t`, right? Braindead compilers! They easily do the hard things, and screw up the easy ones. – alx - recommends codidact Mar 06 '19 at 00:39
2) Still don't like the idea of having to check that `unsigned long long` is at least 64 bits. `<stdint.h>` exists for a reason. Of course `__builtin_popcountll((uint64_t)n);` will work, but I think it is still insane to have to do that to use that function safely. – alx - recommends codidact Mar 06 '19 at 00:42
  • 3) The reason I use `__uint128_t` is because I do calculations with matrices of 128x128 bits, which pack nicely into a `{__uint128_t [128]}`, and it's a nice abstraction to work with it. In reality I don't even work with that, but with arrays of arrays of that. The other solution would be to use a `{uint64_t [128][2]}`, but working with that sucks. – alx - recommends codidact Mar 06 '19 at 00:45
  • 4) Maybe a solution would be to wrap the entire array with a `union Wrapper {__uint128_t uu128[128]; uint64_t uu64[128][2];};` to let the compiler use `_mm_popcnt_u64()` over an entire array of `uint64_t`. – alx - recommends codidact Mar 06 '19 at 00:54
  • @CacahueteFrito: you don't have to check that `unsigned long long` is at least 64 bits. ISO C guarantees that. A compiler isn't allowed to provide an `unsigned long long` that's narrower. The reason `(uint64_t)n` works is that it truncates to 64-bit in case `n` was wider. You can get the same effect from using a `uint64_t` temporary to hold the input for popcnt. – Peter Cordes Mar 06 '19 at 03:48
  • 1
    @CacahueteFrito: If you want max performance on x86, you might consider writing intrinsics manually, if not everything auto-vectorizes. (e.g. gcc doesn't auto-vectorize popcnt at all). But unfortunately manual SIMD popcnt isn't worth it with `__m128i` (which could replace `__uint128_t` directly), only `__m256i`, so it's only good if you're looping over rows of a matrix. – Peter Cordes Mar 06 '19 at 03:54
I meant check it in the Standard. It's not easy to remember that `unsigned long long` is at least 64 bits, which is the reason why `uint_least64_t` was invented. – alx - recommends codidact Mar 06 '19 at 04:34
  • I was limiting myself to 128 bits because I didn't know that the possibility of 256 bits even existed. Please elaborate on that. I'm really interested :) – alx - recommends codidact Mar 06 '19 at 04:36
I will not popcount the entire array (`uint256_t [256]`) at once, but someone may be interested in that, so it's worth mentioning. But I would be interested in popcounting each row consecutively and storing an array with the cnt of each row. – alx - recommends codidact Mar 06 '19 at 04:41
  • 1
    @CacahueteFrito: hmm, I think there's some hope for getting a speedup with AVX2 for counting each group of 16 bytes separately. But IIRC, it doesn't win by much over scalar, and packing down to a `uint8_t` count might eat up the speedup. For more about popcnt with 256-bit vectors, see the [Counting 1 bits (population count) on large data using AVX-512 or AVX-2](//stackoverflow.com/q/50081465) Q&A I linked in the question. It's not a 256-bit integer type, it's a SIMD vector. – Peter Cordes Mar 06 '19 at 05:21
  • I don't need to store the count into a uint8_t. I can use any variable for that. – alx - recommends codidact Mar 06 '19 at 05:26
  • But you don't want to store the per-row counts into a `__uint128_t`, that would waste a ton of space in cache. Packing down to `uint32_t` or `uint64_t` might make the shuffling slightly cheaper after a `vpsadbw ymm, _mm256_setzero_si256()` , e.g. a single `vpermd` to grab the per-64-bit counts into the low 128-bit lane, where you could maybe set up for a horizontal sum with a `vpsrlq` by 32 bits instead of using yet another shuffle... But if it's worth caching the results of popcnt instead of computing on the fly every time, it's probably worth packing them down to `uint8_t` – Peter Cordes Mar 06 '19 at 05:32
  • Yes, I was thinking of a `uint64_t`. I was using `uint_fast8_t`, because its name made me think it would be a `uint64_t`. – alx - recommends codidact Mar 06 '19 at 05:37
  • 1
    Oh, maybe `vpackusdw` (`_mm256_packus_epi16`) can usefully combine multiple row results to eventually sort out with one lane-crossing shuffle later, maybe even a `vphaddw` `_mm256_hadd_epi16` to shuffle 2 more pairs of vectors together and horizontally add words. Yup, then a `vpermq` lane-crossing fixup can put things in the right order after we pack down to unsigned bytes, setting up for a 256-bit store of 32 `uint8_t` results for 32 rows. That might or might not end up being faster than scalar on Broadwell, but I think *probably* with a well-optimized AVX2 popcnt. – Peter Cordes Mar 06 '19 at 05:37
  • Could you please add some samples of code to your answer? That sounds like Chinese to me :) – alx - recommends codidact Mar 06 '19 at 06:00
  • That would be a separate question. This question only asked about *one* `__uint128_t`. If you describe your actual problem and what you need the array of per-uint128 results for (in a new question), you might get a useful answer. If you don't know AVX2 intrinsics at all, you're not going to be able to maintain, debug, or understand the code yourself, though. – Peter Cordes Mar 06 '19 at 06:55