Population count in AVX512

Question

I have been trying to use _mm256_popcnt_epi64 on a machine that supports AVX512, in code that was previously optimized for AVX2.

Unfortunately, I ran into the issue that the function isn't found. The corresponding __m512i equivalent is found, however. Is the __m256i function deprecated?

Peter Cordes · Accepted Answer · 2022-02-25T22:11:01.603

_mm512_popcnt_epi64 is part of AVX512-VPOPCNTDQ. The 256 and 128-bit versions also require AVX512VL, which is what allows AVX512 instructions to be used with 128 or 256-bit vectors.

Mainstream AVX512 CPUs all have AVX512-VL. Xeon Phi CPUs don't have AVX512-VL.

(_mm512_popcnt_epi8 and epi16 are also new in Ice Lake, as part of AVX512-BITALG)

Perhaps you forgot to enable the necessary compiler options (like GCC -march=native to enable everything the machine you're compiling on can do), or you're compiling for a target that doesn't have both features. If so, the compiler won't have a definition for _mm256_popcnt_epi64 as an intrinsic, so in C it will assume it's an undeclared function and emit a call to it. (Which will of course not be found at link time.) And/or it will warn or error (C or C++) about a prototype not being found.
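To make the "both features" requirement concrete, here is a minimal sketch that only uses _mm256_popcnt_epi64 when the compiler reports both extensions as enabled. It assumes GCC or Clang, which define __AVX512VPOPCNTDQ__ and __AVX512VL__ when those extensions are enabled and provide __builtin_popcountll. The names popcnt4_u64 and HAVE_VPOPCNT256 are made up for illustration.

```c
#include <stdint.h>

/* GCC and Clang define these macros only when the target options
   (e.g. -march=icelake-client, or -mavx512vpopcntdq -mavx512vl)
   enable both extensions. */
#if defined(__AVX512VPOPCNTDQ__) && defined(__AVX512VL__)
#include <immintrin.h>
#define HAVE_VPOPCNT256 1   /* hypothetical macro name */
#else
#define HAVE_VPOPCNT256 0
#endif

/* Hypothetical helper: population count of each of four 64-bit values. */
void popcnt4_u64(const uint64_t in[4], uint64_t out[4])
{
#if HAVE_VPOPCNT256
    /* One vpopcntq on a 256-bit vector: needs VPOPCNTDQ + VL. */
    __m256i v = _mm256_loadu_si256((const __m256i *)in);
    _mm256_storeu_si256((__m256i *)out, _mm256_popcnt_epi64(v));
#else
    /* Scalar fallback when the intrinsic isn't available for this target. */
    for (int i = 0; i < 4; i++)
        out[i] = (uint64_t)__builtin_popcountll(in[i]);
#endif
}
```

Built without any -march or -m options, this takes the scalar path. Built with something like gcc -O3 -march=icelake-client (on a GCC new enough to enable VPOPCNTDQ for that target), it should take the intrinsic path.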

Very few CPUs currently have AVX512-VPOPCNTDQ (wikipedia AVX512 feature vs. CPU matrix):

Knights Mill (final-generation Xeon Phi): only AVX512-VPOPCNTDQ, no AVX512VL and no BITALG. So only the __m512i versions are available for gcc -O3 -march=knm. You should definitely be using 512-bit vectors on Xeon Phi unless the data layout works perfectly for 256-bit and would take extra shuffling for 512-bit. But beware that it's slow for some AVX / AVX2 instructions that it doesn't have 512-bit versions of, like shuffles with elements smaller than 32-bit. (No AVX512BW)
Ice Lake / Tiger Lake: has AVX512 VPOPCNTDQ, BITALG, and AVX512 VL, so _mm256_popcnt_epi64 and epi8 are supported when compiling for this target microarchitecture, e.g. gcc -O3 -march=icelake-client. (Assuming your compiler's headers are correct).

GCC8.3 and earlier have a bug where -march=icelake-client / icelake-server doesn't enable -mavx512vpopcntdq. (GCC7 doesn't know about -march=icelake-client). It's fixed in GCC8.4, so either upgrade to the latest GCC8, or better upgrade to the latest stable GCC; a couple more years of development should usually help GCC make better code with new ISA extensions like AVX-512, especially with mask registers. Or just manually use -march=icelake-client -mavx512vpopcntdq; that does work: https://godbolt.org/z/a7bhcjdhr

Choosing between 256 vs. 512-bit vectors on Ice Lake is a tradeoff like on Skylake-X: when 512-bit vector uops are in flight, the vector ALUs on port 1 don't get used, and max turbo clock speed may be lowered (see SIMD instructions lowering CPU frequency). So if you don't get much speedup from wider vectors (e.g. because of a memory bottleneck, or because your SIMD loops are only a tiny part of a larger program), it can hurt overall performance to use 512-bit vectors in one loop.

But note that Icelake Client CPUs aren't affected much, and I'm not sure if vpopcnt instructions even count as "heavy", maybe not reducing max turbo as much, if at all on client CPUs. Most integer SIMD instructions don't count. See discussion on LLVM [X86] Prefer 512-bit vectors on Ice/Rocket/TigerLake (PR48336). The vector ALU part of port 1 still shuts down while 512-bit uops are in flight, though.

Other CPUs don't have hardware SIMD popcnt support at all, and no form of _mm512_popcnt_epi64 is available.

Even if you only have AVX2, not AVX512 at all, SIMD popcnt is a win vs. scalar popcnt, over non-tiny arrays on modern CPUs with fast vpshufb (_mm256_shuffle_epi8). https://github.com/WojciechMula/sse-popcount/ has AVX2, and AVX512 versions that use vpternlogd for Harley-Seal accumulation to reduce the amount of SIMD LUT lookups for popcounting.

Also on Stack Overflow, "Counting 1 bits (population count) on large data using AVX-512 or AVX-2" shows some code copied from that repo a couple of years ago.

If you need a separate count for each element, just use the standard nibble unpack for vpshufb, then vpsadbw against a zero vector to hsum into 64-bit qword chunks.
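As a rough illustration of the vpshufb + vpsadbw approach for per-element counts, the sketch below emulates a 256-bit per-qword popcount with plain AVX2. It is not taken from the linked repos. The function name popcnt4_u64_avx2 is invented here, and the scalar branch (which assumes GCC/Clang's __builtin_popcountll) exists only so the code still builds when AVX2 isn't enabled.

```c
#include <stdint.h>
#ifdef __AVX2__
#include <immintrin.h>
#endif

/* Hypothetical helper: per-element popcount of four 64-bit values. */
void popcnt4_u64_avx2(const uint64_t in[4], uint64_t out[4])
{
#ifdef __AVX2__
    /* 16-entry table of nibble popcounts, repeated in both 128-bit lanes
       because vpshufb shuffles within each lane. */
    const __m256i lut = _mm256_setr_epi8(
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
    const __m256i low_mask = _mm256_set1_epi8(0x0f);

    __m256i v  = _mm256_loadu_si256((const __m256i *)in);
    /* Split every byte into its low and high nibble. */
    __m256i lo = _mm256_and_si256(v, low_mask);
    __m256i hi = _mm256_and_si256(_mm256_srli_epi16(v, 4), low_mask);
    /* Table lookup per nibble, then add: popcount of each byte (0..8). */
    __m256i cnt8 = _mm256_add_epi8(_mm256_shuffle_epi8(lut, lo),
                                   _mm256_shuffle_epi8(lut, hi));
    /* vpsadbw against zero sums each group of 8 bytes into a 64-bit lane. */
    __m256i cnt64 = _mm256_sad_epu8(cnt8, _mm256_setzero_si256());
    _mm256_storeu_si256((__m256i *)out, cnt64);
#else
    /* Fallback so the sketch compiles without -mavx2. */
    for (int i = 0; i < 4; i++)
        out[i] = (uint64_t)__builtin_popcountll(in[i]);
#endif
}
```

For a total over a large array you would typically keep accumulating the byte or qword counts in vector registers and only reduce at the end, which is what the Harley-Seal versions in the linked repo build on.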

If you need positional popcount (separate sum for each bit-position), see https://github.com/mklarqvist/positional-popcount.

Great, thanks a lot! I was working on Skylake and the intrinsics guide made me think that AVX512VL is enough. — Alex, May 19 '20 at 08:03
@Alex: That doesn't explain `_mm512_popcnt_epi64` working for you. It should also compiler error, or fault as an illegal instruction if you use compiler options that let it compile. So check your build options. — Peter Cordes, May 19 '20 at 08:08
My code is part of a larger project and I got casting errors when compiling when I tried to popcnt on `__m256i`. That's what made me think that this function is available. I will try to check it explicitly. — Alex, May 19 '20 at 08:17
@Alex: That made you think the `__m512i` version was available? Ok. No it isn't available with `-march=skylake-avx512`. I was just worried something else was totally wrong with your build options or test setup, but making wrong assumptions explains it just fine. — Peter Cordes, May 19 '20 at 08:19
@PeterCordes : I am getting error on ice-lake `/usr/lib/gcc/x86_64-linux-gnu/8/include/avx512vpopcntdqvlintrin.h:117:1: error: inlining failed in call to always_inline ‘__m256i _mm256_popcnt_epi64(__m256i)’: target specific option mismatch` . checked avx512 and other variants are present and also compiler flags -O3 -march=native is on. any idea what am i missing , any headers ? — user179156, Feb 25 '22 at 12:16
@user179156: Are you maybe running in a VM that doesn't pass through AVX-512 in general, or some of the sub-features? Since you say you have avx512, that rules out an Ice-lake Pentium or Celeron that only have AVX2. `-march=icelake-client` and `-march=icelake-server` both support it, but if your VM doesn't pass through AVX-512F then a binary built that way won't run. (If a VM does pass through AVX-512F but fails to pass through some other AVX-512F feature bits, `-march=native` won't see it; it believes CPUID, but the instructions will actually run without faulting.) — Peter Cordes, Feb 25 '22 at 12:50
https://godbolt.org/z/74sx9TY3G : here it doesnt compile when i use `-march=native` but not when i use `-march=icelake-client` or `-march=icelake-server` — user179156, Feb 25 '22 at 18:30
it seems like its a gcc version issue ,doesnt work with gcc 8 but compiles with 9 and later — user179156, Feb 25 '22 at 20:28
@user179156: Not surprising for Godbolt; according to `-fverbose-asm` output (https://godbolt.org/z/aaEoorYs7), `-march=native` there is `-march=skylake-avx512`; it runs on AWS instances with SKX or Cascade Lake CPUs. For now anyway; that's why it warns that `-march=native` depends on whatever AWS instance it runs on. — Peter Cordes, Feb 25 '22 at 21:52
@user179156: As for `-march=icelake-client`, yeah interesting, seems GCC8.3 and earlier have a bug where they think ICL doesn't have `-mavx512vpopcntdq` (But they do enable `-mavx512bitalg`, so amusingly it does support `_mm256_popcnt_epi8` and epi16). https://godbolt.org/z/YMqGGs67G. If `-march=native` detection of extensions from CPUID is also affected by that bug, that would break things even for local usage. (GCC7 and earlier don't know about `-march=icelake-client` at all.) — Peter Cordes, Feb 25 '22 at 21:54
@user179156: Thanks for bringing that bug in older GCC to my attention; updated my answer in case future readers have the same problem. Turns out `-march=icelake-client -mavx512vpopcntdq` does work with GCC8.3 — Peter Cordes, Feb 25 '22 at 22:09
