For a function like the one below, clang (and sometimes gcc, in contexts I cannot reproduce minimally) seems to generate bloated code when the -mavx2 switch is on.

#include <stdint.h>

unsigned count(uint64_t *f) {
    unsigned c = 0;
    for (unsigned i = 0; i < 1024; ++i) {
        if (sizeof(long) >= 8) {  /* pick the builtin whose argument type is 64-bit */
            c += __builtin_popcountl(f[i]);
        } else {
            c += __builtin_popcountll(f[i]);
        }
    }
    return c;
}

This is gcc's output, and it's quite straightforward.

count:
        lea     rcx, [rdi+8192]
        xor     eax, eax
.L2:
        xor     edx, edx
        add     rdi, 8
        popcnt  rdx, QWORD PTR [rdi-8]
        add     eax, edx
        cmp     rcx, rdi
        jne     .L2
        ret

However, clang generates this massive bloat when -mavx2 is on (-mpopcnt was also set).

.LCPI0_0:
        .zero   32,15
.LCPI0_1:
        .byte   0                               # 0x0
        .byte   1                               # 0x1
        .byte   1                               # 0x1
        .byte   2                               # 0x2
        .byte   1                               # 0x1
        .byte   2                               # 0x2
        .byte   2                               # 0x2
        .byte   3                               # 0x3
        .byte   1                               # 0x1
        .byte   2                               # 0x2
        .byte   2                               # 0x2
        .byte   3                               # 0x3
        .byte   2                               # 0x2
        .byte   3                               # 0x3
        .byte   3                               # 0x3
        .byte   4                               # 0x4
        .byte   0                               # 0x0
        .byte   1                               # 0x1
        .byte   1                               # 0x1
        .byte   2                               # 0x2
        .byte   1                               # 0x1
        .byte   2                               # 0x2
        .byte   2                               # 0x2
        .byte   3                               # 0x3
        .byte   1                               # 0x1
        .byte   2                               # 0x2
        .byte   2                               # 0x2
        .byte   3                               # 0x3
        .byte   2                               # 0x2
        .byte   3                               # 0x3
        .byte   3                               # 0x3
        .byte   4                               # 0x4
count:                                  # @count
        vpxor   xmm0, xmm0, xmm0
        xor     eax, eax
        vmovdqa ymm1, ymmword ptr [rip + .LCPI0_0] # ymm1 = [15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15]
        vmovdqa ymm2, ymmword ptr [rip + .LCPI0_1] # ymm2 = [0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4]
        vpxor   xmm12, xmm12, xmm12
        vpxor   xmm4, xmm4, xmm4
        vpxor   xmm5, xmm5, xmm5
        vpxor   xmm6, xmm6, xmm6
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        vmovdqu ymm7, ymmword ptr [rdi + 8*rax]
        vmovdqu ymm8, ymmword ptr [rdi + 8*rax + 32]
        vmovdqu ymm9, ymmword ptr [rdi + 8*rax + 64]
        vmovdqu ymm10, ymmword ptr [rdi + 8*rax + 96]
        vpand   ymm11, ymm7, ymm1
        vpshufb ymm11, ymm2, ymm11
        vpsrlw  ymm7, ymm7, 4
        vpand   ymm7, ymm7, ymm1
        vpshufb ymm7, ymm2, ymm7
        vpaddb  ymm7, ymm11, ymm7
        vpsadbw ymm7, ymm12, ymm7
        vpand   ymm11, ymm8, ymm1
        vpshufb ymm11, ymm2, ymm11
        vpsrlw  ymm8, ymm8, 4
        vpand   ymm8, ymm8, ymm1
        vpshufb ymm8, ymm2, ymm8
        vpaddb  ymm8, ymm8, ymm11
        vpsadbw ymm8, ymm8, ymm12
        vpand   ymm11, ymm9, ymm1
        vpshufb ymm11, ymm2, ymm11
        vpsrlw  ymm9, ymm9, 4
        vpand   ymm9, ymm9, ymm1
        vpshufb ymm9, ymm2, ymm9
        vpaddb  ymm9, ymm9, ymm11
        vpsadbw ymm9, ymm9, ymm12
        vpand   ymm11, ymm10, ymm1
        vpshufb ymm11, ymm2, ymm11
        vpsrlw  ymm10, ymm10, 4
        vpand   ymm10, ymm10, ymm1
        vpshufb ymm10, ymm2, ymm10
        vpaddb  ymm10, ymm10, ymm11
        vpsadbw ymm10, ymm10, ymm12
        vextracti128    xmm3, ymm7, 1
        vpackusdw       xmm3, xmm7, xmm3
        vpaddd  xmm0, xmm0, xmm3
        vextracti128    xmm3, ymm8, 1
        vpackusdw       xmm3, xmm8, xmm3
        vpaddd  xmm4, xmm4, xmm3
        vextracti128    xmm3, ymm9, 1
        vpackusdw       xmm3, xmm9, xmm3
        vpaddd  xmm5, xmm5, xmm3
        vextracti128    xmm3, ymm10, 1
        vpackusdw       xmm3, xmm10, xmm3
        vpaddd  xmm6, xmm6, xmm3
        add     rax, 16
        cmp     rax, 1024
        jne     .LBB0_1
        vpaddd  xmm0, xmm4, xmm0
        vpaddd  xmm0, xmm5, xmm0
        vpaddd  xmm0, xmm6, xmm0
        vpshufd xmm1, xmm0, 238                 # xmm1 = xmm0[2,3,2,3]
        vpaddd  xmm0, xmm0, xmm1
        vpshufd xmm1, xmm0, 85                  # xmm1 = xmm0[1,1,1,1]
        vpaddd  xmm0, xmm0, xmm1
        vmovd   eax, xmm0
        vzeroupper
        ret

clang's output is similar to gcc's when only -mpopcnt is on, with a bit of unrolling.

count:                                  # @count
        xor     ecx, ecx
        xor     eax, eax
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        popcnt  rdx, qword ptr [rdi + 8*rcx]
        add     edx, eax
        popcnt  rsi, qword ptr [rdi + 8*rcx + 8]
        add     esi, edx
        popcnt  rdx, qword ptr [rdi + 8*rcx + 16]
        popcnt  rax, qword ptr [rdi + 8*rcx + 24]
        add     edx, esi
        add     eax, edx
        add     rcx, 4
        cmp     rcx, 1024
        jne     .LBB0_1
        ret

According to Agner Fog's instruction tables (https://www.agner.org/optimize/instruction_tables.pdf), popcnt is a very cheap instruction on most architectures. So why does clang generate such bloat to replace popcnt when I explicitly allowed it with -mpopcnt? The optimization level was -O3 in all cases.
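
As an aside, clang's auto-vectorizer can be disabled for just this loop; a minimal sketch using clang's loop pragma (this is an illustration, not one of the builds shown above):

unsigned count_scalar(uint64_t *f) {
    unsigned c = 0;
    /* Tell clang's loop vectorizer to leave this loop alone. */
#pragma clang loop vectorize(disable)
    for (unsigned i = 0; i < 1024; ++i)
        c += (unsigned)__builtin_popcountll(f[i]);
    return c;
}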

Here is a link to godbolt (https://godbolt.org/z/4vWK33a7c).

  • If you want to compute the population count of large arrays, consider implementing an algorithm like [Muła's](https://github.com/WojciechMula/sse-popcount). – fuz Jan 13 '22 at 15:58

1 Answer

It's auto-vectorizing as well as unrolling, which is a performance win for large arrays (or would be if clang had less overhead), at least on Intel CPUs where popcnt is 1/clock, so 64 bits per clock. (AMD Zen has 3 or 4/clock popcnt, so with add instructions taking an equal amount of the 4 scalar-integer ALU ports, it could sustain 2/clock uint64_t popcnt+load and add.) https://uops.info/
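
To make the dependency-chain point concrete, here is a hypothetical scalar variant (not from the question) that splits the sum into independent accumulators so popcnt+add pairs from different iterations can execute in parallel:

#include <stdint.h>

/* Hypothetical illustration: four independent accumulators break the serial
   add chain, letting several popcnt+add pairs be in flight at once. */
unsigned count_ilp(const uint64_t *f) {
    unsigned c0 = 0, c1 = 0, c2 = 0, c3 = 0;
    for (unsigned i = 0; i < 1024; i += 4) {
        c0 += (unsigned)__builtin_popcountll(f[i]);
        c1 += (unsigned)__builtin_popcountll(f[i + 1]);
        c2 += (unsigned)__builtin_popcountll(f[i + 2]);
        c3 += (unsigned)__builtin_popcountll(f[i + 3]);
    }
    return (c0 + c1) + (c2 + c3);
}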

But vpshufb is also 1/clock on Intel (or 2/clock on Ice Lake), and if it's the bottleneck that's 128 bits of popcount work per clock. (Doing table lookups for the low 4 bits of each of 32 bytes.) But it's certainly not going to be that good, with all the extra shuffling it's doing inside the loop. :/
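
In intrinsics, that per-byte table lookup looks roughly like this (a sketch of the standard nibble-LUT idea; the constant matches the .LCPI0_1 table above):

#include <immintrin.h>

/* Popcount of every byte in v via two 16-entry nibble lookups
   (vpand / vpsrlw / vpshufb), the same trick as clang's loop body. */
static inline __m256i popcount_bytes(__m256i v) {
    const __m256i low4 = _mm256_set1_epi8(0x0F);
    const __m256i lut = _mm256_setr_epi8(     /* popcount of nibbles 0..15 */
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
    __m256i lo = _mm256_and_si256(v, low4);
    __m256i hi = _mm256_and_si256(_mm256_srli_epi16(v, 4), low4);
    return _mm256_add_epi8(_mm256_shuffle_epi8(lut, lo),
                           _mm256_shuffle_epi8(lut, hi));
}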

This vectorization loses on Zen1, where the SIMD ALUs are only 128 bits wide, but should be a significant win on Intel, and maybe a win on Zen2 and later.


But it looks like clang widens to 32-bit counts inside the inner loop after vpsadbw, so it's not as good as it could be. 1024x uint64_t is 256 __m256i vectors of input data, and clang is unrolling by 4, so the max count in any one element is only 64, which can't overflow.

Clang is unrolling a surprising amount, given how much work it does. The vextracti128 and vpackusdw don't make much sense to me; IDK why it does that inside the loop. The simple way to vectorize without overflow risk is just vpsadbw -> vpaddq or vpaddd, and it's already using vpsadbw for horizontal byte sums within 8-byte chunks. (A better way is to defer that until just before the byte elements could overflow, doing a few vpaddb first, like in How to count character occurrences using SIMD, although the byte counters there are only incremented by 0 or 1, rather than 0..8.)
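
A sketch of that simpler accumulation, reusing popcount_bytes from the sketch above (this illustrates the suggestion, not what clang actually emits):

/* vpsadbw sums the 8 bytes in each 64-bit lane (max 8*8 = 64, so no
   overflow), then vpaddq accumulates 64-bit counters across iterations. */
unsigned count_avx2(const uint64_t *f) {
    const __m256i zero = _mm256_setzero_si256();
    __m256i acc = _mm256_setzero_si256();
    for (unsigned i = 0; i < 1024; i += 4) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(f + i));
        acc = _mm256_add_epi64(acc, _mm256_sad_epu8(popcount_bytes(v), zero));
    }
    /* Horizontal sum of the four 64-bit counters, once, after the loop. */
    __m128i s = _mm_add_epi64(_mm256_castsi256_si128(acc),
                              _mm256_extracti128_si256(acc, 1));
    s = _mm_add_epi64(s, _mm_unpackhi_epi64(s, s));
    return (unsigned)_mm_cvtsi128_si64(s);
}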

See Counting 1 bits (population count) on large data using AVX-512 or AVX-2, especially Wojciech Muła's big-array popcnt functions: https://github.com/WojciechMula/sse-popcount/ - clang is using the same strategy as popcnt_AVX2_lookup but with a much less efficient way to accumulate the results across iterations.
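
The deferred-widening refinement mentioned above could look like this (a sketch in the spirit of Muła's popcnt_AVX2_lookup, again reusing popcount_bytes; 31 inner iterations keep every byte sum at most 31*8 = 248, under the 255 limit):

unsigned count_avx2_deferred(const uint64_t *f) {
    const __m256i zero = _mm256_setzero_si256();
    __m256i acc64 = _mm256_setzero_si256();
    unsigned i = 0;
    while (i < 1024) {
        __m256i acc8 = _mm256_setzero_si256();
        /* Up to 31 vpaddb steps before any byte counter could overflow. */
        for (unsigned j = 0; j < 31 && i < 1024; ++j, i += 4)
            acc8 = _mm256_add_epi8(acc8,
                popcount_bytes(_mm256_loadu_si256((const __m256i *)(f + i))));
        acc64 = _mm256_add_epi64(acc64, _mm256_sad_epu8(acc8, zero));
    }
    __m128i s = _mm_add_epi64(_mm256_castsi256_si128(acc64),
                              _mm256_extracti128_si256(acc64, 1));
    s = _mm_add_epi64(s, _mm_unpackhi_epi64(s, s));
    return (unsigned)_mm_cvtsi128_si64(s);
}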

  • I was totally missing the point that instructions without dependency can run in parallel. I mean the code doesn't make sense when each instruction is run sequentially. – xiver77 Jan 14 '22 at 15:11
  • Starting with Ice Lake Xeon, there is an AVX512 SIMD pop count instruction, supporting 128-bit, 256-bit, and 512-bit vectors. For 64-bit values, 512 bits gives 8 pops per instruction with a throughput of one instruction per cycle and a latency of three cycles -- so 2x faster than Zen3 at 4/clock. The AVX512 advantage increases rapidly if the pop count is being performed on smaller integers -- up to 64 per cycle for 8-bit values in 512-bit registers. – John D McCalpin Jan 18 '22 at 00:12
  • @JohnDMcCalpin: Yes, even gcc as well as clang auto-vectorize with that with `-march=icelake-client`. https://godbolt.org/z/rso96MWEo (Or `icelake-server`. On the client side unfortunately not quite "starting with", since AVX-512 disappointingly disappeared again with Alder Lake, and Intel's firmware updates lock off the ability to enable it when the E-cores are disabled. Maybe so they can sell cores with defects in the high 256 bits of the execution units? But if so, you'd think they'd have already fused them off... Anyway, that rant's not really relevant here.) – Peter Cordes Jan 18 '22 at 01:14
  • Oh, interesting, I hadn't realized AVX512 BITALG for `vpopcntb/w` was introduced in Ice Lake as well. So Knight's Mill is the only CPU with `vpopcntd`/`q` (VPOPCNT feature) for 32/64 bit chunks but not `vpopcntb`/`w`. – Peter Cordes Jan 18 '22 at 01:21
  • @JohnDMcCalpin: ICX and ICL should both support both. GCC and clang `-march=icelake-server` both accept `_mm_popcnt_epi16` and compile it to `vpopcntw`. https://godbolt.org/z/5GxfaeEEr Are you saying they're both buggy, and that instruction will fault on real Ice Lake? Wikipedia agrees with gcc/clang that only Knight's Mill had VPOPCNT without BITALG https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512, but of course it's possible they could all be working from the same wrong information. If you checked CPUID in a VM, did it maybe filter that feature bit? – Peter Cordes Jan 19 '22 at 15:20
  • @JohnDMcCalpin: http://users.atw.hu/instlatx64/GenuineIntel/GenuineIntel00606A6_ICX_InstLatX64.txt from an IceLake-SP confirms that server Ice Lake does indeed have the AVX512_BITALG CPUID feature, and 1/clock 3c latency VPOPCNTB/W/D/Q for x/y/zmm vector widths is there in the test results. (Agreed with there being too many separate extension names / feature bits to keep track of, though. I wouldn't have remembered it was BITALG for those byte/word popcounts, plus one other instruction, if I hadn't looked it up.) – Peter Cordes Jan 19 '22 at 15:22
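
For completeness, the vpopcntq path discussed in these comments could be written directly with intrinsics; a sketch assuming AVX512VPOPCNTDQ support (e.g. -march=icelake-server), not code from the thread:

#include <immintrin.h>
#include <stdint.h>

/* Eight 64-bit popcounts per vpopcntq; one horizontal sum at the end. */
unsigned count_avx512(const uint64_t *f) {
    __m512i acc = _mm512_setzero_si512();
    for (unsigned i = 0; i < 1024; i += 8)
        acc = _mm512_add_epi64(acc,
                  _mm512_popcnt_epi64(_mm512_loadu_si512(f + i)));
    return (unsigned)_mm512_reduce_add_epi64(acc);
}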