Bit popcount for large buffer, with Core 2 CPU (SSSE3)

Question

I'm looking for the fastest way to popcount on large buffer of 512 or more bytes. I can guarantee any required alignment, and the buffer size is always a power of 2. The buffer corresponds to block allocations, so typically the bits are either all set, none set, or mostly set favoring the "left" of the buffer, with occasional holes.

Some solutions I've considered are:

I'm interested in the fastest solution, it must work on 32bit x86 chipset belonging to core2 or more recent. SSE and SIMD are of great interest. I'll be testing on the following quad core CPU:

matt@stanley:~/anacrolix/public/stackoverflow$ cat /proc/cpuinfo 
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz
stepping        : 11
cpu MHz         : 1600.000
cache size      : 4096 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc arch_perfmon pebs bts aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm tpr_shadow vnmi flexpriority
bogomips        : 4800.21
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

@aaa carp: Please provide a code example making use of this as an answer! Links to canonical descriptions of popcnt and how to use it on GCC are also a good idea. — Matt Joiner, Sep 12 '10 at 07:30
@Matt, you find it mentioned here https://secure.wikimedia.org/wikipedia/en/wiki/SSE4 — Jens Gustedt, Sep 12 '10 at 09:03
@Jens Gustedt: I know of the instruction (although it's not supported on my CPU), but not of it's usage on GCC. — Matt Joiner, Sep 12 '10 at 09:06
@Matt: if you use `gcc` I wouldn't worry in any case to implement this in assembler. I would trust the guys, use `__builtin_popcountll` and compile with `-march=native`. But I don't have that instruction either on my machine, so I can't confirm that this is doing the right thing: on my machine this still results in a function call. — Jens Gustedt, Sep 12 '10 at 10:25
Why? The very first Google hit for "popcount" appears to be a recent page by Bart Massey (author of XCB) documenting his search for the best popcount algorithm, which includes not only the algorithms he tried, but also his benchmarking code and results. — llasram, Sep 12 '10 at 10:26
The CPU you've shown above doesn't have the `popcnt` instruction anyway (there is a specific feature flag for the presence of this instruction, which shows up as `popcnt` in the `flags` line in `/proc/cpuinfo`). — Matthew Slattery, Sep 12 '10 at 12:20
@Matthew Slattery: Yes I pointed that out already, I expected `sse4` for the POPCNT instruction. — Matt Joiner, Sep 12 '10 at 15:06
@llasram: Yes I already looked at, and passed over those, they're not optimized for large buffers. — Matt Joiner, Sep 12 '10 at 15:08
@Matt: Maybe I'm missing something, but what (other than potentially SIMD instructions) would result in the most efficient algorithm for individual words not being the most efficient for large buffers? — llasram, Sep 12 '10 at 15:52
@llasram: Some examples: 24words linked my question is able to operate on 96 bytes chunks without a single branch. Speeding up operations on single words is nice, but there's still an O(n) cost with the implicit bounds checking etc. for a large array. Another is unrolling, algorithms optimized for large buffers can employ this to enormous effect. Often combining a non-trivial sequence of instructions can perform the popcount (or some other task) in far less cycles than operating on words individually. Another algorithm I found managed to use a MULT instruction to shave off cycles. — Matt Joiner, Sep 12 '10 at 16:06
are there any requirements on atomicity / multi-thread access? — rwong, Sep 13 '10 at 02:08
@Matt: I'd expect POPCNT to be implemented in a very small number of cycles, and if that's the case its probably tough to beat, especially if you unroll a loop containing POPCNT 16x or some such. If you *don't* have POPCNT, then tricky assembly code might apply. — Ira Baxter, Sep 14 '10 at 04:08
[Sorry for necro-ing such an old Q] While such experiments are always fun and sometimes even helpful, I'd like to point out that (for no sane, obvious reason) I just compiled and ran the test suite on my moderately-recent (Skylake) desktop. Unsurprisingly, the simplest, most straightforward, _most readable_ solution using the compiler intrinsic runs more than 4 times faster than the "best" optimized (and completely unreadable) version. — Damon, Nov 13 '18 at 15:20

score 5 · Answer 1 · edited Apr 28 '18 at 08:11

5

See a 32 bit version in the AMD Software Optimization guide, page 195 for one implementation. This gives you assembly code for an x86 directly.

See a variant at Stanford bit-twiddling hacks The Stanford version looks like the best one to me. It looks very easy to code as x86 asm.

Neither of these use branch instructions.

These can be generalized to 64 bit versions.

With the 32 or 64 bit versions, you might consider doing a SIMD version. SSE2 will do 4 double-words or two quadwords (either way 128 bits) at once. What you want to do is implement the popcount for 32 or 64 bits in each of the 2 or 4 registers available. You'll end up with 2 or 4 sets of popcounts in the XMM registers when you are done; final step is to store and add those popcounts together to get the final answer. Guessing, I'd expect you do so slightly better doing 4 parallel 32 bit popcounts rather than 2 parallel 64 bit popcounts, as the latter is likely to take 1 or 2 additional instructions in each iteration, and its easy to add 4, 32 bit values together the end.

edited Apr 28 '18 at 08:11

hintss

398
1
3
13

answered Sep 13 '10 at 01:39

Ira Baxter

93,541
22
172
341

1

Can you provide links for the AMD guide, and the SIMD instructions you'd use? – Matt Joiner Sep 13 '10 at 02:12
@Matt: Oops, I bungled the AMD Optimiziation guide link, fixed. – Ira Baxter Sep 13 '10 at 02:15
@Matt: Regarding the SIMD instructions: the coded examples consist mostly of LOAD, AND, SHIFT, ADD scalar machine instructions and no branches. The x86 SIMD instructions have easily-found equivalents; the only real issue is making sure that the operands are aligned on 128 or 256 bit boundaries depending on which SIMD set you are using. You said that wasn't a problem for you :-} – Ira Baxter Sep 13 '10 at 02:18
1

@Matt: You got me playing with this... the solutions all listed here compute the popcnt for a single word. Obviously, you can unroll a loop to do more than one word, and you can go SIMD. What might not be so obvious is that one can compute popcnt for several words at a time at something like half the instructions it takes to do one, amortized (I've sketched an existence proof in the margin of SO :-} but its too bulky to fit in this comment. That would work nicely with your 512-byte buffer to process. How badly do you want the fastest answer? – Ira Baxter Sep 13 '10 at 06:20
@Ira Baxter: I'm glad someone understands there's a difference. I'm keen to see an answers that can beat what I currently have, or appealingly trade performance for simplicity and required chipset features. – Matt Joiner Sep 13 '10 at 08:50
@Matt: You aren't going to get simplicity if you want high performance; you'll get a convoluted piece of code that takes advantage of lots of obscure facts. Want to exhibit what you have so far? – Ira Baxter Sep 13 '10 at 14:09

score 2 · Answer 2 · answered Sep 12 '10 at 16:47

2

If you had popcnt:

http://kent-vandervelden.blogspot.com/2009/10/counting-bits-population-count-and.html

http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011/compiler_c/intref_cls/common/intref_sse42_ATA.htm

answered Sep 12 '10 at 16:47

Anycorn

50,217
42
167
261

Matt Joiner · Accepted Answer · 2012-03-12T12:00:21.910

2

I outline the best C/assembly functions I found for population count/Hamming weight of large buffers below.

The fastest assembly function is ssse3_popcount3, described here. It requires SSSE3, available on Intel Core 2 and later, and AMD chipsets arriving in 2011. It uses SIMD instructions to popcount in 16 byte chunks and unrolls 4 loop iterations at a time.

The fastest C function is popcount_24words, described here. It uses the bit-slicing algorithm. Of note I found that clang could actually generate appropriate vector assembly instructions, which gave impressive performance increases. This aside, the algorithm is still extremely fast.

edited Mar 12 '12 at 12:00

answered Feb 21 '11 at 14:14

Matt Joiner

112,946
110
377
526

1

Note also that SSE4 has a `POPCNT` instruction. – Paul R Mar 16 '11 at 07:22
@Paul R: Yeah this has been mentioned. I restricted solutions to SSSE3 or lower in the question, it's also been found that POPCNT while very fast and convenient is not faster than some vectorized solutions. – Matt Joiner Mar 16 '11 at 20:54
3

This is incorrect. `POPCNT` is the fastest way to do it. [Benchmarks and detailed explanation here](http://danluu.com/assembly-intrinsics/). – dan Nov 03 '14 at 12:48
2

This answer could be greatly improved by showing actual code, rather than being dependent on external sites. As it is, the answer may be invalidated by changes to resources outside your control. – Toby Speight Feb 02 '17 at 12:05
1

@dan: AVX2 beats scalar `popcnt` on modern Intel (Haswell and later). But only with 256-bit vectors: AVX1 / SSSE3 are not faster, IIRC. Ryzen is interesting; it has 4 per clock 64-bit `popcnt`, so it probably does best with scalar. AVX512 `vpternlogd` enables more optimizations: [Large (0,1) matrix multiplication using bitwise AND and popcount instead of actual int or float multiplies?](//stackoverflow.com/q/45376006), specifically https://github.com/WojciechMula/sse-popcount/blob/master/popcnt-avx512-harley-seal.cpp does 30x vpternlogd + 1 vector popcnt for 16x ZMM vectors (16x 512 bits). – Peter Cordes Apr 28 '18 at 12:15

score 1 · Answer 4 · answered Sep 12 '10 at 17:02

1

I would suggest implementing one of the optimised 32 bit popcnt routines from Hacker's Delight, but do it for 4 x 32 bit integer elements in an SSE vector. You can then process 128 bits per iteration, which should give you around 4x throughput compared to an optimised 32 bit scalar routine.

answered Sep 12 '10 at 17:02

Paul R

208,748
37
389
560

I don't know how it's implemented in the routines you're referring to, but certainly the SSE4.1 POPCNT instruction only works on general-purpose, and not on XMM registers. – PhiS Sep 13 '10 at 12:36
@PhiS: indeed - I'm not suggesting using the SSE 4.1 POPCNT - there are some efficient routines in Hacker's Delight for a 32 bit popcnt in C, which could be used as the basis for an SSE 4 x 32 bit popcnt implementation. You would then iterate through your block of data, 128 bits a a time, generating 4 x 32 bit partial population counts, which could then be summed at the end to get the total population count. – Paul R Sep 13 '10 at 13:18
Actually some of the best solutions I found used vector operations to do 128bit counts. – Matt Joiner Feb 21 '11 at 13:56

Bit popcount for large buffer, with Core 2 CPU (SSSE3)

4 Answers4

Linked