
I have this simple binary correlation method. It beats table lookup and HAKMEM bit-twiddling methods by 3-4x and is 25% faster than GCC's __builtin_popcount (which I think maps to a popcnt instruction when SSE4 is enabled).

Here is the much simplified code:

// needs <immintrin.h> (and <cstdint>) for the SSE intrinsics and _mm_popcnt_u64
int correlation(uint64_t *v1, uint64_t *v2, int size64) {
  __m128i* a = reinterpret_cast<__m128i*>(v1);
  __m128i* b = reinterpret_cast<__m128i*>(v2);
  int count = 0;
  for (int j = 0; j < size64 / 2; ++j, ++a, ++b) {
    union { __m128i s; uint64_t b[2]; } x;
    x.s = _mm_xor_si128(*a, *b);
    count += _mm_popcnt_u64(x.b[0]) + _mm_popcnt_u64(x.b[1]);
  }
  return count;
}

I tried unrolling the loop, but I think GCC already does this automatically, so I ended up with the same performance. Do you think performance can be further improved without making the code too complicated? Assume v1 and v2 are of the same size and that size is even.

I am happy with its current performance but I was just curious to see if it could be further improved.

Thanks.

Edit: Fixed an error in the union, and it turned out this error was what made this version faster than __builtin_popcount. Anyway, I modified the code again; it is slightly faster than the builtin now (15%), but I don't think it is worth investing more time in this. Thanks for all the comments and suggestions.

// same a, b (__m128i*) as above; assumes size64 is a multiple of 4
for (int j = 0; j < size64 / 4; ++j, a += 2, b += 2) {
  __m128i x0 = _mm_xor_si128(_mm_load_si128(a), _mm_load_si128(b));
  count += _mm_popcnt_u64(_mm_extract_epi64(x0, 0))
         + _mm_popcnt_u64(_mm_extract_epi64(x0, 1));
  __m128i x1 = _mm_xor_si128(_mm_load_si128(a + 1), _mm_load_si128(b + 1));
  count += _mm_popcnt_u64(_mm_extract_epi64(x1, 0))
         + _mm_popcnt_u64(_mm_extract_epi64(x1, 1));
}

Second Edit: It turned out that the builtin is the fastest after all, sigh, especially with the -funroll-loops and -fprefetch-loop-arrays flags. Something like this:

for (int j = 0; j < size64; ++j) {  // a, b are the plain uint64_t* buffers here
  count += __builtin_popcountll(a[j] ^ b[j]);
}

Third Edit:

This is an interesting SSSE3 parallel 4-bit lookup algorithm. The idea is from Wojciech Muła; the implementation is from Marat Dukhan's answer. Thanks to @Apriori for reminding me of this algorithm. Below is the heart of the algorithm. It is very clever: it counts bits per byte by using an SSE register as a 16-way lookup table, with the lower nibbles selecting which table cells are read, and then sums the counts.

static inline __m128i hamming128(__m128i a, __m128i b) {
  static const __m128i popcount_mask = _mm_set1_epi8(0x0F);
  static const __m128i popcount_table = _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
  const __m128i x = _mm_xor_si128(a, b);
  // look up the popcount of the low nibble and the high nibble of every byte, then add them
  const __m128i pcnt0 = _mm_shuffle_epi8(popcount_table, _mm_and_si128(x, popcount_mask));
  const __m128i pcnt1 = _mm_shuffle_epi8(popcount_table, _mm_and_si128(_mm_srli_epi16(x, 4), popcount_mask));
  return _mm_add_epi8(pcnt0, pcnt1);  // per-byte bit counts of a ^ b
}
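
hamming128 only returns per-byte counts, so a driver loop still has to accumulate them and reduce to a scalar. Here is a minimal sketch of such a loop (my addition, not part of Marat Dukhan's answer): per-byte counts are accumulated with _mm_add_epi8 for at most 31 blocks at a time (31 * 8 = 248, so no byte can overflow), then folded into two 64-bit sums with _mm_sad_epu8. It assumes 16-byte aligned input and an even size64, as in the rest of the question.

uint64_t correlation_ssse3(const uint64_t *v1, const uint64_t *v2, int size64) {
  const __m128i *a = reinterpret_cast<const __m128i*>(v1);
  const __m128i *b = reinterpret_cast<const __m128i*>(v2);
  const int blocks = size64 / 2;            // number of 128-bit blocks
  __m128i total = _mm_setzero_si128();      // two running 64-bit sums
  int i = 0;
  while (i < blocks) {
    __m128i acc = _mm_setzero_si128();      // per-byte accumulator, flushed before it can overflow
    int chunk = (blocks - i < 31) ? (blocks - i) : 31;
    for (int k = 0; k < chunk; ++k, ++i)
      acc = _mm_add_epi8(acc, hamming128(_mm_load_si128(a + i), _mm_load_si128(b + i)));
    // horizontally sum the 16 byte counts into the two 64-bit lanes of total
    total = _mm_add_epi64(total, _mm_sad_epu8(acc, _mm_setzero_si128()));
  }
  return _mm_cvtsi128_si64(total) + _mm_extract_epi64(total, 1);
}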

In my tests this version is on par with the hardware popcount version: slightly faster on smaller inputs, slightly slower on larger ones. I think it should really shine if implemented with AVX2. But I don't have time for this; if anyone is up to it, I would love to hear their results.
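
For anyone who wants to try, the 256-bit core would presumably look something like the untested sketch below (my addition). Note that vpshufb shuffles within each 128-bit lane, so the nibble table just has to be duplicated in both lanes; accumulation and reduction would follow the same pattern as the driver loop above, using _mm256_sad_epu8 instead.

static inline __m256i hamming256(__m256i a, __m256i b) {
  const __m256i mask  = _mm256_set1_epi8(0x0F);
  const __m256i table = _mm256_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
                                         0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
  const __m256i x     = _mm256_xor_si256(a, b);
  const __m256i pcnt0 = _mm256_shuffle_epi8(table, _mm256_and_si256(x, mask));
  const __m256i pcnt1 = _mm256_shuffle_epi8(table, _mm256_and_si256(_mm256_srli_epi16(x, 4), mask));
  return _mm256_add_epi8(pcnt0, pcnt1);  // per-byte bit counts of a ^ b
}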

mdakin
  • These are probably also already done by the compiler. But you could pre-calculate `size64/2` and move all allocations (`j` and `x`) outside of the loop. Depending on the number of variables whose bits you're counting, you might be able to use (google for) popcount using SSE registers instead of integer registers. As always, you should be checking performance yourself. – inetknght Dec 31 '14 at 15:23
  • That union trick is likely to result in bad code. Even if it doesn't go through a memory temporary (but it likely will), it will still be bad. – harold Dec 31 '14 at 15:26
  • @inetknght: Thanks, I am only looking for ideas. As you guessed, those optimisations are made by the compiler, and taking the union out actually hurts performance. Also, I am already using SSE registers. – mdakin Dec 31 '14 at 15:33
  • @harold I actually tried using __m128i variables, _mm_load_si128 to load values, and _mm_extract_epi64 to access the 64-bit parts of the registers. That version ended up worse. – mdakin Dec 31 '14 at 15:36
  • If you're willing to delve into assembly (are you really trying to edge *that* much performance out?), this seems like a nifty article: http://wm.ite.pl/articles/sse-popcount.html – inetknght Dec 31 '14 at 15:42
  • @inetknght you can do most of that with intrinsics too, of course. And mdakin, look at the disassembly, there's really no point in doing any of this if you don't check it. – harold Dec 31 '14 at 15:57
  • @harold will do that. Thanks. – mdakin Dec 31 '14 at 16:33
  • Don't you actually want `union { __m128i s; uint64_t b[2];} x;`? As it is, I think `b1` and `b2` occupy the same storage. – Mike Dunlavey Dec 31 '14 at 16:37
  • @MikeDunlavey correct, I missed it totally :) Thanks. – mdakin Dec 31 '14 at 17:06
  • No surprise it becomes hard to improve this; there are only about four arithmetic instructions, so the bottleneck is probably in memory fetches. If your data is prone to it, maybe you can consider run-length coding? How are your 0's and 1's distributed? –  Dec 31 '14 at 17:27
  • @YvesDaoust Unfortunately pretty much random. – mdakin Dec 31 '14 at 17:51
  • @YvesDaoust also, I actually tried prefetching as well, but it didn't help; this code is probably an easy target for the hardware prefetcher. This code processes ~6.4 million bits in ~30 microseconds, which is about 3 GB/s on my machine. I think DDR3 can provide a bit more; I will have an update after I see the assembly dump. – mdakin Dec 31 '14 at 18:09
  • And how are the results distributed ? No bias towards 0 ? –  Dec 31 '14 at 18:15
  • @YvesDaoust No, the density of 1s is high enough to prevent any nice tricks there. What would you do if the bias was towards 0 or 1? Just curious. – mdakin Dec 31 '14 at 18:20
  • I would try any trick to skip zero words and use a sentinel word to modify the loop termination condition to shortcut the counting. –  Dec 31 '14 at 18:25
  • Possible duplicate of [Bit popcount for large buffer, with Core 2 CPU (SSSE3)](https://stackoverflow.com/questions/3693981/bit-popcount-for-large-buffer-with-core-2-cpu-ssse3) – phuclv Feb 17 '19 at 04:44

2 Answers


The problem is that popcnt (which is what __builtin_popcount compiles to on Intel CPUs) operates on the integer registers. This causes the compiler to issue instructions to move data between the SSE and integer registers. I'm not surprised that the non-SSE version is faster, since the ability to move data between the vector and integer registers is quite limited/slow.

// popcnt() here stands for the 64-bit hardware popcount, e.g. _mm_popcnt_u64 or __builtin_popcountll
uint64_t count_set_bits(const uint64_t *a, const uint64_t *b, size_t count)
{
    uint64_t sum = 0;
    for(size_t i = 0; i < count; i++) {
        sum += popcnt(a[i] ^ b[i]);
    }
    return sum;
}

This runs at approximately 2.36 clocks per item on small data sets (fits in cache). I think it runs slow because of the 'long' dependency chain on sum, which restricts the CPU's ability to handle more things out of order. We can improve it by manually pipelining the loop:

uint64_t count_set_bits_2(const uint64_t *a, const uint64_t *b, size_t count)
{
    uint64_t sum = 0, sum2 = 0;

    for(size_t i = 0; i < count; i+=2) {
        sum  += popcnt(a[i  ] ^ b[i  ]);
        sum2 += popcnt(a[i+1] ^ b[i+1]);
    }
    return sum + sum2;
}

This runs at 1.75 clocks per item. My CPU is a Sandy Bridge model (i7-2820QM fixed @ 2.4 GHz).

How about four-way pipelining? That's 1.65 clocks per item. What about 8-way? 1.57 clocks per item. We can derive that the runtime per item is (1.5n + 0.5) / n, where n is the number of pipelines in our loop. I should note that for some reason 8-way pipelining performs worse than the others when the dataset grows; I have no idea why. The generated code looks okay.
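
A 4-way variant is just the same pattern with one independent accumulator per stream; a sketch (assuming count is a multiple of 4) looks like this and corresponds to the four-accumulator assembly listing further down:

uint64_t count_set_bits_4(const uint64_t *a, const uint64_t *b, size_t count)
{
    uint64_t sum1 = 0, sum2 = 0, sum3 = 0, sum4 = 0;

    for(size_t i = 0; i < count; i+=4) {
        sum1 += popcnt(a[i  ] ^ b[i  ]);
        sum2 += popcnt(a[i+1] ^ b[i+1]);
        sum3 += popcnt(a[i+2] ^ b[i+2]);
        sum4 += popcnt(a[i+3] ^ b[i+3]);
    }
    return sum1 + sum2 + sum3 + sum4;
}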

Now if you look carefully, there is one xor, one add, one popcnt, and one mov instruction per item. There is also one lea instruction per loop (and one branch and decrement, which I'm ignoring because they're pretty much free).

$LL3@count_set_:
; Line 50
    mov rcx, QWORD PTR [r10+rax-8]
    lea rax, QWORD PTR [rax+32]
    xor rcx, QWORD PTR [rax-40]
    popcnt  rcx, rcx
    add r9, rcx
; Line 51
    mov rcx, QWORD PTR [r10+rax-32]
    xor rcx, QWORD PTR [rax-32]
    popcnt  rcx, rcx
    add r11, rcx
; Line 52
    mov rcx, QWORD PTR [r10+rax-24]
    xor rcx, QWORD PTR [rax-24]
    popcnt  rcx, rcx
    add rbx, rcx
; Line 53
    mov rcx, QWORD PTR [r10+rax-16]
    xor rcx, QWORD PTR [rax-16]
    popcnt  rcx, rcx
    add rdi, rcx
    dec rdx
    jne SHORT $LL3@count_set_

You can check with Agner Fog's optimization manual that an lea is half a clock cycle in throughput, and the mov/xor/popcnt/add combo is apparently 1.5 clock cycles, although I don't fully understand why exactly.

Unfortunately, I think we're stuck here. The PEXTRQ instruction is what's usually used to move data from the vector registers to the integer registers, and we can fit this instruction and one popcnt instruction neatly into one clock cycle. Add one integer add instruction and our pipeline is at minimum 1.33 cycles long, and we still need to fit a vector load and xor in there somewhere... If Intel offered instructions to move multiple registers between the vector and integer registers at once, it would be a different story.

I don't have an AVX2 CPU at hand (xor on 256-bit vector registers is an AVX2 feature), but my vectorized-load implementation performs quite poorly at small data sizes and reached a minimum of 1.97 clock cycles per item.

For reference, these are my benchmarks: [benchmark chart image]

"pipe 2", "pipe 4" and "pipe 8" are 2, 4 and 8-way pipelined versions of the code shown above. The poor showing of "sse load" appears to be a manifestation of the lzcnt/tzcnt/popcnt false dependency bug which gcc avoided by using the same register for input and output. "sse load 2" follows below:

uint64_t count_set_bits_4sse_load(const uint64_t *a, const uint64_t *b, size_t count)
{
    uint64_t sum1 = 0, sum2 = 0;

    for(size_t i = 0; i < count; i+=4) {
        __m128i tmp = _mm_xor_si128(
                    _mm_load_si128(reinterpret_cast<const __m128i*>(a + i)),
                    _mm_load_si128(reinterpret_cast<const __m128i*>(b + i)));
        sum1 += popcnt(_mm_extract_epi64(tmp, 0));
        sum2 += popcnt(_mm_extract_epi64(tmp, 1));
        tmp = _mm_xor_si128(
                    _mm_load_si128(reinterpret_cast<const __m128i*>(a + i+2)),
                    _mm_load_si128(reinterpret_cast<const __m128i*>(b + i+2)));
        sum1 += popcnt(_mm_extract_epi64(tmp, 0));
        sum2 += popcnt(_mm_extract_epi64(tmp, 1));
    }

    return sum1 + sum2;
}
Stefan
  • How about using `pshufb` for the popcnt to stay in xmm regs? How does that compare? – harold Jan 01 '15 at 18:05
  • with `pshufb` you'll still have to transfer the data from the xmm register to the integer ('general purpose') registers before you can use `popcnt`. I'm not sure what you're trying to tell me. – Stefan Jan 01 '15 at 18:34
  • There is no transfer. The `pshufb` (well, two of them) implement the popcnt, so there is no actual `popcnt`. – harold Jan 01 '15 at 18:37
  • Ahh. I see your point. Unfortunately, I can't seem to get it faster than the default implementation (which is already much slower than the 4-way pipelined version). I'm using this sse-popcnt implementation: http://stackoverflow.com/a/17531164/3371224. – Stefan Jan 01 '15 at 22:16
  • Pity, worth the try though – harold Jan 01 '15 at 22:25
  • @Stefan Thanks for excellent analysis, actually after asking this question I tried many different combinations, but with gcc options "-march=native", "-msse4", "-funroll-loops", "-fprefetch-loop-arrays" unfortunately I did not see a real improvement over the original simple loop. But I will read your analysis and code carefully, maybe I missed something else. – mdakin Jan 02 '15 at 20:03
  • @Stefan What I mean is, I already tried your variation as well, but it seems my compiler already does all the tricks. So in the end I did not see any improvement with these changes. But thanks again for your beautiful analysis. – mdakin Jan 03 '15 at 13:42
  • @harold, Actually the SSSE3-based version ended up nearly as fast as the pipelined version of hw popcount. If there were an AVX version, I am almost certain it would beat it by some healthy margin. (I added an edit to the original question) – mdakin Jan 05 '15 at 12:47
  • @mdakin: AVX2 vector popcnt has possibilities, esp. when you want to do other vectorizable ops like XOR. However, one 8B read per clock is all you need to saturate main memory with a single core, on a typical CPU. If your data is hot in cache, then you can go faster with better code. – Peter Cordes Aug 20 '15 at 20:48

Have a look here. There is an SSSE3 version that beats the popcnt instruction by a lot. I'm not sure but you may be able to extend it to AVX as well.

Apriori
  • I looked at those before, but I am very suspicious of the results. The tests were done 4-5 years ago, and it looks like compilers got a lot better in 4 years. But I will try to give it a shot anyway. Thanks. – mdakin Jan 03 '15 at 14:35
  • Also, according to http://www.strchr.com/crc32_popcnt?allcomments=1#comment_482 it was only 3% faster. – mdakin Jan 03 '15 at 14:43
  • @mdakin My apologies if it's false advertising. Last time I looked at this I was implementing the CRC32 half of the article. I used the instruction, and did a 3-way slice (I had to use read barriers to keep the compiler from reordering the instructions) and it did nearly triple in speed. Though I think the slicing they are talking about here is something else. Anyway, that's not the part you're interested in of course. – Apriori Jan 03 '15 at 22:47
  • @mdakin It is a little suspicious that the unrolling helps so much now that you mention it. Any reasonable compiler would unroll the loop of course. Maybe they only tested their assembly code. Since you're counting so many bits, and not just one word you can probably make some optimizations here. For instance having code for independent operations inside the loop on side-by-side data may pipeline nicely. You'd want to look up latency and throughput of the instruction, or just experimentally determine the sweet spot. – Apriori Jan 03 '15 at 22:48
  • @mdakin Also, it looks like they are adding the pop-counts into a byte vector (up to 32 times to avoid overflow) before summing the components for the final value; I would think this would help a lot. That is, inside the innermost loop you have just _mm_shuffle_epi8 and _mm_adds_epu8. And maybe unroll the loop in various sizes; I've found that sometimes it helps even with modern compilers. I'd be curious what you find. I'd like to spend some time trying to find the fastest way to do this, unfortunately I'm not sure I have it. I'd be interested to know what you find. – Apriori Jan 03 '15 at 22:49
  • I am implementing this now; I think this algorithm really shines on AVX, but let's see how it does with SSSE3. Will report back results. – mdakin Jan 04 '15 at 08:59
  • I tried the algorithm (updated the question); the SSSE3 version ended up slightly faster than the hardware popcount version on smaller inputs. AVX would probably beat it. But I don't have time for this now. Thanks for this suggestion; it is a very sweet algorithm. – mdakin Jan 05 '15 at 12:50
  • @mdakin No problem. I'm glad it helped a little bit at least. I would suspect the same as you about the AVX version. It's a shame that to properly test relative perf of this one has to write up all the other algorithms to test against; I agree, very time consuming. – Apriori Jan 05 '15 at 16:23