Compute sum of bits efficiently with SSE

Question

I have done a calculation using SSE to improve the performance of my code, of which I include a minimal working example. I have included comments and the compilation line to make it as clear as possible, please ask if you need any clarification.

I am trying to sum N bits, bit[0], ..., bit[N-1], and write the result in binary in a vector result[0], ..., result[bits_N-1], where bits_N is the number of bits needed to write N in binary. This sum is performed bit-by-bit: each bit[i] is an unsigned long long int, and into its j-th bit is stored either 0 or 1. As a result, I make 64 sums, each of N bit, in parallel.

In lines 80-105 I make this sum by using 64-bit arithmetic.

In lines 107-134 I do it by using SSE: I store the first half of the sum bit[0], ...., bit[N/2-1] in the first 64 bits of _m128i objects BIT[0], ..., BIT[N/2-1], respectively. Similarly, I store bit[N/2], ...., bit[N-1] in the last 64 bits of BIT[0], ..., BIT[N/2-1], respectively, and sum all the BITs. So far everything works fine, and the 128-bit sum takes the same time as the 64-bit one. However, to collect the final result I need to sum the two halves to each other, see lines 125-132. This takes a long time, and makes me lose the gain obtained with SSE.

I am running this on an Intel(R) i7-4980HQ CPU @ 2.80GHz with gcc 7.2.0.

Do you know a way around this?

The real issue is how much SSE calculation you are doing before you need to 'spill to integers'. If you aren't doing enough work, then the moving of data back to integer space is likely going to do just what you describe in terms of overall performance. Remember also that ``_mm_extract_epi64`` is an SSE4 instruction, so it's not necessarily going to be supported on as many target machines. See [this blog post](https://blogs.msdn.microsoft.com/chuckw/2012/09/11/directxmath-sse4-1-and-sse4-2/) — Chuck Walbourn, Oct 23 '18 at 16:35
Since you're counting bits (I think?) you might like [this paper](https://arxiv.org/pdf/1611.07612.pdf) — harold, Oct 23 '18 at 18:03
Possible duplicate of [Counting 1 bits (population count) on large data using AVX-512 or AVX-2](https://stackoverflow.com/a/50085197). Or my answer on [Fastest way to do horizontal float vector sum on x86](https://stackoverflow.com/a/35270026) shows how to do integer horizontal sums; in your case you'd want a shuffle + paddq. — Peter Cordes, Oct 23 '18 at 19:58
@ Peter Cordes, the examples that you refer to are horizontal sums, while mine is a vertical sum, where 64 sums are made in parallel, see my explanayion. For the same reason, shuffle + padq will not work in my case. — James, Oct 24 '18 at 07:46

score 1 · Answer 1 · answered Oct 23 '18 at 16:51

The low part can be trivially saved with movq instruction or either _mm_storel_epi64 (__m128i* mem_addr, __m128i a); intrinsic storing to memory, or _mm_cvtsi128_si64 storing to register.

There is also a _mm_storeh_pd counterpart, which requires cast to pd and may cause a stall due to mixing floating points and integers.

The top part can be of course moved to low part with _mm_shuffle_epi(src, 0x4e) and then saved with movq.

Compute sum of bits efficiently with SSE

1 Answers1