I am using SSE to improve the performance of a calculation in my code, for which I include a minimal working example. I have added comments and the compilation line to make it as clear as possible; please ask if you need any clarification.
I am trying to sum N bits, bit[0], ..., bit[N-1], and write the result in binary in a vector result[0], ..., result[bits_N-1], where bits_N is the number of bits needed to write N in binary. This sum is performed bit-by-bit: each bit[i] is an unsigned long long int whose j-th bit stores either 0 or 1. As a result, I compute 64 sums, each of N bits, in parallel.
In lines 80-105 I make this sum by using 64-bit arithmetic.
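To make the layout concrete, here is a minimal sketch of such a 64-lane bit-sliced sum, written as a ripple of half adders over the planes of result. It is an illustration under my own assumptions, not the code at lines 80-105: the function name bitsliced_sum64 is hypothetical, and uint64_t stands in for unsigned long long int.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: add N one-bit planes into a bit-sliced counter.
 * Lane j (bit j of every word) is an independent sum; result[k] holds
 * bit k of all 64 sums. */
void bitsliced_sum64(const uint64_t *bit, size_t N,
                     uint64_t *result, size_t bits_N)
{
    for (size_t k = 0; k < bits_N; ++k)
        result[k] = 0;

    for (size_t i = 0; i < N; ++i) {
        uint64_t carry = bit[i];               /* lanes that add 1 */
        for (size_t k = 0; k < bits_N && carry; ++k) {
            uint64_t t = result[k] & carry;    /* carry out of plane k */
            result[k] ^= carry;                /* sum bit of plane k */
            carry = t;
        }
        assert(carry == 0);  /* the counter must be wide enough for N */
    }
}
```

With bit = {5, 7, 1} and bits_N = 2, lane 0 sums to 3, lane 1 to 1 and lane 2 to 2, so result[0] is 3 and result[1] is 5.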
In lines 107-134 I do it by using SSE: I store the first half of the summands, bit[0], ..., bit[N/2-1], in the first 64 bits of the __m128i objects BIT[0], ..., BIT[N/2-1], respectively. Similarly, I store bit[N/2], ..., bit[N-1] in the last 64 bits of BIT[0], ..., BIT[N/2-1], respectively, and sum all the BITs. So far everything works fine, and the 128-bit sum takes the same time as the 64-bit one. However, to collect the final result I need to add the two halves together; see lines 125-132. This takes a long time and cancels the gain obtained with SSE.
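For concreteness, the scheme described above could be sketched as follows. This is a hedged illustration, not the code at lines 107-134: the names bitsliced_sum128 and combine_halves are hypothetical, N is assumed to be even, planes is the number of planes needed for each half sum, and the output needs planes + 1 entries. The intrinsics are standard SSE2; _mm_cvtsi128_si64 is available only on x86-64.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sum bit[0..N/2-1] in the low 64 bits and
 * bit[N/2..N-1] in the high 64 bits of the bit-sliced counter RESULT. */
void bitsliced_sum128(const uint64_t *bit, size_t N,
                      __m128i *RESULT, size_t planes)
{
    assert(N % 2 == 0);                 /* assumption: N is even */
    size_t half = N / 2;

    for (size_t k = 0; k < planes; ++k)
        RESULT[k] = _mm_setzero_si128();

    for (size_t i = 0; i < half; ++i) {
        /* _mm_set_epi64x(high, low) */
        __m128i carry = _mm_set_epi64x((long long)bit[half + i],
                                       (long long)bit[i]);
        for (size_t k = 0; k < planes; ++k) {
            __m128i t = _mm_and_si128(RESULT[k], carry);
            RESULT[k] = _mm_xor_si128(RESULT[k], carry);
            carry = t;
        }
    }
}

/* Hypothetical helper: add the low and high halves of every plane with a
 * ripple of full adders; result must have room for planes + 1 words. */
void combine_halves(const __m128i *RESULT, size_t planes, uint64_t *result)
{
    uint64_t carry = 0;
    for (size_t k = 0; k < planes; ++k) {
        uint64_t a = (uint64_t)_mm_cvtsi128_si64(RESULT[k]);
        uint64_t b = (uint64_t)_mm_cvtsi128_si64(
                         _mm_unpackhi_epi64(RESULT[k], RESULT[k]));
        uint64_t x = a ^ b;
        result[k] = x ^ carry;              /* sum bit of plane k */
        carry = (a & b) | (x & carry);      /* carry into plane k + 1 */
    }
    result[planes] = carry;
}
```

In this sketch the combining step runs once and touches each plane a single time, so its cost grows with the number of planes rather than with N.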
I am running this on an Intel(R) i7-4980HQ CPU @ 2.80GHz with gcc 7.2.0.
Do you know a way around this?