I have large in-memory array as some pointer uint64_t * arr
(plus size), which represents plain bits. I need to very efficiently (most performant/fast) shift these bits to the right by some amount from 0 to 63.
By shifting whole array I mean not to shift each element (like a[i] <<= Shift
), but to shift it as a single large bit vector. In other words for each intermediate position i
(except for first and last element) I can do following in a loop:
dst[i] = w | (src[i] << Shift);
w = src[i] >> (64 - Shift);
where w
is some temporary variable, holding right-shifted value of previous array element.
This solution above is simple and obvious. But I need something more efficient as I have giga-bytes of data.
Ideally would be to use some SIMD instructions for that, so I'm looking for SIMD suggestions from experts. I need to implement shifting code for all four types of popular instruction sets - SSE-SSE4.2 / AVX / AVX-2 / AVX-512.
But as far as I know for example for SSE2 there exists only _mm_slli_si128() intrinsic/instruction, which shifts only by amount multiple of 8 (in other words byte-shifting). And I need shifting by arbitrary bit-size, not only byte-shift.
Without SIMD I can shift also by 128 bits at once through using shld reg, reg, reg
instruction, which allows to do 128-bit shifting. It is implemented as intrinsic __shiftleft128() in MSVC, and produces assembler code that can be seen here.
BTW, I need solutions for all of MSVC/GCC/CLang.
Also inside single loop iteration I can shift 4 or 8 words in sequential operations, this will use CPU pipelining to speedup parallel out-of-order execution of several instructions.
If needed my bit vector can be aligned to any amount of bytes in memory, if this will help for example to improve SIMD speed by doing aligned reads/writes. Also source and destination bit vector memory are different (non-overlapping).
In other words I'm looking for all the suggestions about how to solve my task most efficiently (most performantly) on different Intel CPUs.
Note, to clarify, I actually have to do several shift-ors, not just single shift. I have large bit vector X
, and several hundreds of shift sizes s0, s1, ..., sN
, where each shift size is different and can be also large (for example shift by 100K bits), then I want to compute resulting large bit vector Y = (X << s0) | (X << s1) | ... | (X << sN)
. I just simplified my question for StackOverflow to shifting single vector. But probably this detail about original task is very important.
As requested by @Jake'Alquimista'LEE, I decided to implement a ready-made toy minimal reproducible example of what I want to do, computing shift-ors of input bit vector src
to produced or-ed final dst
bit vector. This example is not optimized at all, just a straightforward simple variant of how my task can be solved. For simplicity this example has small size of input vector, not giga-bytes as in my case. It is a toy example, I didn't check if it solves task correctly, it may contain minor bugs:
#include <cstdint>
#include <vector>
#include <random>
#define bit_sizeof(x) (sizeof(x) * 8)
using u64 = uint64_t;
using T = u64;
int main() {
std::mt19937_64 rng{123};
// Random generate source bit vector
std::vector<T> src(100'000);
for (size_t i = 0; i < src.size(); ++i)
src[i] = rng();
size_t const src_bitsize = src.size() * bit_sizeof(T);
// Destination bit vector, for example twice bigger in size
std::vector<T> dst(src.size() * 2);
// Random generate shifts
std::vector<u64> shifts(200);
for (size_t i = 0; i < shifts.size(); ++i)
shifts[i] = rng() % src_bitsize;
// Right-shift that handles overflow
auto Shr = [](auto x, size_t s) {
return s >= bit_sizeof(x) ? 0 : (x >> s);
};
// Do actual Shift-Ors
for (auto orig_shift: shifts) {
size_t const
word_off = orig_shift / bit_sizeof(T),
bit_off = orig_shift % bit_sizeof(T);
if (word_off >= dst.size())
continue;
size_t const
lim = std::min(src.size(), dst.size() - word_off);
T w = 0;
for (size_t i = 0; i < lim; ++i) {
dst[word_off + i] |= w | (src[i] << bit_off);
w = Shr(src[i], bit_sizeof(T) - bit_off);
}
// Special case of handling for last word
if (word_off + lim < dst.size())
dst[word_off + lim] |= w;
}
}
My real project's current code is different from toy example above. This project already solves correctly a real-world task. I just need to do extra optimizations. Some optimizations I already did, like using OpenMP
to parallelize shift-or operations on all cores. Also as said in comments, I created specialized templated functions for each shift size, 64 functions in total, and choosing one of 64 functions to do actual shift-or. Each C++ function has compile time value of shift size, hence compiler does extra optimizations taking into account compile time values.