Efficiently shift-or large bit vector

Question

I have large in-memory array as some pointer uint64_t * arr (plus size), which represents plain bits. I need to very efficiently (most performant/fast) shift these bits to the right by some amount from 0 to 63.

By shifting whole array I mean not to shift each element (like a[i] <<= Shift), but to shift it as a single large bit vector. In other words for each intermediate position i (except for first and last element) I can do following in a loop:

dst[i] = w | (src[i] << Shift);
w = src[i] >> (64 - Shift);

where w is some temporary variable, holding right-shifted value of previous array element.

This solution above is simple and obvious. But I need something more efficient as I have giga-bytes of data.

Ideally would be to use some SIMD instructions for that, so I'm looking for SIMD suggestions from experts. I need to implement shifting code for all four types of popular instruction sets - SSE-SSE4.2 / AVX / AVX-2 / AVX-512.

But as far as I know for example for SSE2 there exists only _mm_slli_si128() intrinsic/instruction, which shifts only by amount multiple of 8 (in other words byte-shifting). And I need shifting by arbitrary bit-size, not only byte-shift.

Without SIMD I can shift also by 128 bits at once through using shld reg, reg, reg instruction, which allows to do 128-bit shifting. It is implemented as intrinsic __shiftleft128() in MSVC, and produces assembler code that can be seen here.

BTW, I need solutions for all of MSVC/GCC/CLang.

Also inside single loop iteration I can shift 4 or 8 words in sequential operations, this will use CPU pipelining to speedup parallel out-of-order execution of several instructions.

If needed my bit vector can be aligned to any amount of bytes in memory, if this will help for example to improve SIMD speed by doing aligned reads/writes. Also source and destination bit vector memory are different (non-overlapping).

In other words I'm looking for all the suggestions about how to solve my task most efficiently (most performantly) on different Intel CPUs.

Note, to clarify, I actually have to do several shift-ors, not just single shift. I have large bit vector X, and several hundreds of shift sizes s0, s1, ..., sN, where each shift size is different and can be also large (for example shift by 100K bits), then I want to compute resulting large bit vector Y = (X << s0) | (X << s1) | ... | (X << sN). I just simplified my question for StackOverflow to shifting single vector. But probably this detail about original task is very important.

As requested by @Jake'Alquimista'LEE, I decided to implement a ready-made toy minimal reproducible example of what I want to do, computing shift-ors of input bit vector src to produced or-ed final dst bit vector. This example is not optimized at all, just a straightforward simple variant of how my task can be solved. For simplicity this example has small size of input vector, not giga-bytes as in my case. It is a toy example, I didn't check if it solves task correctly, it may contain minor bugs:

Try it online!

#include <cstdint>
#include <vector>
#include <random>

#define bit_sizeof(x) (sizeof(x) * 8)

using u64 = uint64_t;
using T = u64;

int main() {
    std::mt19937_64 rng{123};

    // Random generate source bit vector
    std::vector<T> src(100'000);
    for (size_t i = 0; i < src.size(); ++i)
        src[i] = rng();

    size_t const src_bitsize = src.size() * bit_sizeof(T);

    // Destination bit vector, for example twice bigger in size
    std::vector<T> dst(src.size() * 2);

    // Random generate shifts
    std::vector<u64> shifts(200);
    for (size_t i = 0; i < shifts.size(); ++i)
        shifts[i] = rng() % src_bitsize;

    // Right-shift that handles overflow
    auto Shr = [](auto x, size_t s) {
        return s >= bit_sizeof(x) ? 0 : (x >> s);
    };

    // Do actual Shift-Ors
    for (auto orig_shift: shifts) {
        size_t const
            word_off = orig_shift / bit_sizeof(T),
            bit_off = orig_shift % bit_sizeof(T);

        if (word_off >= dst.size())
            continue;
        
        size_t const
            lim = std::min(src.size(), dst.size() - word_off);

        T w = 0;
        
        for (size_t i = 0; i < lim; ++i) {
            dst[word_off + i] |= w | (src[i] << bit_off);
            w = Shr(src[i], bit_sizeof(T) - bit_off);
        }

        // Special case of handling for last word
        if (word_off + lim < dst.size())
            dst[word_off + lim] |= w;
    }
}

My real project's current code is different from toy example above. This project already solves correctly a real-world task. I just need to do extra optimizations. Some optimizations I already did, like using OpenMP to parallelize shift-or operations on all cores. Also as said in comments, I created specialized templated functions for each shift size, 64 functions in total, and choosing one of 64 functions to do actual shift-or. Each C++ function has compile time value of shift size, hence compiler does extra optimizations taking into account compile time values.

If you have gigabytes of data, aren't you memory-bound anyway? And shouldn't you be considering merging this operation with the previous or next one, unless both of those are somehow restricted by hardware or so? — Davis Herring, Nov 20 '21 at 07:11
@DavisHerring If I'm memory bound with single shift of vector, then at least maybe I'm not memory bound if I have to shift-or several vectors? For example if I have large bit vector `X`, and several shift sizes `s0, s1, ..., sN` then I want to compute `Y = (X << s0) | (X << s1) | ... | (X << sN)`. — Arty, Nov 20 '21 at 07:15
Have a look at GMP's low-level `mpn` lshift / rshift functions: https://gmplib.org/repo/gmp/file/tip/mpn/x86_64/fastsse/lshiftc-movdqu2.asm - you could use GMP (GPL license) or maybe translate their loop into intrinsincs. — Peter Cordes, Nov 20 '21 at 07:15
Obviously if the shift-count is a multiple of 8, this becomes memmove + memset which can be very fast on modern CPUs that handle unaligned SIMD loads/stores efficiently. — Peter Cordes, Nov 20 '21 at 07:18
@PeterCordes Unfortunately, I have arbitrary shift sizes. Of course I can have if-branch for the case of multiply of 8 shifts. Please see my **Note**, I just added this note to the end of my question, it tells about original non-simplified task. — Arty, Nov 20 '21 at 07:20
Ok, yeah, you'll want to shift on the fly as your read inputs in groups of maybe 4 to 8, not actually store each bit-aligned temporary array back to memory (unless you're going to reuse the same shift-count soon). — Peter Cordes, Nov 20 '21 at 07:20
And yeah you'll probably want to incorporate a way to handle the special %8 case for some inputs, maybe by grouping together the pointers that go with the no-shift case? Or no, you want to overlap ALU and memory, so maybe have a version of a 4 or 5-input function that allows one of them to be unshifted so you just offset the pointer. — Peter Cordes, Nov 20 '21 at 07:24
Hmm, maybe also look for two shift counts the same modulo 8, and OR them *before* shifting. (Preferably still on the fly, although this would start to balloon the number of versions of the function you need even with sorting, without JITing.) So again, maybe a version of the function where the first 2 inputs share a shift count (but allow different lengths so you can offset their pointers). Or if you sort before batching up into groups of 4 input streams, maybe it's likely that you'll have more than 2 shift counts the same. Updating one of them in-place un-shifted will keep it hot in cache — Peter Cordes, Nov 20 '21 at 07:28
@PeterCordes Yes, exactly I did like this already. I created C++ templated function specialization for all 64 shift sizes, so that shift size is `constexpr` value inside each function, it is template parameter. Then I created array of 64 instances of `std::function` (in other words functions pointers) and just call one of these 64 functions depending on shift size. Through doing this way I can have shift size as compile time `constexpr` value, which will optimize instructions, and also can do specialized variants of functions for shift sizes multiple of 8. — Arty, Nov 20 '21 at 07:28
@You don't need AVX for this task. Find the LCM of shift and 64, and do multi-threading. The function will be bandwidth limited even at two threads. — Jake 'Alquimista' LEE, Nov 20 '21 at 07:31
@PeterCordes Nice idea with grouping two shift sizes that are equal modulo 8... BTW, all the multiple-of-8 optimizations will save me just 1/8 of time (12.5%), because I have random uniformly distributed shift sizes. — Arty, Nov 20 '21 at 07:31
@Jake'Alquimista'LEE Actually if you see my new **Note** in the Question, then you can notice that I need serveral Shift-Or operations on same input vector. Probably in this case, if I have many shift sizes, for example hundreds, then I will be still CPU bound, not memory bound. Because each final word in resulting vector has to be computed based on OR of hundreds of words, which is a CPU bound computation. — Arty, Nov 20 '21 at 07:34
@Jake'Alquimista'LEE Right now C++ algorithm is simple, I just compute `Y = (X << s0) | (X << s1) | ... | (X << sN)`, where `X` is large input bit vector, `Y` is large output bit vector and `s0, ..., sN` is a hundred of shift sizes from 0 to 1M. And I do this computation in two loops, outer loop goes through all `sK`, and inner loop does actual shift-or the way I showed in the code at the beginning of my question. Nothing else. This is not-optimized solution. Hence I'm looking for all suggestions regarding what can be done to optimize it. — Arty, Nov 20 '21 at 07:40
@Jake'Alquimista'LEE I don't post algorithm itself, because it is a part of bigger program, and it does different unrelated stuff in the middle, in other words I don't have ready-made example that can be shared. If I post whole my code then it is thousands of lines. Otherwise I have to implement again from scratch non-optimal clean minimal reproducible example here specially for my Question, which I didn't implement. — Arty, Nov 20 '21 at 07:42
You're going to need runtime-variable shift counts if you support groups of multiple vectors, otherwise the combinatorial explosion of template versions would be too much. For just one, yeah scalar `shld` is more efficient with constant shift counts than `cl` on Intel CPUs, and also AMD although `shld` in general is not very efficient on AMD, more of a win to use SIMD. You may want to only use 128-bit vectors, especially if AVX-512 isn't available. AVX2 has higher latency for lane-crossing shifts, and the available lane-crossing shifts are more limited. — Peter Cordes, Nov 20 '21 at 07:45
On Skylake, `vpsllvq xmm, xmm, xmm` (with a per-element count) is faster than `vpsllq xmm, xmm, xmm` (using the low element of the last operand as the count), presumably implemented as broadcast + SIMD-variable-shift uops. On Zen, either way is efficient. On all Intel, shift vector by vector is still 2 uops, but per-element variable shift is even worse at 3 instead of 1 on Haswell (the first with that instruction). https://agner.org/optimize/ https://uops.info/. But Haswell / Broadwell are pretty old by now, so if you require AVX2 at all, you might broadcast shift counts for vpsllvq. — Peter Cordes, Nov 20 '21 at 07:46
@Jake'Alquimista'LEE No, not 64 times or. See my **Note** in question. I have hundreds of shift sizes, each shift size is from 0 to 1M. It means I have to do hundreds of ORs. But because each large shift can be represented as 64-bit-word offset plus shift of 0 to 63 then I should do only shifts of 64 sizes, but still need hundreds of them plus ORs. — Arty, Nov 20 '21 at 07:48
@Arty then post a working loop, not in words!!! Post exactly what you need to do. — Jake 'Alquimista' LEE, Nov 20 '21 at 07:48
@Jake'Alquimista'LEE OK, I'll implement a toy example and post it. — Arty, Nov 20 '21 at 07:50
@Jake'Alquimista'LEE Just added to the very end of my Question a simple toy example algorithm, it is not optimized at all. It is close (but not identical) to what I have already in my project. In my real project I already did some optimizations, for example created array of 64 functions, representing inner of two loops, each specialized for compile time value of shift-size. Also already I did parallelization through `OpenMP`. — Arty, Nov 20 '21 at 08:20
You right-shift count actually needs to be `src[i] >> (64-Shift)` I think. (If you special-case mod8 shift counts such as zero, you don't have to deal with 64-0 = 64 out-of-range shift count undefined behaviour on uint64_t elements.) — Peter Cordes, Nov 20 '21 at 08:22
@PeterCordes, thanks, corrected my Question. I actually do `64-Shift` right-shift already in my project, I just forgot it here when was writing this post. My project already works and solves real-world task correctly without bugs. I just need optimizations. — Arty, Nov 20 '21 at 08:25
GCC has no problem auto-vectorizing this for compile-time shift counts: https://godbolt.org/z/PKbhqjdM4 — chtz, Nov 20 '21 at 08:26
@chtz: Ah, interesting strategy from GCC, unaligned loads offset from each other by 8 bytes, allowing efficient use of 256-bit vectors. (For this shift of less than 64 bits). (vs clang doing it the hard way, with `vpalignr` / `vperm2i128`) — Peter Cordes, Nov 20 '21 at 08:28
@PeterCordes with AVX2 I always try to do interlane-shifts by offset loads (`vpalignr` is pretty much useless for this ...) This usually only works if the input comes directly from memory, of course. — chtz, Nov 20 '21 at 08:39
@PeterCordes Do you think that writing manual SIMD intrinsics code here is worth doing? Or GCC's auto-vectorization with loop unrollment is quite enough and will be about the same speed as most advanced manual intrinsics code? — Arty, Nov 20 '21 at 10:09
@Arty: If you're only ever going to compile with recent GCC versions, that's probably fine, as long as you get similarly good asm once you start writing versions that use the same shift count for multiple vectors. (If you have hundreds of inputs, many of them will have the same shift count %8 or even %16, so you should group those together). — Peter Cordes, Nov 20 '21 at 10:25
@PeterCordes One more question - is it possible somehow to mark certain pointer as guaranteed to be aligned? So that compiler may create aligned load/store instructions for SIMD. For example some imaginary attribute `__always_aligned__` like `uint64_t __always_aligned(64)__ * ptr = get_ptr();`, that says that my function `get_ptr()` will always return pointer such that its address `% 64 == 0`. Is there such attribute? Because my `std::vector` is always 64-bytes aligned, I wrote special aligned allocator for it. So there is no point in un-aligned SIMD load/store in auto-vectorization of GCC. — Arty, Nov 20 '21 at 11:05
@MarcStevens If you're interested, I created a [separate question](https://stackoverflow.com/questions/70045775/) regarding this alignment task. — Arty, Nov 20 '21 at 13:36

Marc Stevens · Answer 1 · 2021-11-20T12:33:42.413

You can, and possibly you don't even need to use SIMD instructions explicitly. The target compilers GCC, CLANG and MSVC and other compilers like ICC all support auto-vectorization. While hand-optimized assembly can outperform compiler generated vectorized instructions, it's generally harder to achieve and you may need several versions for different architectures. Generic code that leads to efficient auto-vectorized instructions is a solution that may be portable across many platforms.

For instance a simple shiftvec version

void shiftvec(uint64_t* dst, uint64_t* src, int size, int shift)
{
    for (int i = 0; i < size; ++i,++src,++dst)
    {
        *dst = ((*src)<<shift) | (*(src+1)>>(64-shift));
    }
}

compiled with a recent GCC (or CLANG works as well) and -O3 -std=c++11 -mavx2 leads to SIMD instructions in the core loop of the assembly

.L5:
  vmovdqu ymm4, YMMWORD PTR [rsi+rax]
  vmovdqu ymm5, YMMWORD PTR [rsi+8+rax]
  vpsllq ymm0, ymm4, xmm2
  vpsrlq ymm1, ymm5, xmm3
  vpor ymm0, ymm0, ymm1
  vmovdqu YMMWORD PTR [rdi+rax], ymm0
  add rax, 32
  cmp rax, rdx
  jne .L5

See on godbolt.org: https://godbolt.org/z/5TxhqMhnK

This also generalizes if you want to do combine multiple shifts in dst:

void shiftvec2(uint64_t* dst, uint64_t* src1, uint64_t* src2, int size1, int size2, int shift1, int shift2)
{
    int size = size1<size2 ? size1 : size2;
    for (int i = 0; i < size; ++i,++src1,++src2,++dst)
    {
        *dst = ((*src1)<<shift1) | (*(src1+1)>>(64-shift1));
        *dst |= ((*src2)<<shift2) | (*(src2+1)>>(64-shift2)); 
    }
    for (int i = size; i < size1; ++i,++src1,++dst)
    {
        *dst = ((*src1)<<shift1) | (*(src1+1)>>(64-shift1));        
    }
    for (int i = size; i < size2; ++i,++src2,++dst)
    {
        *dst = ((*src2)<<shift2) | (*(src2+1)>>(64-shift2));
    }
}

compiles to a core-loop:

.L38:
  vmovdqu ymm7, YMMWORD PTR [rsi+rcx]
  vpsllq ymm1, ymm7, xmm4
  vmovdqu ymm7, YMMWORD PTR [rsi+8+rcx]
  vpsrlq ymm0, ymm7, xmm6
  vpor ymm1, ymm1, ymm0
  vmovdqu YMMWORD PTR [rax+rcx], ymm1
  vmovdqu ymm7, YMMWORD PTR [rdx+rcx]
  vpsllq ymm0, ymm7, xmm3
  vmovdqu ymm7, YMMWORD PTR [rdx+8+rcx]
  vpsrlq ymm2, ymm7, xmm5
  vpor ymm0, ymm0, ymm2
  vpor ymm0, ymm0, ymm1
  vmovdqu YMMWORD PTR [rax+rcx], ymm0
  add rcx, 32
  cmp r10, rcx
  jne .L38

Combining multiple sources in one loop will reduce the total amount of memory bandwidth spent on loading/writing the destination. The limit in how many you can combine is of course limited by available registers. Note that xmm2 and xmm3 for shiftvec contain the shift values, so having different versions for compile-time known shift values may free those registers.

Additionally using __restrict (supported by GCC,CLANG,MSVC) for each of the pointers will tell the compiler that the ranges are not overlapping.

I initially had problems with MSVC giving proper auto vectorized code, but it seems adding more SIMD-like structure will make it work for all three desired compilers GCC, CLANG and MSVC:

void shiftvec(uint64_t* __restrict dst, const uint64_t* __restrict src, int size, int shift)
{
    int i = 0;
    // MSVC: use steps of 2 for SSE, 4 for AVX2, 8 for AVX512
    for (; i+4 < size; i+=4,dst+=4,src+=4)
    {
        for (int j = 0; j < 4; ++j)
            *(dst+j) = (*(src+j))<<shift;
        for (int j = 0; j < 4; ++j)
            *(dst+j) |= (*(src+1)>>(64-shift));
    }
    for (; i < size; ++i,++src,++dst)
    {
        *dst = ((*src)<<shift) | (*(src+1)>>(64-shift));
    }    
}

Unfortunately whereas MSVC does auto-vectorizing, can be told that pointers don't alias, and can even generate check for aliasing and generate two loop versions (aliased and vectorized), the vectorization capabilities themselves are very limited, it cannot vectorize this shift. (An option could be to compile this function with `clang-cl` and take advantage of binary compatibility of `clang-cl` and `cl`). — Alex Guteniev, Nov 20 '21 at 11:47
Yeah, I've tried to get MSVC to auto vectorize the code above. Weirdly godbolt shows for me that it works for msvc x86, but not for msvc x64. In the x64 case using `/Qvec-report:2` gives auto vectorization failure 1200. — Marc Stevens, Nov 20 '21 at 11:51
Interesting. It also fails to vectorize on x86 for 32-bit type https://godbolt.org/z/fTo3oT5n1 . I've reported missing optimization, maybe they'll do something with this someday https://developercommunity.visualstudio.com/t/Missing-optimization:-shift-loop-vectori/1590948 — Alex Guteniev, Nov 20 '21 at 12:02
I can confirm that even MSVC can autovectorize it properly using somewhat more SIMD-like structured code: ``` void shiftvec(uint64_t* __restrict dst, const uint64_t* __restrict src, int size, int shift) { int i = 0; for (; i+4 < size; i+=4,dst+=4,src+=4) { for (int j = 0; j < 4; ++j) *(dst+j) = (*(src+j))<>(64-shift)); } for (; i < size; ++i,++src,++dst) { *dst = ((*src)<>(64-shift)); } } ``` — Marc Stevens, Nov 20 '21 at 12:29
@AlexGuteniev: If you want good asm across compilers, take a good vectorization strategy from GCC (in this case), or clang, and convert that back to intrinsics. — Peter Cordes, Nov 20 '21 at 18:57

Aki Suihkonen · Answer 2 · 2022-05-11T08:04:20.140

4

I would attempt to rely on x64 ability to read from unaligned addresses, and to do that with almost no visible penalty when stars are properly (un)aligned. One would only need to handle a few cases of (shift % 8) or (shift % 16) -- all doable with SSE2 instruction set, fixing the remainder with zeros and having an unaligned offset to the data vector and addressing the UB by memcpy.

That said, the inner loop would look like:

uint16_t const *ptr;
auto a = _mm_loadu_si128((__m128i*)ptr);
auto b = _mm_loadu_si128((__m128i*)(ptr - 1);
a = _mm_srl_epi16(a, c);
b = _mm_sll_epi16(b, 16 - c);
_mm_storeu_si128((__m128i*)ptr, mm_or_si128(a,b));
ptr += 8;

Unrolling this loop a few times, one might be able to use _mm_alignr_epi8 on SSE3+ to relax memory bandwidth (and those pipeline stages that need to combine results from unaligned memory accesses):

auto a0 = w; 
auto a1 = _mm_load_si128(m128ptr + 1);
auto a2 = _mm_load_si128(m128ptr + 2);
auto a3 = _mm_load_si128(m128ptr + 3);
auto a4 = _mm_load_si128(m128ptr + 4);
auto b0 = _mm_alignr_epi8(a1, a0, 2);
auto b1 = _mm_alignr_epi8(a2, a1, 2);
auto b2 = _mm_alignr_epi8(a3, a2, 2);
auto b3 = _mm_alignr_epi8(a4, a3, 2);
// ... do the computation as above ...
w = a4;   // rotate the context

edited May 11 '22 at 08:04

answered Nov 20 '21 at 08:41

Aki Suihkonen

19,144
1
36
57

Surely one can use `_mm_sll_epi64` also to get exactly what you needed; my assumption would be anyway, that it's faster to read from unaligned addresses when `shift %8 ==0`, than to run this algorithm in the first place. – Aki Suihkonen Nov 20 '21 at 08:44
Is it faster to do 8 or 16 or 32 or 64 SSE bit shift instructions? Or they are the same in speed? – Arty Nov 20 '21 at 08:55
1

@Arty: They're all the same in speed on all CPUs, see https://uops.info/ and https://agner.org/optimize/ – Peter Cordes Nov 20 '21 at 10:04
1

As commented elsewhere too, Intel does not have 8-bit shifts at all; thus if needed, they are about 2x as slow, as they need to be emulated. But in this case any granularity will work exactly the same. – Aki Suihkonen Nov 20 '21 at 11:24
I think you have a small typo - instead of `b = _mm_sll_epi16(a, 16 - c);` should be `b = _mm_sll_epi16(b, 16 - c);` (`a` replaced with `b`). Also `b` should be `ptr - 1` if you're doing right shift, or swap `_mm_srl_epi16` for `a` and `_mm_sll_epi16` for `b` to do left shift. – Arty May 11 '22 at 07:40
Yes, you are right. fixed. – Aki Suihkonen May 11 '22 at 08:04

Brendan · Answer 3 · 2021-11-20T13:17:26.360

In other words I'm looking for all the suggestions about how to solve my task most efficiently (most performantly) on different Intel CPUs.

The key to efficiency is to be lazy. The key to being lazy is to lie - pretend you shifted without actually doing any shifting.

For an initial example (to illustrate the concept only), consider:

struct Thingy {
    int ignored_bits;
    uint64_t data[];
}

void shift_right(struct Thingy * thing, int count) {
    thing->ignored_bits += count;
}

void shift_left(struct Thingy * thing, int count) {
    thing->ignored_bits -= count;
}

int get_bit(struct Thingy * thing, int bit_number) {
    bit_number += thing->ignored_bits;
    return !!(thing->data[bit_number / 64] & (1 << bit_number % 64));
}

For practical code you'll need to care about various details - you'll probably want to start with spare bits at the start of the array (and non-zero ignored_bits) so that you can pretend to shift right; for each small shift you'll probably want to clear "shifted in" bits (otherwise it'll behave like floating point - e.g. (5.0 << 8) >> 8) == 5.0); if/when ignored_bits goes outside a certain range you'll probably want a large memcpy(); etc.

For more fun; abuse low level memory management - use VirtualAlloc() (Windows) or mmap() (Linux) to reserve a huge space, then put your array in the middle of the space, then allocate/free pages at the start/end of array as needed; so that you only need to memcpy() after the original bits have been "shifted" many billions of bits to the left/right.

Of course the consequence is that it's going to complicate other parts of your code - e.g. to OR 2 bitfields together you'll have to do a tricky "fetch A; shift A to match B; result = A OR B" adjustment. This isn't a deal breaker for performance.

Thanks for nice idea! UpVoted. I like idea about pretending to be shifted. Actually it can partially (not fully) help me in my task. Because I have several cumulative Shit-Or stages. And your lazy idea may help at first two or last two stages to save some memory. — Arty, Nov 20 '21 at 13:20

score 0 · Answer 4 · answered Nov 20 '21 at 09:35

0

#include <cstdint>
#include <immintrin.h>

template<unsigned Shift>
void foo(uint64_t* __restrict pDst, const uint64_t* __restrict pSrc, intptr_t size)
{
    uint64_t* pSrc0, * pSrc1, * pSrc2, * pSrc3, * pDst0, * pDst1, * pDst2, * pDst3;
    __m256i prev, current;
    intptr_t i, stride;

    stride = size >> 2;
    i = stride;

    pSrc0 = pSrc;
    pSrc1 = pSrc + stride;
    pSrc2 = pSrc + 2 * stride;
    pSrc2 = pSrc + 3 * stride;

    pDst0 = pDst;
    pDst1 = pDst + stride;
    pDst2 = pDst + 2 * stride;
    pDst3 = pDst + 3 * stride;

    prev = _mm256_set_epi64x(0, pSrc1[-1], pSrc2[-1], pSrc3[-1]);

    while (i--)
    {
        current = _mm256_set_epi64x(*pSrc0++, *pSrc1++, *pSrc2++, *pSrc3++);
        prev = _mm256_srli_epi64(prev, 64 - Shift);
        prev = _mm256_or_si256(prev, _mm256_slli_epi64(current, Shift));
        *pDst0++ = _mm256_extract_epi64(prev, 3);
        *pDst1++ = _mm256_extract_epi64(prev, 2);
        *pDst2++ = _mm256_extract_epi64(prev, 1);
        *pDst3++ = _mm256_extract_epi64(prev, 0);

        prev = current;
    }
}

You can do the operation on up to four 64bit elements at once on AVX2 (up to eight on AVX512)

If size isn't a multiple of four, there will be up to 3 remaining ones to deal with.

PS: Auto vectorization is never a proper solution.

answered Nov 20 '21 at 09:35

Jake 'Alquimista' LEE

6,197
2
17
25

2

Yuck, why would you want to gather 1 element from each of 4 positions strided through the array, instead of turning GCC's much better auto-vectorization strategy back into intrinsics like `_mm256_loadu_si256`? Memory-destination `vpextrq mem, xmm, imm` is 2 fused-domain uops on Skylake, p5 shuffle plus a p237+p4 micro-fused store, so it competes with the shuffles needed for `_mm256_set_epi64x` on non-constant data. Also, `vpextrq` from the upper half of a YMM requires a `vextracti128` first, because `vpextrq` only works on an XMM source. – Peter Cordes Nov 20 '21 at 09:48
2

If you want to give HW prefetchers multiple pages to look at, you should be doing 2 or 4x 256-bit vectors at a time in an unrolled loop, probably still using GCC's strategy. I haven't benchmarked, but on CPUs where 4k-splits aren't extremely expensive, I expect this to be significantly slower. Maybe worse even on CPUs before Skylake were a misaligned load that crosses a 4k boundary costs over 100 cycles. (Avoiding unaligned loads seems the only advantage here, but L1d cache absorbs those well.) – Peter Cordes Nov 20 '21 at 09:50
@PeterCordes I don't want the same memory to be read twice. It consumes power (You know my ARM background) And since size isn't fixed, there is no guarantee that stride will be a multiple of four. And most of all, I don't know Intel architecture very well: That above is almost exactly what I would do on NEON. (except for the `sri` `sli` instead of `or`) – Jake 'Alquimista' LEE Nov 20 '21 at 10:13
Executing all those shuffle uops very likely consumes more power than the load execution units would reading L1d with two unaligned loads instead of 1. You're not touching DRAM multiple times; the CPU has cache. Also, finishing the work and returning to a deep sleep state sooner saves much more power. x86 doesn't have strided SIMD loads in the first place, unlike NEON; I wasn't suggesting that. – Peter Cordes Nov 20 '21 at 10:23
@PeterCordes I already told you that I'm VERY disappointed in AVX2 a few years ago. I completely agree with Linus Torvalds on AVX. It heavily lacks in integer operations. I was shocked finding out that AVX even has no barrel shifter for 8bit in addition to permutaions being so clumsy and inconvenient. OpenCV would be much better with NEON than AVX. – Jake 'Alquimista' LEE Nov 20 '21 at 10:32
1

Linus has most notably complained about AVX512 as a "power virus". I don't recall comments from him on AVX2. AVX-512 has much better shuffles, but still no 8-bit shifts :/ Definitely still some warts, but really unfortunate that AVX-512's nice features like masking and better shifts are still mostly limited to servers because they only come with wide vectors, and Alder Lake won't be bringing them to the desktop for the most part, not on CPUs with E-cores enabled. Anyway, not liking an extension might explain writing sub-optimal code for it, but doesn't mean other folks can't do better. – Peter Cordes Nov 20 '21 at 10:48
@PeterCordes AVX-512 is much better. My image processing routine runs more than ten times faster on AVX-512 than on AVX2. I very was glad AVX-512 becoming "mandatory" on Intel chips. Then Intel let me down with AlderLake with its stupid E cores. I was considering building a 12600KF machine, but settled with 11400F last week. BTW I'll let you know about this new algorithm of mine when it goes public early next year (after applying for a patent) I'd be more than happy if you reviewed it, and really appreciate any suggestion on optimizing it further. – Jake 'Alquimista' LEE Nov 20 '21 at 12:42
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/239416/discussion-between-jake-alquimista-lee-and-peter-cordes). – Jake 'Alquimista' LEE Nov 20 '21 at 12:50

score -3 · Answer 5 · answered Nov 20 '21 at 07:04

-3

No, you can't

Both NEON and AVX(512) support barrel shift operations up to 64bit elements.

You can however "shift" the whole 128bit vector by n-bytes (8bits) with the instruction ext on NEON and alignr on AVX.

And you should avoid using the vector class for performance since it's nothing else than linked list which is bad for the performance.

answered Nov 20 '21 at 07:04

Jake 'Alquimista' LEE

6,197
2
17
25

So if SIMD can't do this, then at least I'm looking for solution about how to speedup this algorithm by any improvements that can be made. – Arty Nov 20 '21 at 07:07
@Arty SIMD CAN do this, especially NOEN with its `sri` and `sli` instructions. But ditch the vector class first. – Jake 'Alquimista' LEE Nov 20 '21 at 07:09
2

Regarding `std::vector` - I just used it as example, in reality I have just two plain memory pointers `uint64_t *src, *dst` which point to two memory locations, and source memory should be shifted to destination. BTW, in C++ `std::vector` is not a linked list but a regular contiguous array, samely organized in memory as `uint64_t arr[SIZE];`. – Arty Nov 20 '21 at 07:09
I just remove `std::vector` reference in my question. You may think that I just have a plain pointer to memory region that should be shifted. – Arty Nov 20 '21 at 07:11
1

`std::vector` is not a linked list. Idk where you heard this but it's not true. A vector in c++ is a contiguous area of memory that is reallocated upon expansion. – Qix - MONICA WAS MISTREATED Nov 20 '21 at 07:19
@Qix-MONICAWASMISTREATED reallocation, that's the worst thing for performance, a common trait of linked list and vector class. They all look the same to me. – Jake 'Alquimista' LEE Nov 20 '21 at 07:33
If one `new`/`memcpy` looks the same as a linked-list to you, you need new glasses. The copying isn't great, but with careful use of `std::vector` you can avoid wasteful copying. A linked list is fundamentally different and horrible, traversing it involves load-use latency; no idea what you're talking about lumping std::vector in with LLs like `std::forward_list`. Also, there is room to gain some performance vs. scalar `shld` by using 128-bit SSE or AVX even without the OR on the fly part; that's what GMP's `mpn_lshift` does on modern x86-64. Especially if data's hot in L2 cache. – Peter Cordes Nov 20 '21 at 07:49
@PeterCordes How did you know that I'm actually looking for new glasses? :-) "with careful use" is the problem. I've seen so many who aren't careful. Especially the Java generation. It must be even worse with the Python generation. And you know me: Anything that eats up performance is evil. – Jake 'Alquimista' LEE Nov 20 '21 at 07:57
Sure, but avoiding performance waste requires understanding. Calling it a linked list doesn't help anyone; the things its slow at are basically opposite to things a linked list is slow at. – Peter Cordes Nov 20 '21 at 08:08
@Jake NEON / SIMD has nothing to do with collection structures. I don't know what your point is. Your prior comments indicate you are indeed talking about std::vector. – Qix - MONICA WAS MISTREATED Nov 20 '21 at 13:55
@Qix-MONICAWASMISTREATED Sorry, it was a misunderstanding. I thought I didn't post it, and I just deleted the comment – Jake 'Alquimista' LEE Nov 20 '21 at 14:23

Efficiently shift-or large bit vector

5 Answers5

Linked