
I have an AVX2 (256-bit) SIMD vector of bytes that is padded with zeros at the front and at the back, like this: [0, 2, 3, ..., 4, 5, 0, 0, 0]. The number of zeros at the front is not known at compile time.

How would I efficiently shift/rotate away the leading zeros so that the vector looks like this: [2, 3, 4, 5, ..., 0, 0, 0, 0]?

jay jayjay
  • Do you know how many leading zeros there are before you start the shift? – Mike Vine Aug 25 '22 at 20:18
  • well you can easily use `movemask` + lzcnt to determine that – jay jayjay Aug 25 '22 at 20:19
  • Please be a bit more specific: are these 8 bytes in your vector and the rest are zeros, or are these 8 int32? Do you actually want a rotate, or is shifting sufficient, since only zeros will be at the beginning/end? Is the number of bytes (or elements) you want to shift/rotate known at compile time? – chtz Aug 25 '22 at 20:34
  • The other end should be zeros, and it's not known at compile time. – jay jayjay Aug 25 '22 at 20:48
  • If the pattern is fixed, then PSHUFB in assembly or as an intrinsic. – rcgldr Aug 25 '22 at 21:27
  • @rcgldr: These appear to be int32_t elements, since there are 8 of them in a 256-bit vector. So `vpshufb` won't work; you need `vpermd` for a lane-crossing shuffle. (And you need to generate a shuffle-control vector somehow, perhaps from vpcmpeqd / vpmovmskb / lzcnt and then loading from a sliding window of `int mask[] = {0, 1, ..., 6, 7, 0, 1, ..., 6, 7}` or something. With intrinsics, of course, not hand-written asm.) – Peter Cordes Aug 26 '22 at 00:04
  • But wait, this talks about a "byte vector" so yeah IDK. If it needs a different shuffle in each 8-byte chunk, then yeah `vpshufb`. (If you had AVX-512, [`vplzcntq`](https://www.felixcloutier.com/x86/vplzcntd:vplzcntq) / round to multiple of 8 bits / [`VPROLVQ`](https://www.felixcloutier.com/x86/vprold:vprolvd:vprolq:vprolvq).) – Peter Cordes Aug 26 '22 at 00:08
  • I've been assuming your notation has the most-significant (highest-numbered) element first, like Intel's diagrams in their manuals. But if that's like C-style array notation, with lowest element first, then you want to rotate *right* to bring the first non-zero element to the bottom of the vector or a chunk? – Peter Cordes Aug 26 '22 at 00:10
  • @PeterCordes they're bytes, I just didn't include all of them; will clarify. Also, TBH I'm not very clear which direction it is, but in theory a right shift should be the same as a left shift in any case – jay jayjay Aug 26 '22 at 00:27
  • Oh FFS, so you actually have 32 bytes that you want to rotate across the whole register? AVX2 can't do that in one shuffle. And yes you can rotate in either direction to get to the state you want, but the count will be different. If you stored the vector to memory, do you want the lowest address to be non-zero, or the highest address? – Peter Cordes Aug 26 '22 at 00:30
  • The lowest address should be non-zero. Will shuffling/some other SIMD be quicker than just storing and then loading? – jay jayjay Aug 26 '22 at 00:34
  • It *might* make sense to store twice and do an unaligned reload spanning them, despite the store-forwarding stall. That would be good for throughput if you need this for one vector between lots of other work, but bad for doing this in a loop without much other work. (Store-forwarding stalls [don't pipeline with each other, but can pipeline with successful store-forwarding](https://stackoverflow.com/a/69631247/224132). So if you just need this for one vector occasionally, and out-of-order exec can hide the latency, it's not many uops to vpcmpeqb/lzcnt or tzcnt to get a load offset) – Peter Cordes Aug 26 '22 at 00:36
  • BTW, I don't actually need this anymore, but I'll leave it open. – jay jayjay Aug 26 '22 at 01:17
  • Here's a very similar question: https://stackoverflow.com/q/66179765/126995 – Soonts Aug 26 '22 at 22:11

2 Answers


AVX2 has no way to do a lane-crossing shuffle with granularity smaller than 4 bytes. For byte granularity you'd want AVX-512 VBMI `vpermb` (in Ice Lake). If you had that: perhaps `vpcmpeqb` / `vpmovmskb` / `tzcnt` on the mask, and use that as an offset to load a window of 32 bytes from a constant array of `alignas(64) int8_t shuffles[] = {0, 1, 2, ..., 31, 0, 1, 2, ..., 31};`. That's your shuffle-control vector for `vpermb`.
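
A minimal sketch of that idea (untested; assumes AVX-512 VBMI + VL for `_mm256_permutexvar_epi8` and BMI1 for `tzcnt`; the function name is illustrative):

#include <immintrin.h>
#include <stdint.h>

// Sliding window of shuffle controls: a 32-byte load from shuffles + n
// gives indices {n, n+1, ..., n+31} mod 32, i.e. a rotate-by-n control.
alignas(64) static const int8_t shuffles[64] = {
     0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,
    16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
     0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,
    16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
};

__m256i rotate_low_zeros_out(__m256i v) {
    __m256i zero_cmp = _mm256_cmpeq_epi8(v, _mm256_setzero_si256());
    uint32_t nonzero = ~(uint32_t)_mm256_movemask_epi8(zero_cmp);
    unsigned offset = _tzcnt_u32(nonzero);   // index of first non-zero byte; 32 if all-zero
    __m256i ctrl = _mm256_loadu_si256((const __m256i *)(shuffles + offset));
    return _mm256_permutexvar_epi8(ctrl, v); // vpermb
}

(The all-zero case is fine: `tzcnt` of 0 is 32, and the window at `shuffles + 32` is the identity.)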


Without AVX-512 VBMI, it might make sense to store twice and do an unaligned reload spanning them, despite the store-forwarding stall. That would be good for throughput if you need this for one vector between lots of other work, but bad for doing this in a loop without much other work.

Store-forwarding stalls [don't pipeline with each other, but can pipeline with successful store-forwarding](https://stackoverflow.com/a/69631247/224132). So if you just need this for one vector occasionally, and out-of-order exec can hide the latency, it's not many uops to `vpcmpeqb` / `tzcnt` (or `lzcnt`) to get a load offset.
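
A sketch of that store-twice version (untested; the function name is illustrative; assumes BMI1 for `tzcnt`, and again the all-zero case falls out naturally since `tzcnt` of 0 is 32, which reloads the second copy):

#include <immintrin.h>
#include <stdint.h>

__m256i rotate_low_zeros_out_mem(__m256i v) {
    alignas(32) uint8_t buf[64];
    _mm256_store_si256((__m256i *)buf, v);
    _mm256_store_si256((__m256i *)(buf + 32), v);  // second copy
    __m256i zero_cmp = _mm256_cmpeq_epi8(v, _mm256_setzero_si256());
    uint32_t nonzero = ~(uint32_t)_mm256_movemask_epi8(zero_cmp);
    unsigned offset = _tzcnt_u32(nonzero);         // count of low zero bytes
    // Unaligned reload spanning both copies = rotate by `offset` bytes.
    // This load store-forward-stalls, but OoO exec can often hide it.
    return _mm256_loadu_si256((const __m256i *)(buf + offset));
}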

Peter Cordes
  • See [Left shift a vector by runtime variable number of bytes](https://stackoverflow.com/q/73508678) for a full example of loading a sliding window from an array to get a shuffle-control vector. (For `pshufb` for 16-byte shifts with byte granularity.) – Peter Cordes Sep 04 '22 at 18:31

This works if your element types are at least 32 bits wide.

I can't quite follow the documentation for `_mm256_permutevar8x32_epi32`, but in practice, adding an offset to the identity permutation performs a rotate, which is what you want (once you already have the number of leading zeros).

#include <immintrin.h>

// Rotate 8 x int32: the element at index `offset` moves to index 0.
__m256i rotate_i32(__m256i w, int offset) {
    __m256i identity = _mm256_set_epi32(7, 6, 5, 4, 3, 2, 1, 0);
    // vpermd uses only the low 3 bits of each index, so the add wraps mod 8.
    __m256i shuffle = _mm256_add_epi32(identity, _mm256_set1_epi32(offset));
    return _mm256_permutevar8x32_epi32(w, shuffle);
}

Here is the godbolt: https://godbolt.org/z/Kv8oxs6oY

(-1, -2, -3, -4, -5, -6, -7, -8)
(-2, -3, -4, -5, -6, -7, -8, -1)
(-3, -4, -5, -6, -7, -8, -1, -2)
(-4, -5, -6, -7, -8, -1, -2, -3)
(-5, -6, -7, -8, -1, -2, -3, -4)
(-6, -7, -8, -1, -2, -3, -4, -5)
(-7, -8, -1, -2, -3, -4, -5, -6)
(-8, -1, -2, -3, -4, -5, -6, -7)
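
To get `offset` at runtime, a possible helper (not part of the original answer; it follows the `movemask` + `tzcnt` idea from the comments and assumes BMI1):

// Count the zero elements at the low end of a vector of 8 x int32.
int leading_zero_elements_i32(__m256i w) {
    __m256i zero_cmp = _mm256_cmpeq_epi32(w, _mm256_setzero_si256());
    unsigned zeros = (unsigned)_mm256_movemask_ps(_mm256_castsi256_ps(zero_cmp));
    return (int)_tzcnt_u32(~zeros & 0xFFu);  // 32 if all eight elements are zero
}

// Usage: rotate_i32(w, leading_zero_elements_i32(w));
// (vpermd uses only the low 3 bits of each index, so offset >= 8 wraps mod 8.)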

The same trick works for 64-bit elements, but you need to multiply the offset by 2, since each 64-bit element spans two 32-bit lanes.

// Rotate 4 x int64 using the same 32-bit permute: each 64-bit element
// spans two 32-bit lanes, hence offset * 2.
__m256i rotate_i64(__m256i w, int offset) {
    __m256i identity = _mm256_set_epi32(7, 6, 5, 4, 3, 2, 1, 0);
    __m256i shuffle = _mm256_add_epi32(identity, _mm256_set1_epi32(offset * 2));
    return _mm256_permutevar8x32_epi32(w, shuffle);
}

Godbolt: https://godbolt.org/z/85h6aWPsW

Output:

(-1, -2, -3, -4)
(-2, -3, -4, -1)
(-3, -4, -1, -2)
(-4, -1, -2, -3)
Denis Yaroshevskiy
  • Yup, that works too: since `vpermd` only uses the low 3 bits of each element as the shuffle index, adding the same thing to every index wraps mod 8, giving you a rotate. As in [Left shift a vector by runtime variable number of bytes](https://stackoverflow.com/q/73508678), this is slightly less efficient than loading a 32-byte window from a 64-byte array using the same index as an offset, and you already need to load a constant to efficiently get a `[7,6,5,4,3,2,1,0]` vector. To save space, either one could "compress" the vector constants by loading them with a `vpmovzxbd` shuffle. – Peter Cordes Sep 04 '22 at 18:36
  • I always assume that the constants are loaded outside of the loop and therefore cost very little. If that does not hold, you are absolutely right. – Denis Yaroshevskiy Sep 04 '22 at 18:40
  • `_mm256_set1_epi32` costs a `vmovd` + shuffle for a runtime-variable `offset`, and then another `vpaddd`. That's getting close to load-use latency for an L1d hit if you're repeatedly loading different windows from an array. And the load has *better* throughput, too: 2/clock, since you'd `alignas(64)` the array to make sure any 32-byte window can't split across two cache lines. – Peter Cordes Sep 04 '22 at 18:48