how to implement a 256 bit rotate using avx2

Question

Given a value in an avx2 register, I would like to mask (with AND) then rotate by k bits There does not appear to be a rotate instruction for the entire 256 bits but there is for each 64 bits:

// this is the desired bit pattern
// ...0111110111110111110111110111110
// set every kth bit to 0
inline __m256i setkthzero(const uint32_t k) {
  const uint64_t rotate_by = 64 % k; // each 64 bit word shifts
  __m256i t = set1();

  //    for (uint32_t i = 0; i < 256; i += k) {
  //    t &= ~(1 << i); // obviously not AVX2, how to do this?
  //}
  uint64_t ta = 0xFFFFFFFFFFFFFFFEULL; // low bit set to zero
  uint64_t tb = (ta << rotate_by) | (ta >> (64-rotate_by)); // c++ rotate
  uint64_t tc = (tb << rotate_by) | (tb >> (64-rotate_by)); // c++ rotate
  uint64_t td = (tc << rotate_by) | (tc >> (64-rotate_by)); // c++ rotate

  __m256i mask =  _mm256_set_epi64x(td, tc, tb, ta);
  for (uint32_t i = 64; i > 0; i -= k) {
    __m256i shift = _mm256_rol_epi64(mask, k);
    mask = _mm256_and_si256(mask, shift);
  }
  return mask;
}

I cannot test the above code because my CPU does not support avx512 for the _mm256_rol_epi64. So first question, is there some other way in avx2 where I can reasonably do this, and second, once I have these masks, how can I rotate the entire mask by m bits?

From what I gather there isn't a single instruction, but is there any way to construct the operation? I can't think of one.

__m256i mask = setkthzero(6);

// advance to next position...
t = rol(mask, 2); // how to rotate mask by m=2 bits?
...
t = rol(mask, 1);

__m256i mask2 = setkthzero(10);
t = rol(mask2, 2);
...
t = rol(mask2, 4);
...
t = rol(mask2, 6);

you can use https://www.intel.com/content/www/us/en/developer/articles/tool/software-development-emulator.html to test instructions not supported by your CPU — Alan Birtles, Feb 08 '22 at 14:30
It helps to remember that AVX2 is a pretty fair description: it's just AVX times 2. It may look like a single register, but it's just two 128 bits registers on which you run the same function. Intel calls that "lanes". I call it lame. In particular, you ask for a cross-lane operation. That can't work.; the 127th bit and 128th bit are in different lanes. — MSalters, Feb 08 '22 at 16:48
Do you want to generate a mask where every kth bit is 0, or do you want to rotate a 256bit vector? In the latter case, do you read your vector from memory and are you able to read beyond the size of the vector? And do you know `k` at compile time? — chtz, Feb 08 '22 at 17:10
@MSalters: In this case, it's not like SSE or AVX had any 128-bit bit-shifts / rotates either. The only whole-register or whole-128bit-lane horizontal data movement ops have at least byte granularity (like `palignr` which can byte-rotate in 128-bit chunks, or `_mm_bslli_si128` aka `pslldq` byte shift) — Peter Cordes, Feb 08 '22 at 17:52
[`_mm256_rol_epi64` is AVX512-only](https://www.felixcloutier.com/x86/vprold:vprolvd:vprolq:vprolvq). It can be relatively cheaply emulated with `sll` / `srl` / `or`, though, to just rotate within each SIMD element. But if you're masking anyway, probably just use a shift `1ULL< — Peter Cordes, Feb 08 '22 at 17:56
@chtz Note that the above code attempts to create a mask where every kth bit is zero using 64 bits at a time, but then I want to be able to rotate the resulting 256 bit mask somehow. So there is no way? Yes, I know k at compile time. take k=6 — Dov, Feb 08 '22 at 22:30
Your rotate count is also a compile-time constant? That probably can be done more efficiently than regenerating the mask with an offset, especially for this algorithm where a compile-time-constant `k` isn't factored in until after you already have a SIMD vector. (Not as part of the `0xFF...FE` masks.) So you know which bits have to move across element boundaries; that's not dependent on a runtime variable which would make it expensive. — Peter Cordes, Feb 08 '22 at 22:39
If you don't mind writing code with shuffles designed around its value, then do that. Since you just have this repeating bit-pattern, it doesn't repeat in each 128 bits does it? I guess not unless `k` is a power of 2. Anyway, you might want to have a look at [Emulating shifts on 32 bytes with AVX](https://stackoverflow.com/q/25248766) re: emulating shifts across all the bits in a vector; doing rotates involves taking some bits from the top down to the bottom instead of zeroes, but with luck the same shuffle can take bits from low to high as well as high to low. — Peter Cordes, Feb 08 '22 at 22:55
If it's just about creating a mask with 0s at certain points (with the distances of the zeros known at compile time, or at least known outside any hot loop), I would just track in each 32 (or 64) bit element how much this need to be shifted at each iteration -- or maybe even go at byte-level and do a `pshufb`-LUT (I haven't thought about details yet). It may also be a good idea to have the inverted mask, so you can shift in zeros for free (depends on how you want to use the mask later on, of course). — chtz, Feb 08 '22 at 23:14
Actually, your goal is still unclear to me. If your first mask has 0s every 10th bit, that would mean 0s at positions 0, 10, ..., 250. If you "rol" that by 6 bits, you would have 0s at positions 0, 6, 16, ..., 246. That means bits 0, 6 would be only 6 apart. — chtz, Feb 09 '22 at 00:54
If you want to rotate by multiples of 2 bits, maybe implement rotates by 2 and 4. Instead of doing all rotates from the original mask, maybe do `t = rol(mask, 2)` ; `t = mask = rol(mask, 4)` ; `t = rol(mask, 2)` ; ... so you have some ILP, with odd multiples of 2 coming from multiples of 4. That's relevant if you're going much beyond 6, especially in a loop, to get some code reuse and save on mask constants if there are any. If you really only need 3 different masks, just write some `constexpr` code so you're loading vector constants. — Peter Cordes, Feb 09 '22 at 02:48

how to implement a 256 bit rotate using avx2

0 Answers0