
I have a bit matrix (of size 6x6, 7x7, or 8x8) stored within a single 64-bit integer.

I am looking for C++ code that rotates these matrices by 90, 180, and 270 degrees, as well as C++ code for shifting (horizontally and vertically) and mirroring them. The output must again be a 64-bit integer.

Using some of the advanced CPU instruction sets would probably be okay, as would hash tables or similar techniques - speed is of the highest importance, and RAM is available. I will run this on an AMD Ryzen 7 1700 eight-core PC. I am not familiar with these instruction sets (e.g. SSE2), but I have used `__popcnt64()` and `_rotl64()` in C++.

Can anybody point me in the right direction? I have written my own code for the 7x7 matrix, but I now need the code for 6x6 and 8x8, and I wonder whether anybody has published anything on this topic, perhaps in a more clever way than my 7x7 approach.

By the way, the 6x6 and 7x7 matrices are stored in the least significant 36 and 49 bits, respectively, with the remaining bits set to zero.
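
For concreteness, element access under such a packed layout would look like this (a sketch, assuming row-major order with bit (0,0) in the least significant bit, the same convention the print routine in the answer below uses; `N` is the row stride, i.e. 6, 7, or 8):

#include <stdint.h>

/* Assumed layout: bit (r,c) sits at position r*N + c, with N = 6, 7 or 8. */
static inline int get_bit(uint64_t m, int N, int r, int c) {
    return (int)((m >> (r * N + c)) & 1);
}

static inline uint64_t set_bit(uint64_t m, int N, int r, int c) {
    return m | (1ULL << (r * N + c));
}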

  • Ryzen unfortunately doesn't have efficient [`pdep`](http://felixcloutier.com/x86/PDEP.html) (6 uops, 18c latency/throughput, unlike Intel where it's 3c latency / 1c throughput: https://agner.org/optimize/), so your best bet is probably to expand bits to SIMD vector elements you can shuffle with `pshufb` or interleave with `punpcklbw`. See [How to perform the inverse of \_mm256\_movemask\_epi8 (VPMOVMSKB)?](https://stackoverflow.com/q/21622212). (but maybe use four 16-bit chunks instead of AVX2 32-bit vectors, if you mostly care about Ryzen). Then shuffle and use `_mm_movemask_epi8` – Peter Cordes Aug 22 '18 at 22:21
  • This is in danger of being closed as a request for a library or similar. See [ask]. Just ask what the best way to do it is. Google is where you ask if you only want to find an implementation that already exists. Stack Overflow is where you ask what the actual best / most-efficient way to implement it is. – Peter Cordes Aug 22 '18 at 23:30
  • There are already many questions about 8x8 rotation: [What is the fastest way to rotate the bits in an 8x8 block on bits?](https://stackoverflow.com/q/6930667/995714), [How to efficiently transpose a 2D bit matrix](https://stackoverflow.com/q/41778362/995714), [Flip bits of byte array - performance improvement](https://stackoverflow.com/q/21745008/995714). Now you need to implement the 6x6 one. – phuclv Aug 23 '18 at 02:30
  • To rotate left 90° a row in a 6x6 matrix, or its minor around the secondary diagonal, use `mat*0x810204081 & 0x820820820`, like [what I used for 8x8](https://stackoverflow.com/a/51653145/995714). – phuclv Aug 23 '18 at 03:38
  • Did you consider storing the 7x7 bits as `0, 0, ..., 0, 0, b48, ...b42,...,0, b20, ...,b14, 0, b13, ..., b7, 0, b6, ... , b0`, i.e. with a padding of one zero bit after each group of 7 bits and 8 extra zero bits at the end? Then you can reuse all the existing 8x8 code on the internet for the 7x7 case as well. You only need one extra bit shift at the end to get the bits at the right position. – wim Aug 23 '18 at 11:35
  • @wim: I was thinking the same thing. Your best bet for 6x6 and 7x7 is unpacking to a row stride of 8 bits to set up for SIMD byte manipulation, e.g. `_mm_shuffle_epi8` for 180-degree rotations or mirroring, for the bit reverse within each byte (using 4-bit lookup tables), plus a right shift by 1 or 2. Scalar `bswap` + shift can byte-reverse, so you probably want to combine the shifts together. Unpacking to that format without pext/pdep sucks. https://godbolt.org/z/6cUhIL has design notes for scalar and/andn (BMI1) + ADD or LEA (masking to move some bits, using `x+x = x<<1`, but `x + (x&mask)`). – Peter Cordes Aug 23 '18 at 15:13
  • @PeterCordes Yes indeed. I don't see an efficient way for packing/unpacking from 64 to 36 or 49 bits either, without fast (Intel) pext/pdep. @fmb: Note that a row stride of 8 bits throughout the whole code would only have a very minor impact on the storage requirements: one 6x6 bit matrix takes 6 bytes in that case instead of 5 bytes for the packed case. In the 7x7 case there is no difference at all. – wim Aug 23 '18 at 22:58
  • @wim: AMD `pdep` is 6 uops (with 18c latency/throughput), and might possibly be optimal if front-end throughput is the bottleneck. My idea needs maybe 9 uops and 9c latency (including a 3c LEA) for 7x7, but has no port-pressure bottlenecks. So yeah, we definitely want row stride = 8 for efficient operations. Interesting point about packing them more tightly; my understanding was that the OP was already padding them out to `uint64_t`, just with all the padding at the end. – Peter Cordes Aug 23 '18 at 23:22
  • Thanks a lot for all the responses, this is more content than I can digest at short notice and will keep me busy for a while. Just for the minutes, I hereby ask "what the actual best / most-efficient way to implement it is." - I thought I had expressed this already by asking for a "more clever way" even without disclosing my approach. – fmb Aug 24 '18 at 06:01
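
To make phuclv's multiply-and-mask trick from the comments above concrete, here is a minimal scalar sketch (an illustration, not code from the linked answer) of a 90° counter-clockwise rotation for the 6x6 case, assuming the packed row-major layout from the question with bit (r,c) at position 6*r + c. The multiply places six copies of a row at a stride of 7 bits; the mask keeps one bit per copy, which deposits the reversed row into column 5:

#include <stdint.h>

/* Rotate a 6x6 bit matrix 90 degrees counter-clockwise: (r,c) -> (5-c, r).
   Packed layout, bit (r,c) at position 6*r + c.                            */
uint64_t rot90_ccw_6x6(uint64_t m) {
    uint64_t out = 0;
    for (int r = 0; r < 6; r++) {
        uint64_t row = (m >> (6 * r)) & 0x3F;                   /* extract row r            */
        uint64_t col = (row * 0x810204081ULL) & 0x820820820ULL; /* reversed row -> column 5 */
        out |= col >> (5 - r);                                  /* shift it into column r   */
    }
    return out;
}

Each iteration is a shift, a mask, a multiply, another mask, and an OR, so the whole rotation is roughly 30 cheap scalar ops; whether that beats a SIMD approach is worth measuring on the target CPU.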
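
And a sketch of wim's padding suggestion from the comments: repack the 7-bit-stride 7x7 matrix to an 8-bit row stride, reuse any existing 8x8 routine, and repack back. On Intel CPUs with fast BMI2, a single `_pdep_u64(x, 0x007F7F7F7F7F7F7F)` / `_pext_u64` pair does the same job, but `pdep`/`pext` are slow on Ryzen as noted above, hence the plain loops:

#include <stdint.h>

/* Widen a 7x7 matrix from a 7-bit row stride to an 8-bit row stride,
   so that generic 8x8 byte-oriented code can be reused.              */
uint64_t widen_7x7_to_stride8(uint64_t x) {
    uint64_t y = 0;
    for (int r = 0; r < 7; r++)
        y |= ((x >> (7 * r)) & 0x7F) << (8 * r);
    return y;
}

/* Inverse: pack an 8-bit row stride back down to the 7-bit stride. */
uint64_t narrow_stride8_to_7x7(uint64_t y) {
    uint64_t x = 0;
    for (int r = 0; r < 7; r++)
        x |= ((y >> (8 * r)) & 0x7F) << (7 * r);
    return x;
}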

1 Answer


In principle, AVX2 can be quite useful here. For example, to rotate by 90 degrees, you can do:

#include <stdio.h>
#include <immintrin.h>
#include <stdint.h>
/*     gcc -O3 -Wall -m64 -mfma -mavx2 -march=skylake rot_bit_mat.c    */ 

int print_bitmat(uint64_t k);

uint64_t bitmat_rot_90(uint64_t x){   /*  0xFEDCBA9876543210     */
    __m256i   mask1   = _mm256_set_epi64x(0x1010101010101010, 0x2020202020202020, 0x4040404040404040, 0x8080808080808080);
    __m256i   mask2   = _mm256_set_epi64x(0x0101010101010101, 0x0202020202020202, 0x0404040404040404, 0x0808080808080808);
    __m256i   x_bc    = _mm256_set1_epi64x(x);                  /* Broadcast x                         */

    __m256i   r_lo    = _mm256_and_si256(x_bc,mask1);           /* Extract the right bits within bytes */
              r_lo    = _mm256_cmpeq_epi8(r_lo,mask1);          /* Test if bits within bytes are set   */
    uint64_t  t_lo    = (uint32_t)_mm256_movemask_epi8(r_lo);  /* Move 32 bytes to a 32-bit mask; cast avoids sign extension */

    __m256i   r_hi    = _mm256_and_si256(x_bc,mask2);
              r_hi    = _mm256_cmpeq_epi8(r_hi,mask2);
    uint64_t  t_hi    = (uint32_t)_mm256_movemask_epi8(r_hi);  /* Cast avoids sign extension before the shift below           */
              return t_lo | (t_hi << 32);
}


int main(int argc, char **argv){
           /*  0xFEDCBA9876543210 */
  uint64_t k = 0xA49B17E63298D5C3;

  print_bitmat(k);
  printf("\n");
  print_bitmat(bitmat_rot_90(k));
  printf("\n\n");

  return 0;
}

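/* Prints row 0 first, with bit (i*8 + j) as column j: the least
   significant bit of k is the top-left element of the matrix.  */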
int print_bitmat(uint64_t k){
    uint64_t i,j;
    for (i = 0; i < 8; i++){
        for (j = 0; j < 8; j++){
            printf("%llu",1ull & (k >> (i * 8ull + j)));
        }
        printf("\n");
    }
    return 0;
}

The output is:

$ ./a.out
11000011
10101011
00011001
01001100
01100111
11101000
11011001
00100101

11101011
11001000
00011001
01110110
00100010
01001101
10011110
11000110

It is likely that similar techniques can be used for the other transformations, although it may take some time to figure out the right bit masks.

The comments on the question give directions for other transformations: AVX2 bit reversal of bytes is of interest here, see here and here. The latter answer reverses the bits of 32-bit ints, whereas in your case bit reversal of 64-bit ints is relevant, so it needs some modifications. The `_bswap64()` intrinsic can be used to mirror the bit matrix upside down.
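
As a concrete sketch of those two building blocks (untested; the helper names and instruction choices are assumptions, using SSSE3 intrinsics and GCC/Clang builtins, for the row-major 8-bit-stride layout that `print_bitmat` above assumes): a `pshufb` nibble lookup reverses the bits within each byte, which mirrors every row left-right; a 64-bit byte swap reverses the row order, which mirrors the matrix upside down; composing the two gives the 180-degree rotation:

#include <immintrin.h>
#include <stdint.h>

/* Mirror each row left-right: reverse the bits within each byte
   via a pshufb lookup of bit-reversed nibbles (SSSE3).          */
static inline uint64_t bitmat_mirror_lr(uint64_t x) {
    const __m128i nib_rev = _mm_setr_epi8(     /* bit-reverse of 0x0..0xF */
        0x0, 0x8, 0x4, 0xC, 0x2, 0xA, 0x6, 0xE,
        0x1, 0x9, 0x5, 0xD, 0x3, 0xB, 0x7, 0xF);
    const __m128i low4 = _mm_set1_epi8(0x0F);
    __m128i v  = _mm_cvtsi64_si128((long long)x);
    __m128i lo = _mm_and_si128(v, low4);                       /* low nibbles  */
    __m128i hi = _mm_and_si128(_mm_srli_epi16(v, 4), low4);    /* high nibbles */
    __m128i r  = _mm_or_si128(_mm_slli_epi16(_mm_shuffle_epi8(nib_rev, lo), 4),
                              _mm_shuffle_epi8(nib_rev, hi));
    return (uint64_t)_mm_cvtsi128_si64(r);
}

/* Mirror upside down: byte-reversing the word reverses the row order. */
static inline uint64_t bitmat_mirror_ud(uint64_t x) {
    return __builtin_bswap64(x);
}

/* Rotate 180 degrees: both mirrors combined, i.e. a full 64-bit bit reversal. */
static inline uint64_t bitmat_rot_180(uint64_t x) {
    return bitmat_mirror_ud(bitmat_mirror_lr(x));
}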

wim
  • Can you use the same mask twice, with AND and ANDN, then compare against zero vs. compare against the mask? – Peter Cordes Aug 23 '18 at 15:06
  • @PeterCordes Indeed ANDN and compare against zero is an alternative, but I don't see a way to get rid of the second mask. – wim Aug 23 '18 at 18:27
  • Oh silly me, those masks aren't the inverse of each other. They're selecting different single bits in different nibbles, not opposite nibbles. I'd just been looking at unpacking row-stride = 7 bits to row-stride = 8 bits, like I commented on the question, where I did need to mask something both ways. (I wish compilers were better at compressing constants, like generating one from the other with a shift instead of loading them both.) – Peter Cordes Aug 23 '18 at 18:30