Rotating an 8x8 block is done most efficiently with a SIMD or SWAR technique that can read at least 64 bits at a time.
Rot90Left(X) = flip_vertical(transpose(X))
Rot90Right(X) = transpose(flip_vertical(X))
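For example, for a 2x2 block

    X = A B      transpose(X) = A C      flip_vertical(transpose(X)) = B D
        C D                     B D                                    A C

which is X rotated 90 degrees to the left.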
Vertical flip is a zero-cost operation, since it only means storing to / reading from the opposite end of the temporary variables.
If an SSE / SIMD implementation of the transpose can't be used, the SWAR kernel below has proven quite fast on x64 and ARM64 (ARMv8).
#include <stdint.h>
#include <string.h>

/* One delta-swap step: exchanges the odd bytes of *a with the even
   bytes of *b, transposing the 2x2 byte blocks shared by the two rows. */
static inline void transpose_u8(uint64_t *a, uint64_t *b) {
    uint64_t A = *a, B = *b, C = (B ^ (A >> 8)) & 0x00ff00ff00ff00ffull;
    *a = A ^ (C << 8);
    *b = B ^ C;
}
/* Same trick on 16-bit units: transposes the 2x2 blocks of byte pairs. */
static inline void transpose_u16(uint64_t *a, uint64_t *b) {
    uint64_t A = *a, B = *b, C = (B ^ (A >> 16)) & 0x0000ffff0000ffffull;
    *a = A ^ (C << 16);
    *b = B ^ C;
}
/* Same trick on 32-bit units: swaps the off-diagonal 4x4 byte blocks. */
static inline void transpose_u32(uint64_t *a, uint64_t *b) {
    uint64_t A = *a, B = *b, C = (B ^ (A >> 32)) & 0x00000000ffffffffull;
    *a = A ^ (C << 32);
    *b = B ^ C;
}
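As a quick sanity check (my own, assuming a little-endian target where byte 0 of a row sits in the least significant byte of the word):

#include <assert.h>

/* Illustration: with rows packed LSB-first, transpose_u8 swaps A's odd
   bytes with B's even bytes -- the 2x2 blocks shared by the two rows. */
void check_transpose_u8(void) {
    uint64_t a = 0x0706050403020100ull;  /* bytes 00..07, LSB first */
    uint64_t b = 0x1716151413121110ull;  /* bytes 10..17 */
    transpose_u8(&a, &b);
    assert(a == 0x1606140412021000ull);  /* (00,10,02,12,04,14,06,16) */
    assert(b == 0x1707150513031101ull);  /* (01,11,03,13,05,15,07,17) */
}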
/* Transpose an 8x8 byte block, optionally flipping it vertically on
   load (LEFT = 7) and/or on store (RIGHT = 7); pass 0 for no flip. */
void transpose8x8(const uint8_t *src, int skip0, uint8_t *dst, int skip1,
                  int LEFT, int RIGHT) {
    uint64_t d[8];
    for (int x = 0; x < 8; x++)
        memcpy(d + (x ^ LEFT), src + x * skip0, 8);
    transpose_u8(d + 0, d + 1);
    transpose_u8(d + 2, d + 3);
    transpose_u8(d + 4, d + 5);
    transpose_u8(d + 6, d + 7);
    transpose_u16(d + 0, d + 2);
    transpose_u16(d + 1, d + 3);
    transpose_u16(d + 4, d + 6);
    transpose_u16(d + 5, d + 7);
    transpose_u32(d + 0, d + 4);
    transpose_u32(d + 1, d + 5);
    transpose_u32(d + 2, d + 6);
    transpose_u32(d + 3, d + 7);
    for (int x = 0; x < 8; x++)
        memcpy(dst + x * skip1, d + (x ^ RIGHT), 8);
}
Here the right rotation (transpose of the vertical flip, per the identity above) is obtained by calling with LEFT = 7, RIGHT = 0; the left rotation (vertical flip of the transpose) with LEFT = 0, RIGHT = 7; and the plain transpose with LEFT = 0, RIGHT = 0.
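A minimal usage sketch (my own, assuming the parameterized signature above and a little-endian target; the corner checks follow from the rotation identities):

#include <assert.h>
#include <stdio.h>

int main(void) {
    uint8_t src[64], dst[64];
    for (int i = 0; i < 64; i++)
        src[i] = (uint8_t)i;              /* src[row * 8 + col] */
    transpose8x8(src, 8, dst, 8, 7, 0);   /* LEFT = 7, RIGHT = 0: rotate right */
    assert(dst[0] == src[7 * 8]);         /* bottom-left -> top-left */
    assert(dst[7] == src[0]);             /* top-left -> top-right */
    puts("ok");
    return 0;
}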
My hypothesis is that any decent compiler will replace the internal memory accesses in the transpose_uXX functions with direct operations on variables held in registers, and will turn each memcpy into a single 64-bit load or store -- this should happen at least on 64-bit architectures. On 32-bit x86 there aren't enough general-purpose registers, and the practical alternative is to use whichever SIMD registers and instruction set are available.
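For reference, here is a sketch of what such a SIMD transpose can look like with SSE2 unpack instructions (a hedged example, not the author's implementation; the name transpose8x8_sse2 is mine). The three rounds of unpacks mirror the three delta-swap rounds above.

#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

void transpose8x8_sse2(const uint8_t *src, int skip0,
                       uint8_t *dst, int skip1) {
    __m128i r[8], t[4], u[4], v[4];
    for (int x = 0; x < 8; x++)          /* load 8 bytes per row */
        r[x] = _mm_loadl_epi64((const __m128i *)(src + x * skip0));
    /* Round 1: interleave the bytes of adjacent row pairs. */
    t[0] = _mm_unpacklo_epi8(r[0], r[1]);
    t[1] = _mm_unpacklo_epi8(r[2], r[3]);
    t[2] = _mm_unpacklo_epi8(r[4], r[5]);
    t[3] = _mm_unpacklo_epi8(r[6], r[7]);
    /* Round 2: interleave 16-bit pairs. */
    u[0] = _mm_unpacklo_epi16(t[0], t[1]);
    u[1] = _mm_unpackhi_epi16(t[0], t[1]);
    u[2] = _mm_unpacklo_epi16(t[2], t[3]);
    u[3] = _mm_unpackhi_epi16(t[2], t[3]);
    /* Round 3: interleave 32-bit quads; v[k] now holds output
       rows 2k (low half) and 2k+1 (high half). */
    v[0] = _mm_unpacklo_epi32(u[0], u[2]);
    v[1] = _mm_unpackhi_epi32(u[0], u[2]);
    v[2] = _mm_unpacklo_epi32(u[1], u[3]);
    v[3] = _mm_unpackhi_epi32(u[1], u[3]);
    for (int k = 0; k < 4; k++) {
        _mm_storel_epi64((__m128i *)(dst + (2 * k) * skip1), v[k]);
        _mm_storel_epi64((__m128i *)(dst + (2 * k + 1) * skip1),
                         _mm_srli_si128(v[k], 8));
    }
}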