I have a 2D 4x8 nibble matrix, stored as a 16-byte uint8_t array, and I want to transpose it. Each byte packs a pair of adjacent nibbles i, j as (j << 4) | i, so the first nibble of a pair sits in the low 4 bits.
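To make the packing convention concrete, here is a minimal sketch of nibble accessors. The names get_nibble and set_nibble and the cols parameter are my own, for illustration only, and the sketch assumes an even number of nibbles per row:

```c
#include <assert.h>
#include <stdint.h>

/* Read the nibble at (row, col) from a packed matrix with `cols` nibbles
   per row (cols must be even). Even columns live in the low 4 bits of a
   byte, odd columns in the high 4 bits. */
static uint8_t get_nibble(const uint8_t *m, int cols, int row, int col) {
    uint8_t byte = m[(row * cols + col) / 2];
    return (col & 1) ? (uint8_t)(byte >> 4) : (uint8_t)(byte & 0x0f);
}

/* Write the nibble at (row, col), leaving the other half of the byte intact. */
static void set_nibble(uint8_t *m, int cols, int row, int col, uint8_t value) {
    assert(value < 16);
    uint8_t *byte = &m[(row * cols + col) / 2];
    if (col & 1)
        *byte = (uint8_t)((*byte & 0x0f) | (value << 4));
    else
        *byte = (uint8_t)((*byte & 0xf0) | value);
}
```

With cols = 8, storing the nibbles 0 and 1 at (0, 0) and (0, 1) produces the byte 0x10.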
For example, given the following matrix:
 0  1  2  3  3  7  1  9
 4  5  6  7  4  1  6 15
 8  9 10 11  3 14  6 11
12 13 14 15  8 10  7  4
represented as:
const uint8_t matrix[] = {
0x10, 0x32, 0x73, 0x91,
0x54, 0x76, 0x14, 0xf6,
0x98, 0xba, 0xe3, 0xb6,
0xdc, 0xfe, 0xa8, 0x47,
};
the desired array (the transposed 8x4 matrix, packed the same way) would be:
const uint8_t result[] = {
0x40, 0xc8, 0x51, 0xd9,
0x62, 0xea, 0x73, 0xfb,
0x43, 0x83, 0x17, 0xae,
0x61, 0x76, 0xf9, 0x4b,
};
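For checking any faster version, a plain scalar reference may be useful. This is only a sketch of the intended mapping, not an optimized solution, and the name transpose_4x8_nibbles_ref is made up for illustration:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar reference: src is 4 rows x 8 nibbles, dst is 8 rows x 4 nibbles.
   Both use the same packing, with the even-indexed nibble in the low 4 bits. */
static void transpose_4x8_nibbles_ref(const uint8_t src[16], uint8_t dst[16]) {
    assert(src != dst);                 /* not an in-place algorithm */
    memset(dst, 0, 16);
    for (int row = 0; row < 4; row++) {
        for (int col = 0; col < 8; col++) {
            int s = row * 8 + col;      /* nibble index in the source */
            int d = col * 4 + row;      /* nibble index in the transpose */
            uint8_t nib = (uint8_t)((src[s / 2] >> ((s & 1) * 4)) & 0x0f);
            dst[d / 2] |= (uint8_t)(nib << ((d & 1) * 4));
        }
    }
}
```

Running it on the matrix above reproduces the result array, which makes it a convenient oracle for testing a SIMD implementation.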
How can I implement a function that achieves this as efficiently as possible? Extensions up to AVX2 are fair game.
This is my C implementation so far, based on "Nibble shuffling with x64 SIMD". It splits the matrix into two 64-bit halves, unpacks the nibbles into bytes, shuffles them, and re-packs them.
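#include <immintrin.h> // SSSE3 is required for _mm_shuffle_epi8 and _mm_maddubs_epi16
#include <stdint.h>

__m128i unpack_nibbles(__m128i src) {
    __m128i nibbles_hi = _mm_srli_epi64(src, 4);
    // Interleave each byte with its copy shifted right by 4, then mask to 4 bits:
    // byte 2k holds the low nibble of src byte k, byte 2k+1 holds its high nibble
    __m128i unpacked = _mm_unpacklo_epi8(src, nibbles_hi);
    return _mm_and_si128(unpacked, _mm_set1_epi8(0xf));
}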
void transpose_4x8_nibbles(const uint8_t *src, uint8_t *dst) {
    const uint8_t *src_lo = src + 0x8;
    // data_hi holds rows 0-1 (bytes 0..7), data_lo holds rows 2-3 (bytes 8..15)
    __m128i data_hi = _mm_loadl_epi64((const __m128i*)src);
    __m128i data_lo = _mm_loadl_epi64((const __m128i*)src_lo);
    data_hi = unpack_nibbles(data_hi);
    data_lo = unpack_nibbles(data_lo);
    // Transpose: interleave the two rows held in each register
    __m128i transpose_mask = _mm_setr_epi8(0, 0x8, 0x1, 0x9, 0x2, 0xa, 0x3, 0xb, 0x4, 0xc, 0x5, 0xd, 0x6, 0xe, 0x7, 0xf);
    data_hi = _mm_shuffle_epi8(data_hi, transpose_mask);
    data_lo = _mm_shuffle_epi8(data_lo, transpose_mask);
    // Pack nibbles
    __m128i pack_mask = _mm_set1_epi16(0x1001);
    data_hi = _mm_maddubs_epi16(data_hi, pack_mask); // even bytes are multiplied by 0x01, odd bytes by 0x10
    data_lo = _mm_maddubs_epi16(data_lo, pack_mask);
    __m128i data = _mm_packus_epi16(data_hi, data_lo);
    // Interleave the packed row pairs so that each column's two bytes are adjacent
    data = _mm_shuffle_epi8(data, transpose_mask);
    // Unaligned store: dst is not guaranteed to be 16-byte aligned
    _mm_storeu_si128((__m128i*)dst, data);
}
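To make the data flow easier to follow without intrinsics, here is a scalar model of the same steps as I understand them. The function names model_half and transpose_4x8_nibbles_model are made up for this sketch, and the loops only imitate what the shuffle, multiply-add and pack instructions do lane by lane:

```c
#include <assert.h>
#include <stdint.h>

/* Model of the per-half work: unpack 8 bytes into 16 nibbles, interleave
   the two rows they contain, then pack nibble pairs back into 8 bytes. */
static void model_half(const uint8_t half[8], uint8_t packed[8]) {
    uint8_t nib[16], shuf[16];
    /* unpack_nibbles: low nibble first, then high nibble of each byte */
    for (int i = 0; i < 8; i++) {
        nib[2 * i]     = half[i] & 0x0f;
        nib[2 * i + 1] = (uint8_t)(half[i] >> 4);
    }
    /* shuffle with transpose_mask: interleave lanes 0..7 with lanes 8..15 */
    for (int i = 0; i < 8; i++) {
        shuf[2 * i]     = nib[i];
        shuf[2 * i + 1] = nib[i + 8];
    }
    /* maddubs with 0x1001: even byte * 0x01 + odd byte * 0x10 */
    for (int i = 0; i < 8; i++) {
        int sum = shuf[2 * i] * 0x01 + shuf[2 * i + 1] * 0x10;
        assert(sum <= 0xff);   /* so the unsigned-saturating pack never clamps */
        packed[i] = (uint8_t)sum;
    }
}

static void transpose_4x8_nibbles_model(const uint8_t src[16], uint8_t dst[16]) {
    uint8_t hi[8], lo[8];
    model_half(src, hi);       /* rows 0 and 1 */
    model_half(src + 8, lo);   /* rows 2 and 3 */
    /* packus followed by the final shuffle: interleave the packed halves */
    for (int i = 0; i < 8; i++) {
        dst[2 * i]     = hi[i];
        dst[2 * i + 1] = lo[i];
    }
}
```

For the example matrix, hi[0] is 0x40 and lo[0] is 0xc8, which are the first two bytes of the desired result.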