
I have a 2D 4x8 nibble matrix, represented as a 16-byte uint8_t array. Each pair of nibbles i, j is packed into one byte as (j << 4) | i.
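
In scalar terms, the packing looks like this (the helper name is just for illustration):

#include <stdint.h>

// Pack two nibbles i (low) and j (high) into one byte.
static inline uint8_t pack_nibbles(uint8_t i, uint8_t j) {
    return (uint8_t)((j << 4) | i);  // e.g. pack_nibbles(0, 1) == 0x10
}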

For example, given the following matrix:

     0  1  2  3  3  7  1  9
     4  5  6  7  4  1  6 15
     8  9 10 11  3 14  6 11
    12 13 14 15  8 10  7  4

represented as:

const uint8_t matrix[] = {
    0x10, 0x32, 0x73, 0x91,
    0x54, 0x76, 0x14, 0xf6,
    0x98, 0xba, 0xe3, 0xb6,
    0xdc, 0xfe, 0xa8, 0x47,
};

the desired result array would be:

const uint8_t result[] = {
    0x40, 0xc8, 0x51, 0xd9,
    0x62, 0xea, 0x73, 0xfb,
    0x43, 0x83, 0x17, 0xae,
    0x61, 0x76, 0xf9, 0x4b,
};
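
For testing, a straightforward scalar reference (no SIMD) that produces the result above; not the goal, just pinning down the expected behavior:

#include <stdint.h>

// Scalar reference implementation, for testing only.
void transpose_4x8_nibbles_ref(const uint8_t *src, uint8_t *dst) {
    uint8_t m[4][8];  // unpacked nibbles, m[row][col]
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 8; c++) {
            uint8_t b = src[r * 4 + c / 2];
            m[r][c] = (c & 1) ? (b >> 4) : (b & 0x0f);
        }
    // Output row c of the 8x4 transpose is column c of the input,
    // packed two nibbles per byte: (hi << 4) | lo.
    for (int c = 0; c < 8; c++) {
        dst[c * 2 + 0] = (uint8_t)((m[1][c] << 4) | m[0][c]);
        dst[c * 2 + 1] = (uint8_t)((m[3][c] << 4) | m[2][c]);
    }
}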

How can I implement a function that does this as efficiently as possible? Extensions up to AVX2 are fair game.

This is my C implementation so far, based on Nibble shuffling with x64 SIMD. It splits the matrix into two 64-bit inputs, unpacks the nibbles, shuffles them, and re-packs them.

#include <stdint.h>
#include <immintrin.h>

__m128i unpack_nibbles(__m128i src) {
    __m128i nibbles_hi = _mm_srli_epi64(src, 4);

    //Interleave full bytes with their high-nibble shifts [hi|lo, ..|hi, ...],
    //then clear the high nibble of each byte, leaving one nibble per byte
    __m128i unpacked = _mm_unpacklo_epi8(src, nibbles_hi);
    return _mm_and_si128(unpacked, _mm_set1_epi8(0xf));
}

void transpose_4x8_nibbles(const uint8_t *src, uint8_t *dst) {
    const uint8_t *src_lo = src + 0x8;

    __m128i data_hi = _mm_loadl_epi64((const __m128i*)src);
    __m128i data_lo = _mm_loadl_epi64((const __m128i*)src_lo);

    data_hi = unpack_nibbles(data_hi);
    data_lo = unpack_nibbles(data_lo);

    //Transpose
    __m128i transpose_mask = _mm_setr_epi8(0, 0x8, 0x1, 0x9, 0x2, 0xa, 0x3, 0xb, 0x4, 0xc, 0x5, 0xd, 0x6, 0xe, 0x7, 0xf);
    data_hi = _mm_shuffle_epi8(data_hi, transpose_mask);
    data_lo = _mm_shuffle_epi8(data_lo, transpose_mask);

    //Pack nibbles
    __m128i pack_mask = _mm_set1_epi16(0x1001);
    data_hi = _mm_maddubs_epi16(data_hi, pack_mask);  //even bytes are multiplied by 0x01, odd bytes by 0x10
    data_lo = _mm_maddubs_epi16(data_lo, pack_mask);
    
    __m128i data = _mm_packus_epi16(data_hi, data_lo);  //all 16-bit values fit in a byte
    data = _mm_shuffle_epi8(data, transpose_mask);      //interleave the two halves into final row order
    
    _mm_storeu_si128((__m128i*) dst, data);  //unaligned store; dst isn't guaranteed to be 16-byte aligned
}
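
And a quick check against the example data (test scaffolding only, assuming the `matrix` and `result` arrays from above are in scope):

#include <stdio.h>
#include <string.h>

int main(void) {
    uint8_t out[16];
    transpose_4x8_nibbles(matrix, out);
    puts(memcmp(out, result, sizeof out) == 0 ? "OK" : "mismatch");
    return 0;
}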
ValentiMS
  • @DavidRanieri: Pointer math on `void*` is a GNU extension (working like `char*`), but the question has since been edited to be portable. We can delete these comments. – Peter Cordes Aug 23 '22 at 02:34
  • It feels like there *should* be a way to do this without unpack/repack. Like shuffle bytes so the pairs you need are lined up, then vertical bit-blend with AND/ANDN/OR. (And then shuffle again to put the merged bytes where they belong). But a blend doesn't work when we need the same nibble from both, e.g. the first byte of the output is `0x40` from `0x10` ; `0x54` ; the low nibble of each input byte (the starts of the first 2 rows). Perhaps a left-shift of one input by 4 bits can work with 2 blends? – Peter Cordes Aug 23 '22 at 02:44
  • I suspect AVX-512 VBMI for vpmultishiftqb (parallel bitfield extract) could be helpful, perhaps along with 16-bit SIMD rotates (also AVX-512). Or Galois Field GFNI for bit-shuffling (http://0x80.pl/articles/avx512-galois-field-for-bit-shuffling.html). So if you're interested in also making a version for those ISA extensions, there might be something to gain. – Peter Cordes Aug 23 '22 at 02:46
  • I think I have a solution with 1 `pshufb`, 1 `pmaddwd` and 4 bitops (unfortunately, `pmaddwd` takes signed inputs). I'll write an answer later today. – chtz Aug 23 '22 at 10:59

1 Answer

Let's name the nibbles as follows (everything in little-endian order):

X0 Y0 X1 Y1 X2 Y2 X3 Y3
Z0 W0 Z1 W1 Z2 W2 Z3 W3
X4 Y4 X5 Y5 X6 Y6 X7 Y7
Z4 W4 Z5 W5 Z6 W6 Z7 W7

After transposing, we note that X nibbles stay in the low nibble, W nibbles stay in the high nibble, Y nibbles move from high to low, and Z nibbles move from low to high:

X0 Z0 X4 Z4
Y0 W0 Y4 W4
X1 Z1 X5 Z5
Y1 W1 Y5 W5
X2 Z2 X6 Z6
Y2 W2 Y6 W6
X3 Z3 X7 Z7
Y3 W3 Y7 W7

That means the X and W nibbles can be placed correctly by a simple pshufb. The Z nibbles all need to be shifted up (or multiplied by 0x10), and the Y nibbles need to be shifted down (or multiplied as a uint16 block by 0x1000, taking the upper half of the result).

One block of 00 Z0 00 Z4 Y0 00 Y4 00 is effectively a 32-bit integer, and we can almost directly get it from Z0 00 Z4 00 and 00 Y0 00 Y4 with a single pmaddwd instruction, using the multipliers 0x10 and 0x1000:

00 Z0 00 Z4 Y0 00 Y4 00 = (00 Y0 00 Y4) * 0x1000 + (Z0 00 Z4 00) * 0x10

And these nibbles happen to be in the same bytes as X0, X4 and W0, W4, so only one pshufb is necessary to arrange the bytes accordingly. Unfortunately, if Y4 > 7 we get a negative integer, which requires masking away some bits again (at least we can re-use the same mask).
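
Spelled out in scalar form for one 32-bit lane (purely illustrative; `madd_lane` is not part of the actual code):

#include <stdint.h>

// After masking, one lane holds the nibbles 00 Y0 00 Y4 Z0 00 Z4 00
// (low nibble first). pmaddwd computes lo16 * 0x1000 + hi16 * 0x0010:
uint32_t madd_lane(uint32_t lane) {
    int16_t lo = (int16_t)(lane & 0xffff);  // Y0*0x10 + Y4*0x1000
    int16_t hi = (int16_t)(lane >> 16);     // Z0 + Z4*0x100
    // Result nibbles: 00 Z0 00 Z4 Y0 00 Y4 00, except the top nibble
    // picks up sign garbage when Y4 > 7, hence the second AND with the mask.
    return (uint32_t)(lo * 0x1000 + hi * 0x0010);
}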

Overall, this function should do the job:

void transpose_4x8_nibbles(uint8_t const *src, uint8_t *dst) {
    __m128i const input = _mm_loadu_si128((__m128i const*)src);

    // Gather the four bytes of each output 32-bit block: bytes 0,8,4,12 form block 0, etc.
    __m128i const shuff = _mm_shuffle_epi8(input, _mm_setr_epi8(0, 8, 4, 12, 1, 9, 5, 13, 2, 10, 6, 14, 3, 11, 7, 15));
    __m128i const mask = _mm_set1_epi32(0x0f0ff0f0);
    __m128i const XW = _mm_andnot_si128(mask, shuff);  // X and W nibbles, already in place
    __m128i const YZ = _mm_and_si128(mask, shuff);     // Y and Z nibbles, still need to move
    // Per 32-bit lane: lo16 * 0x1000 moves Y into the low nibbles of the upper bytes,
    // hi16 * 0x0010 shifts Z up into the high nibbles
    __m128i const YZ_trans = _mm_madd_epi16(YZ, _mm_set1_epi32(0x00101000));
    // Mask again to clear the sign garbage from negative lanes, then merge
    __m128i const result = _mm_or_si128(XW, _mm_and_si128(mask, YZ_trans));

    _mm_storeu_si128((__m128i*)dst, result);
}

Godbolt demo (SSSE3 is the only extension required, for pshufb): https://godbolt.org/z/c43oTz43r
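
Since extensions up to AVX2 are fair game: vpshufb, vpmaddwd and the bit-ops all work within 128-bit lanes, so the same sequence should transpose two matrices stored back-to-back per call. A sketch of that variant (untested, the function name is mine):

#include <stdint.h>
#include <immintrin.h>

// Transpose two independent 4x8 nibble matrices (32 contiguous bytes) at once.
void transpose_4x8_nibbles_x2(uint8_t const *src, uint8_t *dst) {
    __m256i const input = _mm256_loadu_si256((__m256i const*)src);
    // Same shuffle control in both 128-bit lanes (vpshufb is in-lane):
    __m256i const idx = _mm256_broadcastsi128_si256(
        _mm_setr_epi8(0, 8, 4, 12, 1, 9, 5, 13, 2, 10, 6, 14, 3, 11, 7, 15));
    __m256i const shuff = _mm256_shuffle_epi8(input, idx);
    __m256i const mask = _mm256_set1_epi32(0x0f0ff0f0);
    __m256i const XW = _mm256_andnot_si256(mask, shuff);
    __m256i const YZ = _mm256_and_si256(mask, shuff);
    __m256i const YZ_trans = _mm256_madd_epi16(YZ, _mm256_set1_epi32(0x00101000));
    __m256i const result = _mm256_or_si256(XW, _mm256_and_si256(mask, YZ_trans));
    _mm256_storeu_si256((__m256i*)dst, result);
}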

chtz