
I'm aware of byte shuffling instructions, but I'd like to do the same with nibbles (4-bit values); concretely, I'd like to shuffle 16 nibbles in a 64-bit word. My shuffling indices are also stored as 16 nibbles. What's the most efficient implementation of this?

András Kovács

2 Answers


Arbitrary shuffles with a control vector that has to be stored this way? Ugh, hard to work with. I guess you'd have to unpack both to feed SSSE3 pshufb and then re-pack that result.

Probably just punpcklbw against a right-shifted copy, then AND mask to keep only the low 4 bits in each byte. Then pshufb.

Sometimes an odd/even split is easier than widening each element (so bits just stay within their original byte or word). In this case, if we could change your nibble index numbering, punpcklqdq could put the odd or even nibbles in the high half, ready to bring them back down and OR.

But without doing that, re-packing is a separate problem. I guess you'd combine each adjacent pair of bytes down into the low byte of a 16-bit word, perhaps with pmaddubsw if throughput is more important than latency. Then you can packuswb (against zero or itself) or pshufb (with a constant control vector).

If you were doing multiple such shuffles, you could pack two vectors down to one, to store with movhps / movq. Using AVX2, it might be possible to have all the other instructions working on two independent shuffles in the two 128-bit lanes.

// UNTESTED, requires only SSSE3
#include <stdint.h>
#include <immintrin.h>

uint64_t shuffle_nibbles(uint64_t data, uint64_t control)
{
  __m128i vd = _mm_cvtsi64_si128(data);    // movq
  __m128i vd_hi = _mm_srli_epi32(vd, 4);   // x86 doesn't have a SIMD byte shift
  vd = _mm_unpacklo_epi8(vd, vd_hi);       // every nibble at the bottom of a byte, with high garbage
  vd = _mm_and_si128(vd, _mm_set1_epi8(0x0f));  // clear high garbage for later merging

  __m128i vc = _mm_cvtsi64_si128(control);
  __m128i vc_hi = _mm_srli_epi32(vc, 4);
  vc = _mm_unpacklo_epi8(vc, vc_hi);

  vc = _mm_and_si128(vc, _mm_set1_epi8(0x0f));  // make sure high bit is clear, else pshufb zeros that element.
       //  AVX-512VBMI  vpermb doesn't have that problem, if you have it available
  vd = _mm_shuffle_epi8(vd, vc);

       // left-hand input is the unsigned one, right hand is treated as signed bytes.
  vd = _mm_maddubs_epi16(vd, _mm_set1_epi16(0x1001));  // hi nibbles << 4 (*= 0x10), lo nibbles *= 1.

  // vd has nibbles merged into bytes, but interleaved with zero bytes
  vd = _mm_packus_epi16(vd, vd);  // duplicate vd into low & high halves.
  //  Pack against _mm_setzero_si128() if you're not just going to movq into memory or a GPR and you want the high half of the vector to be zero.
  return _mm_cvtsi128_si64(vd);
}
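
Picking up the AVX2 idea from a few paragraphs up: vpunpcklbw, vpshufb, vpmaddubsw and vpackuswb all work within 128-bit lanes, so the same sequence can process two independent shuffles at once, one per lane. A rough sketch of what that could look like (untested; the function name and results-via-pointers interface are my own choices):

// UNTESTED, requires AVX2: two independent nibble shuffles, one per 128-bit lane.
// Same #includes as above.
void shuffle_nibbles_x2(uint64_t data0, uint64_t control0,
                        uint64_t data1, uint64_t control1,
                        uint64_t *out0, uint64_t *out1)
{
  __m256i vd = _mm256_set_epi64x(0, data1, 0, data0);       // lane 0 = first pair, lane 1 = second
  __m256i vc = _mm256_set_epi64x(0, control1, 0, control0);

  __m256i vd_hi = _mm256_srli_epi32(vd, 4);
  vd = _mm256_unpacklo_epi8(vd, vd_hi);                     // per-lane unpack, like the SSSE3 version
  vd = _mm256_and_si256(vd, _mm256_set1_epi8(0x0f));

  __m256i vc_hi = _mm256_srli_epi32(vc, 4);
  vc = _mm256_unpacklo_epi8(vc, vc_hi);
  vc = _mm256_and_si256(vc, _mm256_set1_epi8(0x0f));

  vd = _mm256_shuffle_epi8(vd, vc);                         // vpshufb shuffles within each 128-bit lane
  vd = _mm256_maddubs_epi16(vd, _mm256_set1_epi16(0x1001));
  vd = _mm256_packus_epi16(vd, vd);

  *out0 = _mm_cvtsi128_si64(_mm256_castsi256_si128(vd));
  *out1 = _mm_cvtsi128_si64(_mm256_extracti128_si256(vd, 1));
}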

Masking the data with 0x0f ahead of the shuffle (instead of after) allows more ILP on CPUs with two shuffle units. At least if they already had the uint64_t values in vector registers, or if the data and control values are coming from memory so both can be loaded in the same cycle. If they're coming from GPRs, the 1/clock throughput of vmovq xmm, reg means there's a resource conflict between the dep chains, so they can't both start in the same cycle. But since the data might be ready before the control, masking early keeps it off the critical path for control->output latency.

If latency is a bottleneck instead of the usual throughput, consider replacing pmaddubsw with right-shift by 4, por, and AND/pack. Or pshufb to pack while ignoring garbage in odd bytes. Since you'd need another constant anyway, might as well make it a pshufb constant instead of and.
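
For example, the tail of shuffle_nibbles above (the pmaddubsw / packuswb / movq part) could be replaced with something like this untested sketch. The pshufb constant is my own, and any shift element size works here; the bits that end up in the even bytes don't cross a 16-bit boundary:

  // after the pshufb with vc: byte 2i = low nibble, byte 2i+1 = high nibble of output byte i, both 0..15
  vd = _mm_or_si128(vd, _mm_srli_epi16(vd, 4));   // even bytes become lo | hi<<4; odd bytes are ignored
  vd = _mm_shuffle_epi8(vd, _mm_setr_epi8(0, 2, 4, 6, 8, 10, 12, 14,
                                          -1, -1, -1, -1, -1, -1, -1, -1));  // gather even bytes, zero the rest
  return _mm_cvtsi128_si64(vd);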

If you had AVX-512, a shift and bit-blend with vpternlogd could avoid needing to mask the data before shuffling, and vpermb instead of vpshufb would avoid needing to mask the control, so you'd avoid the set1_epi8(0x0f) constant entirely.

clang's shuffle optimizer didn't spot anything here; it just compiles it as written, like GCC does (https://godbolt.org/z/xz7TTbM1d), even with -march=sapphirerapids, not spotting that it could use vpermb instead of vpand / vpshufb.

shuffle_nibbles(unsigned long, unsigned long):
        vmovq   xmm0, rdi
        vpsrld  xmm1, xmm0, 4
        vpunpcklbw      xmm0, xmm0, xmm1        # xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3],xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]
        vmovq   xmm1, rsi
        vpsrld  xmm2, xmm1, 4
        vpunpcklbw      xmm1, xmm1, xmm2        # xmm1 = xmm1[0],xmm2[0],xmm1[1],xmm2[1],xmm1[2],xmm2[2],xmm1[3],xmm2[3],xmm1[4],xmm2[4],xmm1[5],xmm2[5],xmm1[6],xmm2[6],xmm1[7],xmm2[7]
        vmovdqa xmm2, xmmword ptr [rip + .LCPI0_0] # xmm2 = [15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15]
        vpand   xmm0, xmm0, xmm2
        vpand   xmm1, xmm1, xmm2
        vpshufb xmm0, xmm0, xmm1
        vpmaddubsw      xmm0, xmm0, xmmword ptr [rip + .LCPI0_1]
        vpackuswb       xmm0, xmm0, xmm0
        vmovq   rax, xmm0
        ret

(Without AVX, it requires 2 extra movdqa register-copy instructions.)

Peter Cordes
  • Curses. I thought I'd come up with a clever way to recombine the nibbles after shuffling - OR'ing shifts and a final shuffle into [63:0]... this is much better. SIMD instruction sets are getting too hard for me to remember as they keep expanding. – Brett Hale Apr 22 '22 at 14:24
  • Was there a particular reason for `_mm_srli_epi32` instead of, say, `_mm_srli_epi64` in this case? – Brett Hale Apr 22 '22 at 14:26
  • @BrettHale: I figure 32-bit element size is likely to be at least as fast as anything else on any CPU, if there's ever any difference, except for multiplies. For some operations (like `paddq` and `pcmpgtq`) 64-bit element size has a longer opcode and/or is slower on some CPUs, so I never go with `epi64` when other sizes would be equally good (except for shuffles where fewer larger elements *is* faster). For materializing an all-ones register, IIRC GCC picks `pcmpgtd`, which I think is what led me to start choosing 32 as a "default" when any size worked. – Peter Cordes Apr 22 '22 at 20:46
  • @BrettHale: 32 thus seems a good default to me when you have to emulate some other size. (In this case `..._epi8` via shift and masking). The closest available size to what we want is `..._epi16`, which makes it the other intuitive choice. I avoid that partly because I want to remind beginners that this emulation technique does not depend on the shift width being only "one size bigger" than the shift width we want to emulate, that there'd be nothing special about 16-bit for this. – Peter Cordes Apr 22 '22 at 20:52
  • @BrettHale: There is no downside to `psrlq xmm,imm8` (`_mm_srli_epi64`) though. It has the same code-size, and performance on all(?) CPUs as `psrld`/`w`. So it is a fully arbitrary choice. That makes sense, unlike carry-propagation for addition, wider shifts aren't much harder. One wider barrel shifter vs. multiple narrower ones does maybe mean more gate-delays, so it's imaginable that some CPUs might not run it as efficiently. (It's even possible you might save negligible amounts of power by using 16-bit elements. Probably less total ever than it took to read this comment, though :/) – Peter Cordes Apr 22 '22 at 20:59
  • I've included a [test](https://github.com/brettyhale/so-snippets/blob/main/so.71936833.c), with your code prologue / epilogue to the shuffle. Some test vectors included: [https://godbolt.org/z/qMca4sPbh](https://godbolt.org/z/qMca4sPbh) – Brett Hale Apr 24 '22 at 12:58
  • @BrettHale: Thanks. Interesting that clang14 deoptimizes your `bslli` by 1 into a `pshufb`. clang11 optimizes it into `vpsllw xmm0, xmm0, 8` because I guess it sees that the final `pshufb` is only taking the odd elements, so zeroing the evens doesn't matter. Are two separate shifts really necessary? Can we just right shift `hi` by 4, instead of left by 4 and left by 8? `hi` nibbles are at odd positions, so they don't need to cross a 16-bit boundary, so yeah, I think `hi = v_dst >> 4` (with any element size) / `v_dst |= hi;` would work, then your shuffle constant takes evens instead of odd – Peter Cordes Apr 24 '22 at 13:15
  • @BrettHale: Yup, that does work. https://godbolt.org/z/Y655YKqhG `v_dst | (v_dst>>4)` takes advantage of GNU C syntax where `__m128i` is a vector of `unsigned long long` elements. Of course you can `_mm_` whatever it. My first attempt using `v_dst | (hi>>4)` compiled using some old `hi` variable >.< No good solution to that naming problem; Inventing a new name for every temporary makes it easier to accidentally reference the wrong old value, not helping defend against it. – Peter Cordes Apr 24 '22 at 13:29
  • Can't believe I missed that. – Brett Hale Apr 24 '22 at 13:37
  • @BrettHale: Heh, I know the feeling. I've definitely come up with over-complicated things that other SO users (or sometimes clang's shuffle optimizer) were able to improve on. – Peter Cordes Apr 24 '22 at 13:46
  • [Updated](https://godbolt.org/z/f7h5jd4Ps) anyway... – Brett Hale Apr 24 '22 at 13:47
  • @BrettHale: Interesting, clang can fully constant-propagate through your shift/or/pshufb way (after inlining into main), but not for my pmadd/pack way. GCC inlines the whole thing, starting from loading constants and vpand. :/ – Peter Cordes Apr 24 '22 at 13:59
  • I know it's a deterministic result, but that's pretty impressive constant propagation. – Brett Hale Apr 24 '22 at 14:33
  • @BrettHale: Yeah, clang/LLVM tracks data through shuffles; that's how it can re-optimize (or pessimize) shuffles, and notice stuff like `psllw` being sufficient because only the odd elements are being read. Once it has that impressive machinery in place, constant-propagation should be easy through the `and` and bit-shift / `or` operations. – Peter Cordes Apr 24 '22 at 22:24

I came across this problem today. In AVX-512 you can use vpmultishiftqb, an amusing instruction available on Ice Lake and later (and apparently on Zen 4, according to Wikipedia), to shuffle nibbles much more quickly. Its power lies in its ability to grab bytes at unaligned positions: for each control byte, it selects an unaligned 8-bit chunk from the corresponding 64-bit source element, starting at the bit offset given by that control byte. Below is an implementation.

#include <immintrin.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

// Convention: (a & (0xf << (4 * i))) >> (4 * i) is the ith nibble of a
// (i.e., lowest-significant is 0)
uint64_t shuffle_nibbles(uint64_t data, uint64_t indices) {
#if defined(__AVX512VBMI__) && defined(__AVX512VL__)
    // If your data is already in vectors, then this method also works in parallel
    const __m128i lo_nibble_msk = _mm_set1_epi8(0x0f);
    __m128i v_data = _mm_cvtsi64_si128(data);
    __m128i v_indices = _mm_cvtsi64_si128(indices);

    __m128i indices_lo = _mm_and_si128(lo_nibble_msk, v_indices);
    __m128i indices_hi = _mm_andnot_si128(lo_nibble_msk, v_indices);
    indices_lo = _mm_slli_epi32(indices_lo, 2);
    indices_hi = _mm_srli_epi32(indices_hi, 2);

    // Look up unaligned bytes
    __m128i shuffled_hi = _mm_multishift_epi64_epi8(indices_hi, v_data);
    __m128i shuffled_lo = _mm_multishift_epi64_epi8(indices_lo, v_data);

    shuffled_hi = _mm_slli_epi32(shuffled_hi, 4);
    // msk ? lo : hi
    __m128i shuffled = _mm_ternarylogic_epi32(lo_nibble_msk, shuffled_lo, shuffled_hi, 202);

    return _mm_cvtsi128_si64(shuffled);
#else
    // Fallback scalar implementation (prefer Peter Cordes's SSE solution; this is just an example)
    uint64_t result = 0;
    for (int i = 0; i < 16; ++i) {
        indices = (indices >> 60) + (indices << 4);

        int idx = indices & 0xf;
        result <<= 4;
        result |= (data >> (4 * idx)) & 0xf;
    }

    return result;
#endif
}

int main() {
        // 0xaa025411fe034102
        uint64_t r1 = shuffle_nibbles(0xfedcba9876543210, 0xaa025411fe034102);
        // 0x55fdabee01fcbefd
        uint64_t r2 = shuffle_nibbles(0x0123456789abcdef, 0xaa025411fe034102);
        // 0xaaaa00002222aaaa
        uint64_t r3 = shuffle_nibbles(0xaa025411fe034102, 0xeeee11110000ffff);

        printf("0x%" PRIx64 "\n", r1);
        printf("0x%" PRIx64 "\n", r2);
        printf("0x%" PRIx64 "\n", r3);
}

Clang yields:

.LCPI0_0:
        .zero   16,60
shuffle_nibbles(unsigned long, unsigned long):
        vmovq   xmm0, rdi
        vmovq   xmm1, rsi
        vpslld  xmm2, xmm1, 2
        vpsrld  xmm1, xmm1, 2
        vmovdqa xmm3, xmmword ptr [rip + .LCPI0_0] # xmm3 = [60,60,60,60,60,60,60,60,60,60,60,60,60,60,60,60]
        vpand   xmm1, xmm1, xmm3
        vpmultishiftqb  xmm1, xmm1, xmm0
        vpand   xmm2, xmm2, xmm3
        vpmultishiftqb  xmm0, xmm2, xmm0
        vpslld  xmm1, xmm1, 4
        vpternlogd      xmm1, xmm0, dword ptr [rip + .LCPI0_1]{1to4}, 216
        vmovq   rax, xmm1

In my case, I am shuffling nibbles within 64-bit-element vectors; this method also avoids the need for widening. If your shuffle control is constant and you stay in vectors, this method reduces to a measly four instructions: 2x vpmultishiftqb, 1x vpslld, and 1x vpternlogd. Counting µops suggests a latency of 5 cycles and a throughput of one shuffle every 2 cycles, bottlenecked on shuffle µops, for 128- and 256-bit vectors; and a throughput of one every 3 cycles for 512-bit vectors, due to fewer execution units for the latter two instructions.
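
For reference, the stay-in-vectors case might look something like the following untested sketch. The helper name is mine, and it assumes the two control vectors were already prepared as 4*index bit offsets, i.e. the indices_lo / indices_hi values computed in shuffle_nibbles() above (same AVX512VBMI + AVX512VL requirement):

// UNTESTED: data and pre-scaled controls already in vectors (e.g. a constant shuffle)
static inline __m128i shuffle_nibbles_vec(__m128i v_data, __m128i ctrl_lo, __m128i ctrl_hi)
{
    __m128i lo = _mm_multishift_epi64_epi8(ctrl_lo, v_data);  // low 4 bits of each byte = wanted low nibble
    __m128i hi = _mm_multishift_epi64_epi8(ctrl_hi, v_data);
    hi = _mm_slli_epi32(hi, 4);                               // move wanted nibble up to bits 4..7
    // bit-blend: low nibbles from lo, high nibbles from hi
    return _mm_ternarylogic_epi32(_mm_set1_epi8(0x0f), lo, hi, 202);
}

As in the compiled output above, the 0x0f mask can be folded into a broadcast memory operand of vpternlogd, so the per-shuffle work is just the four instructions listed.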

Ovinus Real
  • vpmultishiftqb is basically a parallel bitfield extract (or a building block for that). Very handy for integer -> hex string which also involves nibble manipulations (but with *fixed* movement, not like here), possible in basically 2 instructions with vpmultishift (with a broadcast memory source) + vpermb. [How to convert a binary integer number to a hex string?](https://stackoverflow.com/q/53823756) But here where you need to recombine back to packed nibbles in a different order, yeah probably doing it 2x odd/even makes sense, instead of broadcasting. – Peter Cordes Oct 22 '22 at 20:05
  • Lovely. After learning SIMD optimization with AVX2, I must say that AVX512 has been a joy to use. Wish it were more widely available, even just sticking with 256-bit vectors. – Ovinus Real Oct 22 '22 at 20:12
  • Hell yeah. I'm really unimpressed with Intel dropping the ball on it, and keeping all those amazingly useful instructions locked away without a way to expose 256-bit-only support for them. Or for not actually providing AVX-512 (market segmentation?) in the successors to Skylake-client, which already had the hardware in silicon. And even regressing support in Alder Lake, because 512-bit vectors are expensive and they didn't want to implement them in the E-cores. (And worse, won't enable it even with no E-cores in later models :/) Zen4 has the right idea. – Peter Cordes Oct 22 '22 at 20:19