
Is there a relatively cheap way to extract the four edges (rows 0 and 15, and columns 0 and 15) of a 16x16 bitmatrix stored in a __m256i into four 16b lanes of a __m256i? I don't care which lanes the output ends up in, or if there is garbage in the rest of the register. Mild preference for all of them to be in the low half, but only mild.

Extracting the 'top' and 'bottom' is easy - it's just the first and last 16b elements of the vector, done - but the sides are another matter. You need the first and last bits of each 16b element, which gets complicated.

You can do it with a full bit-transpose, like so:

// Full bit-transpose of input viewed as a 16x16 bitmatrix.
extern __m256i transpose(__m256i m);

__m256i get_edges(__m256i m) {
    __m256i t = transpose(m);
    // We only care about first and last u16 of each
    // m = [abcdefghijklmnop]
    // t = [ABCDEFGHIJKLMNOP]
    m = _mm256_permutevar8x32_epi32(m, _mm256_set_epi32(0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x7, 0x0));
    // m = [............a..p]
    t = _mm256_permutevar8x32_epi32(t, _mm256_set_epi32(0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x7, 0x0));
    // t = [............A..P]

    __m256i r = _mm256_unpacklo_epi16(t, m);
    // r = [........aA....pP]
    return r; // output in low and high dwords of low half
}

... but that just reduces one surprisingly annoying problem to another surprisingly annoying problem - I can't see how to cheaply do a full bit-transpose of a __m256i.

Ditto, there might be something _mm256_movemask_epi8-esque that could do the trick - but nothing jumps out at me.

Is there a better approach?

    Yeah, `_mm256_movemask_epi8` is the key I think; use it to get the high bits (interleaved with garbage), then `movemask_epi8( v<<15 )` to get the low bits. Packing those down to remove the garbage (or zeros) is trivial with BMI2 `pext`, but if you need this to be fast on Zen and Zen 2 (not just Intel), then that's harder. There's unfortunately no `movemask_epi16`, and unpacking to 32-bit elements for 2x `movemask_epi32` (e.g. with `vpmovsxwd`) would take significantly more instructions. – Peter Cordes Dec 31 '22 at 04:17
  • The output has to be a `__m256i`? You don't eventually want them as separate 16-bit scalar integers? So if we do use movemask and scalar `pext`, we'd need to get them back into a vector. – Peter Cordes Dec 31 '22 at 04:18
  • Related: [How to create a byte out of 8 bool values (and vice versa)?](https://stackoverflow.com/a/51750902) might be usable for packing 2-bit fields down to 1. But probably not, those multiply tricks usually require each element to be large enough to not overlap garbage into the part of the product you want. [Does the x86 architecture support packing bools as bits to parallelize logic operations?](https://stackoverflow.com/a/74224417) mentions that and some AVX-512 stuff; AVX-512 would make this very easy with `vpmovw2m` and `vptestmw` – Peter Cordes Dec 31 '22 at 04:23
    [How to efficiently de-interleave bits (inverse Morton)](https://stackoverflow.com/q/4909263) / [How to de-interleave bits (UnMortonizing?)](https://stackoverflow.com/q/3137266) has some good answers. Packing the two `vpmovmskb` results into a 64-bit integer, those scalar bithacks could be applied to the combination to get the concatenation of the two 16-bit elements you want, ready for a `vmovd` or `vpinsrd` back into a vector. – Peter Cordes Dec 31 '22 at 04:26
  • Alright, thanks for the pointers. This is looking like it'll be more expensive than expected; I'll have to rethink this I think. – TLW Dec 31 '22 at 08:01
  • It's only a half a dozen uops if you have fast PEXT (Intel since Haswell, AMD since Zen 3). 2x `vpmovmskb` before/after `vpsllw ymm,15`, then `shl`/`or` to merge and `pext` to grab them both. Or 2x `pext` to have them separately. (Plus whatever vector shuffle you want to for the top/bottom rows, e.g. `vextracti128` / `vpalignr low, high, 2` to get the top and bottom 16-bit chunks together.) – Peter Cordes Dec 31 '22 at 08:13

1 Answer


With fast BMI2 pext (Haswell or Zen 3 and later), that's one option if you start with vpmovmskb + shift + vpmovmskb to get the bits of the edges (interleaved with garbage bits, since we want every 16th but we get every 8th).

9 uops for the front-end, 3 of them needing port 5 on Intel Skylake-family. (Not counting the integer constant setup, assuming you'd do this in a loop; if not, the movabs counts against this, too.)

__m128i edges_zen3_intel(__m256i v)
{
    __m128i vtop_bottom = _mm256_castsi256_si128( 
          _mm256_permute4x64_epi64(v, _MM_SHUFFLE(0,0, 3, 0)) );
    // vpermq: 3 uops on Zen1, 2 on Zen2&3, 1 on Zen4 and Intel.

   // side bits interleaved with garbage
   // without AVX-512 we can only extract a bit per byte, dword, or qword
   unsigned left = _mm256_movemask_epi8(v);   // high bit of each element
   unsigned right = _mm256_movemask_epi8( _mm256_slli_epi16(v, 15) );  // low<<15

//   left = _pext_u32(left, 0xAAAAAAAAul);  // take every other bit starting with #1
//   right = _pext_u32(right, 0xAAAAAAAAul);
    // then combine or do whatever

   uint64_t lr = ((uint64_t)left << 32) | right;
   lr = _pext_u64(lr, 0xAAAAAAAAAAAAAAAAull);

   //__m128i vsides = _mm_cvtsi32_si128(lr);
   __m128i vtblr = _mm_insert_epi32(vtop_bottom, lr, 1);  // into an unused space
   // u16 elems: [ top | x | x | x | left | right | x | bottom ]

   return vtblr;
}

This compiles to 10 uops for Intel CPUs (and Zen 4), including getting everything back into one SIMD vector. The movabs can be hoisted out of loops. SHL/OR don't compete for SIMD execution-port throughput (able to run on port 6 on Intel), but do compete for the front-end. Godbolt

# Haswell/Skylake uop counts
edges_zen3_intel(long long __vector(4)):
        vpsllw  ymm2, ymm0, 15             # p0 (or p01 on Skylake)
        vpmovmskb       eax, ymm0          # p0
        vpermq  ymm1, ymm0, 12             # p5
        vpmovmskb       edx, ymm2          # p0
        sal     rax, 32                    # p06
        or      rax, rdx                   # p0156
        movabs  rdx, -6148914691236517206  # p0156 (and can be hoisted out of loops)
        pext    rax, rax, rdx              # p1
        vpinsrd xmm0, xmm1, eax, 1         # 2 p5.  On Intel, both uops compete with shuffles
        ret

As a variation, we could maybe get left and right edges together for one vpmovmskb, if we can left-shift the odd bytes but not the evens? Probably not, _mm256_maddubs_epi16 with _mm256_set1_epi16(0x0180) can't do that, it adds horizontal pairs, and a left-shift of 7 (0x80 = 1<<7) isn't enough, we'd need 8 to get the top bit back to the top.

Or if we vpsllw + vpacksswb, then use pext with the right masks to group bits, like 0x00ff00ff. But that's getting closer to my non-pext idea below; maybe it's better even if we do have fast pext.
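
If you do go that way, it's only a few lines (untested sketch; the mask choice assumes the in-lane vpacksswb layout described in the next section, so double-check which byte-groups hold which edge):

unsigned pack_bits = _mm256_movemask_epi8(
                         _mm256_packs_epi16(v, _mm256_slli_epi16(v, 15)) );
// with the in-lane pack, the sign bits of v (one edge) land in mask byte-groups 0 and 2,
// the shifted-up low bits (the other edge) in byte-groups 1 and 3
unsigned left  = _pext_u32(pack_bits, 0x00FF00FFu);
unsigned right = _pext_u32(pack_bits, 0xFF00FF00u);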

Without fast BMI2 pext - saturating pack the vector to reduce to 8-bit elements

This might be faster even if pext is fast.

Packing with signed saturation always preserves the sign bit, so you can narrow 16 to 8-bit without losing the information you want to keep. We want to do this to the high and low bit of each word (16-bit element), so a 2:1 pack with the original and v<<15 is perfect.

Except for the fact that AVX2 vpacksswb ymm is two separate in-lane pack operations, so we end up with 8-element chunks interleaved. We could fix that up right after packing with vpermq, but it's multiple uops on Zen 1 through Zen 3, and we can instead shuffle bytes after getting the movemask result back into a vector register. (The same vpshufb can move around the high and low elements.)

// avoiding PEXT because it's slow on Zen 2 and Zen 1 (and Excavator)
// This might be good on Intel and Zen 3, maybe comparable to using PEXT
__m128i edges_no_pext(__m256i v)
{
    __m128i vhi = _mm256_extracti128_si256(v, 1);  // vextracti128; contains top, as vhi.u16[7]
    __m128i vlo = _mm256_castsi256_si128(v);   // contains bottom, as vlo.u16[0], contiguous if concatenated the right way
    __m128i bottom_top = _mm_alignr_epi8(vhi, vlo, 12);  // rotate bottom :top down to the 2nd dword [ x | x | bottom:top | x]

   // vpermq ymm, ymm, imm would also work to get them into the low 128
   // but that's 3 uops on Zen1, 2 on Zen2&3, 1 on Zen4 and Intel.
   // and would need a slightly more expensive vpinsrd instead of vmovd+vpblendd

   // On Intel CPUs (and Zen4) vpermq is better; we pshufb later so we can get the bytes where we want them.
   // A compromise is to use vextracti128+vpblendd here, vpinsrd later
   //   __m128i bottom_top = _mm_blend_epi32(vhi, vlo, 0b0001);
                    // [ hi | x | x | x   |   x | x | x | lo ]

    __m256i vright = _mm256_slli_epi16(v, 15);
    __m256i vpacked = _mm256_packs_epi16(v, vright);   // pack now, shuffle bytes later.
    unsigned bits = _mm256_movemask_epi8(vpacked);    // [ left_hi | right_hi | left_lo | right_lo ]

    __m128i vsides = _mm_cvtsi32_si128(bits);
    __m128i vtblr = _mm_blend_epi32(bottom_top, vsides, 0b0001);  // vpinsrd xmm0, eax, 0 but the merge can run on more ports

    __m128i shuffle = _mm_set_epi8(-1,-1,-1,-1, -1,-1,-1,-1,
                                   7,6,5,4, 3,1, 2,0);
     // swap middle 2 bytes of the low dword, fixing up the in-lane pack
     vtblr = _mm_shuffle_epi8(vtblr, shuffle);
     return vtblr;   // low 4 u16 elements are (MSB) top | bottom | left | right  (LSB)
}

This compiles pretty nicely (see earlier Godbolt link), although GCC4.9 and later (and clang) pessimize my vmovd+vpblendd into vpinsrd, even with -march=haswell or Skylake where it's 2 uops for port 5 (https://uops.info/) when most of the other instructions in the function are also shuffles that only run on port 5. (This is much more shuffle-heavy for Intel CPUs.)

Using vpblendd instead of vpalignr would make it less bad for Intel, like __m128i bottom_top = _mm_blend_epi32(vhi, vlo, 0b0001);, to get to the same situation as in the vpermq version below with 2 uops even on Zen 1. But this is just saving 1 uop on Zen 1 and is equal or worse everywhere else.

# GCC12 -O3 -march=haswell
# uop counts for Skylake
edges_no_pext:
        vextracti128    xmm1, ymm0, 0x1        # p5
        vpsllw  ymm2, ymm0, 15                 # p01
        vpalignr        xmm1, xmm1, xmm0, 12   # p5
        vpacksswb       ymm0, ymm0, ymm2       # p5
        vpmovmskb       eax, ymm0              # p0
        vpinsrd xmm0, xmm1, eax, 0             # 2 p5
        vpshufb xmm0, xmm0, XMMWORD PTR .LC0[rip]  # p5
        ret

So that's 6 uops for port 5 on Intel, a throughput bottleneck of 1 per 6 cycles. vs. the PEXT version being 3 uops that need port 0, 3 that need port 5. But this is only 8 total uops for the front-end, vs. 9 for the pext version. And the vpermq version saves one more on Intel, assuming GCC doesn't waste the vmovdqa after inlining.

If you didn't care about zeroing the upper 8 bytes of the output vector, the shuffle constant could be loaded with vmovq and just be 8 bytes instead of 16 (a vmovq load zero-extends, so the upper shuffle indices would be 0 instead of -1, copying byte 0 instead of zeroing). But compilers will probably not spot that optimization.
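
Done by hand it would look something like this (untested; the u64 is just the low 8 bytes of the edges_no_pext shuffle constant, written as a little-endian integer):

// vmovq load zero-extends, so the upper 8 shuffle indices become 0 instead of -1:
// vpshufb then copies byte 0 into the upper 8 output bytes instead of zeroing them.
static const uint64_t shuf_lo8 = 0x0706050403010200ull;   // bytes 0,2,1,3, 4,5,6,7
__m128i shuffle = _mm_loadl_epi64((const __m128i*)&shuf_lo8);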

Since compilers insist on pessimizing to vpinsrd, on CPUs with fast vpermq (Intel and Zen4), we might as well use that:

If you're only going to have one non-GFNI AVX2 version, this is probably a good tradeoff

vpermq being 3 uops on Zen 1 isn't much worse than emulating what we need from it using 2 instructions, and it's better on Intel CPUs. And probably about break-even on Zen 2 and Zen 3, modulo differences in back-end port usage.

// for fast vpermq, especially if compilers are going to pessimize vmovd(p5)+vpblendd (p015) into vpinsrd (2p5).
// good on Intel and Zen 4, maybe also Zen 3 and not bad on Zen 2.
__m128i edges_no_pext_fast_vpermq(__m256i v)
{
   __m128i vtop_bottom = _mm256_castsi256_si128( 
            _mm256_permute4x64_epi64(v, _MM_SHUFFLE(0,0, 3, 0)) );
    // 3 uops on Zen1, 2 on Zen2&3, 1 on Zen4 and Intel.

    __m256i vright = _mm256_slli_epi16(v, 15);
    __m256i vpacked = _mm256_packs_epi16(v, vright);   // pack now, shuffle bytes later.
    unsigned bits = _mm256_movemask_epi8(vpacked);    // [ left_hi | right_hi | left_lo | right_lo ]

    __m128i vtblr = _mm_insert_epi32(vtop_bottom, bits, 1);  // into an unused space
    // u16 elems: [ top | x | x | x | lh:rh | ll:rl | x | bottom ]
    __m128i shuffle = _mm_set_epi8(-1,-1,-1,-1, -1,-1,-1,-1,
                                   15,14, 1,0, 7,5, 6,4);
     vtblr = _mm_shuffle_epi8(vtblr, shuffle);
     return vtblr;   // low 4 u16 elements are (MSB) top | bottom | left | right  (LSB)
}
# GCC12.2 -O3 -march=haswell     clang is similar but has vzeroupper despite the caller passing a YMM, but no wasted vmovdqa
edges_no_pext_fast_vpermq(long long __vector(4)):
        vmovdqa ymm1, ymm0
        vpermq  ymm0, ymm0, 12
        vpsllw  ymm2, ymm1, 15
        vpacksswb       ymm1, ymm1, ymm2
        vpmovmskb       eax, ymm1
        vpinsrd xmm0, xmm0, eax, 1
        vpshufb xmm0, xmm0, XMMWORD PTR .LC1[rip]
        ret

On Intel Haswell/Skylake, this is 5 uops for port 5, plus a shift (p01) and vpmovmskb (p0). So 7 total uops. (Not counting the ret or the wasted vmovdqa that should go away with inlining.)

On Ice Lake and later, one of the uops from vpinsrd can run on p15, relieving one uop of pressure on that port if you're doing this in a loop. vpinsrd is single-uop on Alder Lake E-cores.

Ice Lake (and later) can also run vpshufb on p1/p5, further reducing port 5 pressure, down to 3 of the 7 uops. Port 5 can handle any shuffle, port 1 can handle some but not all shuffle uops. It may be hooked up to the upper half of the 512-bit shuffle unit to give extra throughput for some 256-bit and narrower shuffles, like how the p0/p1 FMA units work as a single 512-bit FMA unit on p0. It doesn't handle vpermq or vpacksswb; those are still p5 only on Ice/Alder Lake.

So this version is pretty reasonable on current-generation and future Intel CPUs. Alder Lake E-cores run vpermq ymm as 2 uops with 7 cycle latency. But if they can hide that latency with their more limited out-of-order scheduling (big ROB, but queues for each port aren't as long), running vpinsrd as a single uop helps make up the front-end throughput.

256-bit instructions like vpsllw ymm and vpacksswb ymm are also 2 uops each on Alder Lake E-cores, but vpmovmskb eax,ymm is 1 uop (but maybe high-ish latency). So even if we wanted to make a version optimized for Zen1 / Alder E, we probably can't save total uops on them by using more 128-bit instructions after vextracti128; we still need to do stuff to both halves of the input vector.


I had looked at packing so that vpmovmskb xmm would get each 16-bit group in the right order, but separately. I had considered doing this with vperm2i128, but that's quite slow on Zen 1.

//    __m256i vcombined = _mm256_permute2x128_si256(v, vright, 0x10);  // or something?  Takes two shuffles to get them ordered the right way for pack

Zen 1 has very fast vextracti128 - it's single-uop for any port, and 128-bit vector ops are 1 uop vs. 2 for __m256i operations. And we're already doing that extract to get the top and bottom together.

But it still leads to more scalar work, especially if you want the result combined in a vector. 2x vpinsrw, or an extra SHL/OR before vmovd, is worse.

#if 0
// Zen 1 has slow vperm2i128, but I didn't end up using it even if it's fast
    __m128i hi = _mm256_extracti128_si256(v, 1); // vextracti128 - very cheap on Zen1
    __m128i lo = _mm256_castsi256_si128(v);  // no cost
    __m128i vleft = _mm_packs_epi16(lo, hi);  // vpacksswb signed saturation, high bit of each word becomes high bit of byte

    // then shift 2 halves separately and pack again?
#endif

Vector packing to set up for vpmovmskb is probably the best bet; before thinking of that, I was looking at using vpmovmskb on the input directly and using scalar bithacks to take odd or even bits.

But those take more operations, so they're slower unless you're bottlenecked on SIMD ALUs specifically, not overall front-end throughput (or execution-port throughput on Intel where SIMD and scalar ALUs share ports).
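
For reference, the classic de-interleave bithack from those linked Q&As, applied to one 32-bit movemask at a time (untested sketch; the helper name is mine):

// pack the even-position bits of x into the low 16 bits
// (for the odd-position bits, call it with x >> 1)
static inline uint32_t compress_even_bits(uint32_t x)
{
    x &= 0x55555555;                    // keep even-position bits
    x = (x | (x >> 1)) & 0x33333333;
    x = (x | (x >> 2)) & 0x0F0F0F0F;
    x = (x | (x >> 4)) & 0x00FF00FF;
    x = (x | (x >> 8)) & 0x0000FFFF;
    return x;
}
// e.g. left  = compress_even_bits((unsigned)_mm256_movemask_epi8(v) >> 1);
//      right = compress_even_bits((unsigned)_mm256_movemask_epi8(_mm256_slli_epi16(v, 15)) >> 1);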


AVX-512 and/or GFNI

There are two interesting strategies here:

  • vpmovw2m and/or vptestmw / vptestmb as a more convenient vpmovmskb. Only requires AVX-512BW (Skylake-avx512)
  • Pack 8 bits to the bottom of each qword, then shuffle. Probably only good with GFNI + AVX512VBMI, like Ice Lake / Zen4 and later. Maybe just GFNI + AVX2 as in crippled Alder Lake (no AVX-512).

Extracting bits to a mask:

With one vptestmb against set1_epi16(0x8001), we can get all the bits we want into one mask, but then we need to deinterleave, probably with scalar pext (which is fast on all AVX-512 CPUs except maybe Knights Landing, but it doesn't have AVX-512BW).
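
Something like this, with v being the input vector as before (untested sketch; _mm256_test_epi8_mask is the intrinsic for vptestmb):

// one vptestmb: even mask bits = low bit of each word, odd mask bits = high bit
__mmask32 both = _mm256_test_epi8_mask(v, _mm256_set1_epi16(0x8001));
uint32_t left  = _pext_u32((uint32_t)both, 0xAAAAAAAAu);   // odd bits
uint32_t right = _pext_u32((uint32_t)both, 0x55555555u);   // even bits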

So probably better to extract two masks and concatenate. Except wait a minute, I don't see a great way to get a 32-bit mask into a vector register (without expanding it to a vector of 0 / -1 elements). For 8 and 16-bit masks, there's mask-to-vector broadcasts like vpbroadcastmw2d x/y/zmm, k. They don't support masking, so you can't merge-mask into another register. That's single-uop on Zen 4, but on Intel it costs 2 uops, same as kmov eax, k / vpbroadcastd x/y/zmm, eax, which is what you should do instead so you can merge-mask into the vector with the top and bottom edges.

  vpmovw2m k1, ymm0                        # left = 16 mask bits from high bits of 16 elements
  vptestmw k2, ymm0, set1_epi16(0x0001)    # right.   pseudocode constant
  kunpckwd k1, k1, k2                      # left:right
     # there's no  vpbroadcastmd2d  only byte/word mask to dword or qword element!
  
    mov    ecx, 0b0010
    kmovb  k7, ecx            # hoist this constant setup out of loops.  If not looping, maybe do something else, like bcast to another register and vpblendd.

  kmovd    eax, k1
  vpbroadcastd xmm0{k7}, eax  # put left:right into the 2nd element of XMM0
                              # leaving other unchanged (merge-masking)

Where xmm0 could have been set by vpermq to have top:bottom in the low 16 bytes; all CPUs with AVX-512 have efficient vpermq. So that's 1 more uop on top of the 5 from my hand-written asm (which should be straightforward to write with intrinsics; I just didn't feel like looking up the right intrinsics after finding the available asm instructions).
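
For what it's worth, an untested intrinsics sketch of the above (needs AVX-512BW + AVX-512VL; the function name is mine, and compilers may not emit exactly the asm shown):

__m128i edges_avx512bw(__m256i v)
{
    __m128i vtop_bottom = _mm256_castsi256_si128(
          _mm256_permute4x64_epi64(v, _MM_SHUFFLE(0,0, 3, 0)) );    // vpermq: 1 uop on AVX-512 CPUs

    __mmask16 left  = _mm256_movepi16_mask(v);                               // vpmovw2m
    __mmask16 right = _mm256_test_epi16_mask(v, _mm256_set1_epi16(0x0001));  // vptestmw
    __mmask32 lr    = _mm512_kunpackw(left, right);                          // kunpckwd: left:right

    // kmovd + merge-masked vpbroadcastd into dword element 1
    return _mm_mask_set1_epi32(vtop_bottom, 0b0010, (int)lr);
}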

Packing bits within qwords then shuffling: GFNI and probably AVX-512VBMI for vpermb

(Requiring AVX512VBMI means Ice Lake or Zen 4, so vpermb will be single-uop. Unless some future Intel CPU with an E-core supports a slower AVX-512, but still vpermb ymm hopefully wouldn't be too bad.)

Probably pack in left:right order (1 nibble each), then byte shuffle. If we can do left:right and right:left in alternating bytes, a byte shuffle (like vpermb or vpermt2b) should be able to set up for a vprolw to rotate within each 16-bit word to group 8 "left" bits in the right order.

Moving bits within a qword: Harold's answer on bitpack ascii string into 7-bit binary blob using SIMD shows _mm256_gf2p8affine_epi64_epi8 putting 1 bit from each byte at the top of each qword. (And packing the remaining 7-bit fields, which was the goal in that answer.)

If this is doable, it'll probably be fewer uops and significantly better latency than going to masks and back.

With Alder Lake (GFNI but AVX-512 disabled unless you manage to avoid Intel's efforts to cripple this amazing CPU), this might still be useful, since it has AVX+GFNI for _mm256_gf2p8affine_epi64_epi8. vpshufb + vpermd can substitute for vpermb. But you won't have word rotates; still, shuffling bytes like ABAB will let you use a plain left shift to get the window you wanted, and then shuffle again.

  • I have come to the conclusion that I am an utter newbie at AVX-style SIMD. I'm used to SWAR, which tends to have a rather different set of primitives. I'll have to digest this... there's a lot of useful info here. Thanks! – TLW Dec 31 '22 at 19:13
    And yes, `GF2P8AFFINEQB` in 'reverse' (constant 1st argument; input in 2nd argument) can be used to bit-transpose each qword (viewed as an 8x8 bitmatrix.) `GF2P8AFFINEQB` when available is great. – TLW Dec 31 '22 at 19:21