With fast BMI2 pext
(Haswell or Zen 3 and later), that's one option if you start with vpmovmskb + shift + vpmovmskb to get the bits of the edges (interleaved with garbage bits, since we want every 16th but we get every 8th).
9 uops for the front-end, 3 of them needing port 5 on Intel Skylake-family. (Not counting the integer constant setup, assuming you'd do this in a loop. If not, that also counts against this.)
__m128i edges_zen3_intel(__m256i v)
{
    __m128i vtop_bottom = _mm256_castsi256_si128(
                 _mm256_permute4x64_epi64(v, _MM_SHUFFLE(0,0, 3, 0)) );
    // vpermq: 3 uops on Zen1, 2 on Zen2&3, 1 on Zen4 and Intel.

    // side bits interleaved with garbage
    // without AVX-512 we can only extract a bit per byte, dword, or qword
    unsigned left = _mm256_movemask_epi8(v);                            // high bit of each element
    unsigned right = _mm256_movemask_epi8( _mm256_slli_epi16(v, 15) );  // low<<15

    //  left = _pext_u32(left, 0xAAAAAAAAul);    // take every other bit starting with #1
    //  right = _pext_u32(right, 0xAAAAAAAAul);
    // then combine or do whatever

    uint64_t lr = ((uint64_t)left << 32) | right;
    lr = _pext_u64(lr, 0xAAAAAAAAAAAAAAAAull);

    //__m128i vsides = _mm_cvtsi32_si128(lr);
    __m128i vtblr = _mm_insert_epi32(vtop_bottom, lr, 1);   // into an unused space
    // u16 elems: [ top | x | x | x | left | right | x | bottom ]
    return vtblr;
}
This compiles to 10 uops for Intel CPUs (and Zen 4), including getting everything back into one SIMD vector. The movabs can be hoisted out of loops. SHL/OR don't compete for SIMD execution-port throughput (able to run on port 6 on Intel), but do compete for the front-end. (Godbolt)
# Haswell/Skylake uop counts
edges_zen3_intel(long long __vector(4)):
vpsllw ymm2, ymm0, 15 # p0 (or p01 on Skylake)
vpmovmskb eax, ymm0 # p0
vpermq ymm1, ymm0, 12 # p5
vpmovmskb edx, ymm2 # p0
sal rax, 32 # p06
or rax, rdx # p0156
movabs rdx, -6148914691236517206 # p0156 (and can be hoisted out of loops)
pext rax, rax, rdx # p1
vpinsrd xmm0, xmm1, eax, 1 # 2 p5. On Intel, both uops compete with shuffles
ret
As a variation, we could maybe get left and right edges together with one vpmovmskb, if we could left-shift the odd bytes but not the evens? Probably not: _mm256_maddubs_epi16 with _mm256_set1_epi16(0x0180) can't do that, since it adds horizontal pairs, and a left-shift of 7 (0x80 = 1<<7) isn't enough; we'd need 8 to get the top bit back to the top.

Or if we vpsllw + vpacksswb, then use the right masks to group bits, like 0x00ff00ff. But that's getting closer to my non-pext idea below, and maybe better even if we do have fast pext.
Without fast BMI2 pext - saturating pack the vector to reduce to 8-bit elements

This might be faster even if pext is fast.

Packing with signed saturation always preserves the sign bit, so you can narrow 16-bit to 8-bit without losing the information you want to keep. We want to do this to the high and low bit of each word (16-bit element), so a 2:1 pack with the original and v<<15 is perfect.

Except for the fact that AVX2 vpacksswb ymm is two separate in-lane pack operations, so we end up with 8-element chunks interleaved. We could fix that up right after packing with vpermq, but it's multiple uops on Zen 1 through Zen 3, and we can instead shuffle bytes after getting the movemask result back into a vector register. (The same vpshufb can move around the high and low elements.)
// avoiding PEXT because it's slow on Zen 2 and Zen 1 (and Excavator)
// This might be good on Intel and Zen 3, maybe comparable to using PEXT
__m128i edges_no_pext(__m256i v)
{
    __m128i vhi = _mm256_extracti128_si256(v, 1);  // contains top, as vhi.u16[7]
    __m128i vlo = _mm256_castsi256_si128(v);       // contains bottom, as vlo.u16[0], contiguous if concatenated the right way
    __m128i bottom_top = _mm_alignr_epi8(vlo, vhi, 10);  // rotate bottom:top down to the 2nd dword  [ x | x | bottom:top | x]
    // vpermq ymm, ymm, imm would also work to get them into the low 128
    // but that's 3 uops on Zen1, 2 on Zen2&3, 1 on Zen4 and Intel.
    // and would need a slightly more expensive vpinsrd instead of vmovd+vpblendd
    // On Intel CPUs (and Zen4) vpermq is better; we pshufb later so we can get the bytes where we want them.

    // A compromise is to use vextracti128+vpblendd here, vpinsrd later
    //  __m128i bottom_top = _mm_blend_epi32(vhi, vlo, 0b0001);
    //  [ hi | x | x | x | x | x | x | lo ]

    __m256i vright = _mm256_slli_epi16(v, 15);
    __m256i vpacked = _mm256_packs_epi16(v, vright);  // pack now, shuffle bytes later.
    unsigned bits = _mm256_movemask_epi8(vpacked);    // [ left_hi | right_hi | left_lo | right_lo ]

    __m128i vsides = _mm_cvtsi32_si128(bits);
    __m128i vtblr = _mm_blend_epi32(bottom_top, vsides, 0b0001);  // vpinsrd xmm0, eax, 0 but the merge can run on more ports
    __m128i shuffle = _mm_set_epi8(-1,-1,-1,-1, -1,-1,-1,-1,
                                    7,6,5,4,  3,1, 2,0);
    // swap middle 2 bytes of the low dword, fixing up the in-lane pack
    vtblr = _mm_shuffle_epi8(vtblr, shuffle);
    return vtblr;   // low 4 u16 elements are (MSB) top | bottom | left | right (LSB)
}
This compiles pretty nicely (see earlier Godbolt link), although GCC4.9 and later (and clang) pessimize my vmovd + vpblendd into vpinsrd, even with -march=haswell or Skylake where it's 2 uops for port 5 (https://uops.info/), when most of the other instructions in the function are also shuffles that only run on port 5. (This is much more shuffle-heavy for Intel CPUs.)
Using vpblendd instead of vpalignr would make it less bad for Intel, like __m128i bottom_top = _mm_blend_epi32(vhi, vlo, 0b0001); to get to the same situation as in the vpermq version below, with 2 uops even on Zen 1. But that only saves 1 uop on Zen 1 (vs. vpermq) and is equal or worse everywhere else.
# GCC12 -O3 -march=haswell
# uop counts for Skylake
edges_no_pext:
vextracti128 xmm1, ymm0, 0x1 # p5
vpsllw ymm2, ymm0, 15 # p01
vpalignr xmm1, xmm0, xmm1, 10 # p5
vpacksswb ymm0, ymm0, ymm2 # p5
vpmovmskb eax, ymm0 # p0
vpinsrd xmm0, xmm1, eax, 0 # 2 p5
vpshufb xmm0, xmm0, XMMWORD PTR .LC0[rip] # p5
ret
So that's 6 uops for port 5 on Intel, a throughput bottleneck of 1 per 6 cycles, vs. the PEXT version being 3 uops that need port 0 and 3 that need port 5. But this is only 8 total uops for the front-end, vs. 9 for the pext version. And the vpermq version saves one more on Intel, assuming GCC doesn't waste the vmovdqa after inlining.
If you didn't care about zeroing the upper 8 bytes of the output vector, the shuffle constant could be loaded with vmovq and just be 8 bytes instead of 16: the load zero-extends, so the upper 8 control bytes become 0 and the upper output bytes end up as copies of source byte 0 instead of zeros. But compilers will probably not spot that optimization.
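For example, an untested sketch of hand-rolling that 8-byte load (the helper name is just for illustration; the constant is the low half of the shuffle control used above, written with the lowest byte first; needs <immintrin.h> and <stdint.h>):

// Hypothetical variant of the tail of edges_no_pext: load only 8 bytes of shuffle control.
// With a vmovq load the upper 8 control bytes are 0, so the upper 8 output bytes become
// copies of source byte 0 instead of being zeroed - only OK if the caller ignores them.
static inline __m128i shuffle_low8(__m128i vtblr)
{
    static const uint64_t shuf_lo = 0x0706050403010200ull;        // control bytes 0,2,1,3, 4,5,6,7
    __m128i shuffle = _mm_loadl_epi64((const __m128i*)&shuf_lo);  // vmovq load, upper half zeroed
    return _mm_shuffle_epi8(vtblr, shuffle);
}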
Since compilers insist on pessimizing to vpinsrd, on CPUs with fast vpermq (Intel and Zen 4) we might as well use that. If you're only going to have one non-GFNI AVX2 version, this is probably a good tradeoff: vpermq being 3 uops on Zen 1 isn't much worse than emulating what we need from it using 2 instructions, and the 2-instruction emulation is worse on Intel CPUs. And it's probably about break-even on Zen 2 and Zen 3, modulo differences in back-end port usage.
// for fast vpermq, especially if compilers are going to pessimize vmovd(p5)+vpblendd (p015) into vpinsrd (2p5).
// good on Intel and Zen 4, maybe also Zen 3 and not bad on Zen 2.
__m128i edges_no_pext_fast_vpermq(__m256i v)
{
    __m128i vtop_bottom = _mm256_castsi256_si128(
                 _mm256_permute4x64_epi64(v, _MM_SHUFFLE(0,0, 3, 0)) );
    // 3 uops on Zen1, 2 on Zen2&3, 1 on Zen4 and Intel.

    __m256i vright = _mm256_slli_epi16(v, 15);
    __m256i vpacked = _mm256_packs_epi16(v, vright);  // pack now, shuffle bytes later.
    unsigned bits = _mm256_movemask_epi8(vpacked);    // [ left_hi | right_hi | left_lo | right_lo ]

    __m128i vtblr = _mm_insert_epi32(vtop_bottom, bits, 1);   // into an unused space
    // u16 elems: [ top | x | x | x | lh:rh | ll:rl | x | bottom ]
    __m128i shuffle = _mm_set_epi8(-1,-1,-1,-1, -1,-1,-1,-1,
                                   15,14,  1,0,  7,5, 6,4);
    vtblr = _mm_shuffle_epi8(vtblr, shuffle);
    return vtblr;   // low 4 u16 elements are (MSB) top | bottom | left | right (LSB)
}
# GCC12.2 -O3 -march=haswell.  clang is similar, but has a vzeroupper despite the caller passing a YMM, and no wasted vmovdqa
edges_no_pext_fast_vpermq(long long __vector(4)):
vmovdqa ymm1, ymm0
vpermq ymm0, ymm0, 12
vpsllw ymm2, ymm1, 15
vpacksswb ymm1, ymm1, ymm2
vpmovmskb eax, ymm1
vpinsrd xmm0, xmm0, eax, 1
vpshufb xmm0, xmm0, XMMWORD PTR .LC1[rip]
ret
On Intel Haswell/Skylake, this is 5 uops for port 5, plus a shift (p01) and vpmovmskb (p0). So 7 total uops. (Not counting the ret or the wasted vmovdqa that should go away with inlining.)
On Ice Lake and later, one of the uops from vpinsrd can run on p15, relieving one uop of pressure on that port if you're doing this in a loop. vpinsrd is single-uop on Alder Lake E-cores.

Ice Lake (and later) can also run vpshufb on p1/p5, further reducing port 5 pressure, down to 3 of the 7 uops. Port 5 can handle any shuffle; port 1 can handle some but not all shuffle uops. It may be hooked up to the upper half of the 512-bit shuffle unit to give extra throughput for some 256-bit and narrower shuffles, like how the p0/p1 FMA units work as a single 512-bit FMA unit on p0. It doesn't handle vpermq or vpacksswb; those are still p5-only on Ice Lake / Alder Lake.
So this version is pretty reasonable on current-generation and future Intel CPUs. Alder Lake E-cores run vpermq ymm as 2 uops with 7-cycle latency. But if they can hide that latency with their more limited out-of-order scheduling (big ROB, but queues for each port aren't as long), running vpinsrd as a single uop helps make up the front-end throughput.

256-bit instructions like vpsllw ymm and vpacksswb ymm are also 2 uops each on Alder Lake E-cores, but vpmovmskb eax, ymm is 1 uop (but maybe high-ish latency). So even if we wanted to make a version optimized for Zen 1 / Alder Lake E-cores, we probably can't save total uops on them by using more 128-bit instructions after vextracti128; we still need to do stuff to both halves of the input vector.
I had looked at packing into the right order for vpmovmskb xmm, to get each 16-bit group in the right order but separately. I had considered doing this with vperm2i128, but that's quite slow on Zen 1.
// __m256i vcombined = _mm256_permute2x128_si256(v, vright, 0x10); // or something? Takes two shuffles to get them ordered the right way for pack
Zen 1 has very fast vextracti128 - it's single-uop for any port, and 128-bit vector ops are 1 uop vs. 2 for __m256i operations. And we're already doing that extract to get the top and bottom together.

But it still leads to more scalar work, especially if you want the result combined in a vector: 2x vpinsrw, or an extra SHL/OR before vmovd, is worse (fleshed out in the sketch after the #if 0 fragment below).
#if 0
// Zen 1 has slow vperm2i128, but I didn't end up using it even if it's fast
__m128i hi = _mm256_extracti128_si256(v, 1); // vextracti128 - very cheap on Zen1
__m128i lo = _mm256_castsi256_si128(v); // no cost
__m128i vleft = _mm_packs_epi16(lo, hi); // vpacksswb signed saturation, high bit of each word becomes high bit of byte
// then shift 2 halves separately and pack again?
#endif
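Fleshing that fragment out, an untested sketch of the 128-bit-lane version might look like this (the function name is just for illustration). Counting uops - extract, blend, two shifts, two packs, two vpmovmskb, SHL/OR, and vpinsrd - it costs more front-end uops than edges_no_pext, even though each vector op is single-uop on Zen 1:

// Untested sketch: in-order 128-bit packs, two vpmovmskb xmm, then scalar recombine.
// More total uops than edges_no_pext, but all the vector ops are single-uop on Zen 1.
__m128i edges_128bit_packs(__m256i v)
{
    __m128i hi = _mm256_extracti128_si256(v, 1);   // vextracti128: very cheap on Zen 1
    __m128i lo = _mm256_castsi256_si128(v);
    __m128i vtop_bottom = _mm_blend_epi32(hi, lo, 0b0001);  // bottom in u16[0], top in u16[7]

    __m128i vleft  = _mm_packs_epi16(lo, hi);      // sign bit of each word, rows in source order
    __m128i vright = _mm_packs_epi16(_mm_slli_epi16(lo, 15), _mm_slli_epi16(hi, 15));
    unsigned left  = _mm_movemask_epi8(vleft);     // 16-bit left edge, no deinterleave needed
    unsigned right = _mm_movemask_epi8(vright);    // 16-bit right edge

    // SHL/OR + vpinsrd; same u16 layout as edges_zen3_intel
    return _mm_insert_epi32(vtop_bottom, (int)((left << 16) | right), 1);
}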
Vector packing to set up for vpmovmskb is probably the best bet; before thinking of that, I was looking at using vpmovmskb on the input directly and using scalar bithacks to take odd or even bits. But those take more operations, so they're slower unless you're bottlenecked on SIMD ALUs specifically, not overall front-end throughput (or execution-port throughput on Intel, where SIMD and scalar ALUs share ports).
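For reference, an untested sketch of that scalar bithack route (same output layout as edges_zen3_intel; the helper name is just for illustration). Each mask needs roughly a dozen scalar shift/AND/OR operations to gather the odd bits, which is where the extra uops come from:

// Gather the odd-numbered bits of a 32-bit mask into the low 16 bits (instead of pext).
static inline unsigned gather_odd_bits(uint32_t x)
{
    x = (x >> 1) & 0x55555555;          // odd bits down to even positions
    x = (x | (x >> 1)) & 0x33333333;    // group into pairs
    x = (x | (x >> 2)) & 0x0f0f0f0f;    // nibbles
    x = (x | (x >> 4)) & 0x00ff00ff;    // bytes
    x = (x | (x >> 8)) & 0x0000ffff;    // halves
    return x;
}

__m128i edges_scalar_bithack(__m256i v)
{
    __m128i vtop_bottom = _mm256_castsi256_si128(
                 _mm256_permute4x64_epi64(v, _MM_SHUFFLE(0,0, 3, 0)) );
    unsigned left  = gather_odd_bits(_mm256_movemask_epi8(v));                         // high bit of each u16
    unsigned right = gather_odd_bits(_mm256_movemask_epi8(_mm256_slli_epi16(v, 15)));  // low bit of each u16
    // same layout as the pext version: [ top | x | x | x | left | right | x | bottom ]
    return _mm_insert_epi32(vtop_bottom, (int)((left << 16) | right), 1);
}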
AVX-512 and/or GFNI
There are two interesting strategies here:
- vpmovw2m and/or vptestmw or vptestmb as a more convenient vpmovmskb. Only requires AVX-512BW (Skylake-avx512).
- Pack 8 bits to the bottom of each qword, then shuffle. Probably only good with GFNI + AVX512VBMI, like Ice Lake / Zen4 and later. Maybe just GFNI + AVX2 as in crippled Alder Lake (no AVX-512).
Extracting bits to a mask:
With one vptestmb with set1_epi16(0x8001), we can get all the bits we want into one mask, but then we need to deinterleave, probably with scalar pext (which is fast on all AVX-512 CPUs except maybe Knight's Landing, but that doesn't have AVX-512BW anyway).
So probably better to extract two masks and concatenate. Except wait a minute, I don't see a great way to get a 32-bit mask into a vector register (without expanding it to a vector of 0 / -1 elements). For 8 and 16-bit masks, there are mask-to-vector broadcasts like vpbroadcastmw2d x/y/zmm, k. They don't support masking, so you can't merge-mask into another register. That's single-uop on Zen 4, but on Intel it costs 2 uops, the same as kmov eax, k / vpbroadcastd x/y/zmm, eax, which is what you should do instead so you can merge-mask into the vector with the top and bottom edges.
vpmovw2m k1, ymm0 # left = 16 mask bits from high bits of 16 elements
vptestmw k2, ymm0, set1_epi16(0x0001) # right. pseudocode constant
kunpckwd k1, k1, k2 # left:right
# there's no vpbroadcastmd2d, only byte/word mask to dword or qword element!
mov ecx, 0b0010
kmovb k7, ecx # hoist this constant setup out of loops. If not looping, maybe do something else, like bcast to another register and vpblendd.
kmovd eax, k1
vpbroadcastd xmm0{k7}, eax # put left:right into the 2nd element of XMM0
# leaving other unchanged (merge-masking)
Where xmm0 could have been set by vpermq to have top:bottom in the low 16 bytes; all CPUs with AVX-512 have efficient vpermq. So that's 1 more uop on top of the 5 from my hand-written asm (which should be straightforward to write with intrinsics; I just didn't feel like taking the extra step of looking up the right intrinsics after finding the available asm instructions).
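An untested intrinsics sketch of that asm (the function name is just for illustration; it assumes AVX-512BW + AVX-512VL and uses the same top/bottom layout as the earlier versions; compilers won't necessarily emit exactly the instruction sequence above):

__m128i edges_avx512bw(__m256i v)
{
    // top and bottom into the low 128 bits; vpermq is single-uop on all AVX-512 CPUs
    __m128i vtop_bottom = _mm256_castsi256_si128(
                 _mm256_permute4x64_epi64(v, _MM_SHUFFLE(0,0, 3, 0)) );
    __mmask16 left  = _mm256_movepi16_mask(v);                          // vpmovw2m: high bit of each word
    __mmask16 right = _mm256_test_epi16_mask(v, _mm256_set1_epi16(1));  // vptestmw: low bit of each word
    __mmask32 lr    = _mm512_kunpackw(left, right);                     // kunpckwd: left:right
    // kmovd + merge-masked vpbroadcastd into dword #1, next to top and bottom
    // u16 elems: [ top | x | x | x | left | right | x | bottom ]
    return _mm_mask_set1_epi32(vtop_bottom, 0b0010, (int)_cvtmask32_u32(lr));
}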
Packing bits within qwords then shuffling: GFNI and probably AVX-512VBMI for vpermb
(Requiring AVX512VBMI means Ice Lake or Zen 4, so vpermb will be single-uop. Unless some future Intel CPU with an E-core supports a slower AVX-512, but even then vpermb ymm hopefully wouldn't be too bad.)
Probably pack in left:right order (1 nibble each), then byte shuffle. If we can do left:right and right:left in alternating bytes, a byte shuffle (like vpermb or vpermt2b) should be able to set up for a vprolw to rotate within each 16-bit word to group 8 "left" bits in the right order.
Moving bits within a qword: Harold's answer on bitpack ascii string into 7-bit binary blob using SIMD shows _mm256_gf2p8affine_epi64_epi8 putting 1 bit from each byte at the top of each qword (and packing the remaining 7-bit fields, which was the goal in that answer). If this is doable, it'll probably be fewer uops and significantly better latency than going to masks and back.
With Alder Lake (GFNI but AVX-512 disabled, unless you manage to avoid Intel's efforts to cripple this amazing CPU), this might still be useful, since it has AVX + GFNI for _mm256_gf2p8affine_epi64_epi8. vpshufb + vpermd can substitute for vpermb. But you won't have word rotates; still, shuffling bytes like ABAB will let you use a plain left shift to get the window you wanted, and then shuffle again.