Probably your best bet for vectorizing this is vpackssdw / vpackuswb, with vpermd as a lane-crossing fixup after the in-lane packs:
- _mm256_srli_epi32 to shift the exponent (and sign bit) to the bottom of each 32-bit element. A logical shift leaves a non-negative result regardless of the sign bit.
- Then pack pairs of vectors down to 16-bit with _mm256_packs_epi32 (signed input, signed saturation of output). The shifted values are at most 511, so signed saturation can't clip anything here.
- Then mask off the sign bit, leaving an 8-bit exponent. We wait until now so we can do 16x uint16_t elements per instruction instead of 8x uint32_t. Now you have 16-bit elements holding values that fit in uint8_t without overflowing.
- Then pack pairs of vectors down to 8-bit with _mm256_packus_epi16 (signed input, unsigned saturation of output). This actually matters: _mm256_packs_epi16 with its signed saturation would clip some valid values, because your data uses the full range of uint8_t.
- vpermd to shuffle the eight 32-bit chunks of that vector, each of which came from one 128-bit lane of one of the 4x 256-bit input vectors. It's exactly the same __m256i lanefix = _mm256_permutevar8x32_epi32(abcd, _mm256_setr_epi32(0,4, 1,5, 2,6, 3,7)); shuffle as in How to convert 32-bit float to 8-bit signed char?, which does the same pack after using FP->int conversion instead of a right shift to grab the exponent field. (A sketch of the whole pipeline follows this list.)
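Putting those steps together, here's a minimal sketch of one 32-float iteration, assuming AVX2 and unaligned pointers (the function name and signature are mine, not from the question). It leaves the exponents biased; see the update further down about the extra add.

```c
#include <immintrin.h>
#include <stdint.h>

// Sketch only: pack the biased 8-bit exponent fields of 32 floats into 32 bytes.
// (Unbiasing is not done here; see the update below.)
static inline void exponents_32floats_avx2(const float *src, uint8_t *dst)
{
    // 4x load + logical right shift: exponent (and sign bit) at the bottom of each dword
    __m256i a = _mm256_srli_epi32(_mm256_castps_si256(_mm256_loadu_ps(src +  0)), 23);
    __m256i b = _mm256_srli_epi32(_mm256_castps_si256(_mm256_loadu_ps(src +  8)), 23);
    __m256i c = _mm256_srli_epi32(_mm256_castps_si256(_mm256_loadu_ps(src + 16)), 23);
    __m256i d = _mm256_srli_epi32(_mm256_castps_si256(_mm256_loadu_ps(src + 24)), 23);

    // 32 -> 16 bit in-lane packs (vpackssdw); inputs are 0..511 so signed saturation is safe
    __m256i ab = _mm256_packs_epi32(a, b);
    __m256i cd = _mm256_packs_epi32(c, d);

    // Mask away the sign bit now: 16 elements per vpand instead of 8
    const __m256i expmask = _mm256_set1_epi16(0x00FF);
    ab = _mm256_and_si256(ab, expmask);
    cd = _mm256_and_si256(cd, expmask);

    // 16 -> 8 bit in-lane pack (vpackuswb); unsigned saturation keeps the full 0..255 range
    __m256i abcd = _mm256_packus_epi16(ab, cd);

    // vpermd: lane-crossing fixup of the in-lane packs
    __m256i lanefix = _mm256_permutevar8x32_epi32(abcd,
                          _mm256_setr_epi32(0,4, 1,5, 2,6, 3,7));
    _mm256_storeu_si256((__m256i *)dst, lanefix);
}
```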
Per result vector, you have 4x load+shift (vpsrld ymm, [mem] hopefully), 2x vpackssdw shuffles, 2x vpand masks, 1x vpackuswb, and 1x vpermd. That's 4 shuffles, so the best we can hope for on Intel HSW/SKL is 1 result vector per 4 clocks. (Ryzen has better shuffle throughput, except for vpermd, which is expensive.) But that should be achievable, so 32 bytes of input / 8 bytes of output per clock on average.

The 10 total vector ALU uops (including the micro-fused load+ALU) and the 1 store should be able to execute in that time. We have room for 16 total uops, including loop overhead, before the front-end becomes a worse bottleneck than shuffles.
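As a usage sketch (my own hypothetical wrapper, assuming the element count is a multiple of 32 and skipping any cleanup loop), the driver around the helper above would just be:

```c
#include <stddef.h>

// Continues from the AVX2 sketch above: 128 bytes of float input -> 32 exponent
// bytes per iteration.  n is assumed to be a multiple of 32; no cleanup loop shown.
void exponents_avx2(const float *src, uint8_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i += 32)
        exponents_32floats_avx2(src + i, dst + i);
}
```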
update: oops, I forgot to count unbiasing the exponent; that will take an extra add. But you can do that after packing down to 8-bit (and maybe optimize it into an XOR). I don't think we can optimize it away or fold it into something else, like into masking away the sign bit.
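For instance (assuming it's the standard bias of 127 that needs subtracting, and that modulo-256 wraparound on an all-ones exponent field is acceptable), the extra instruction after the 8-bit pack could be:

```c
// Unbias after the 8-bit pack (assumed bias of 127); vpsubb wraps modulo 256.
__m256i unbiased = _mm256_sub_epi8(lanefix, _mm256_set1_epi8(127));
```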
With AVX512BW, you could do a byte-granularity vpaddb to unbias, with zero-masking to zero the high byte of each pair. That would fold the unbiasing into the 16-bit masking.
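A sketch of that, replacing the two vpand instructions from the AVX2 version (the mask constant and the assumed bias of 127 are mine):

```c
// AVX512BW + AVX512VL: ab and cd are the vpackssdw results from the AVX2 sketch,
// so each 16-bit element holds the biased exponent in its low byte and the sign
// bit in bit 8.  A zero-masked vpaddb adds -127 to the low byte of each pair and
// zeroes the high byte in the same uop.
__m256i bias     = _mm256_set1_epi16(0x0081);                     // -127 in each low byte
__m256i ab_fixed = _mm256_maskz_add_epi8(0x55555555u, ab, bias);  // keep even bytes only
__m256i cd_fixed = _mm256_maskz_add_epi8(0x55555555u, cd, bias);
```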
AVX512F also has vpmovdb 32->8 bit truncation (without saturation), but only for single inputs. So you'd get one 64-bit or 128-bit result from one 256 or 512-bit input vector, with 1 shuffle + 1 add per input instead of 2+1 shuffles + 2 zero-masked vpaddb per input vector. (Both need the right shift per input vector to align the 8-bit exponent field with a byte boundary at the bottom of a dword.)
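A sketch of the vpmovdb route for one 512-bit input (the function name and bias of 127 are my assumptions); note that truncating to the low byte drops the sign bit for free after the right shift:

```c
#include <immintrin.h>
#include <stdint.h>

// AVX512F sketch: one 512-bit vector of 16 floats -> 16 exponent bytes.
static inline void exponents_16floats_avx512f(const float *src, uint8_t *dst)
{
    // EVEX vpsrld can take the data operand from memory with an immediate count,
    // so this load+shift can hopefully stay a single micro-fused uop.
    __m512i v = _mm512_srli_epi32(_mm512_loadu_si512(src), 23);

    __m128i exp = _mm512_cvtepi32_epi8(v);          // vpmovdb: truncate dword -> byte, drops the sign bit
    exp = _mm_sub_epi8(exp, _mm_set1_epi8(127));    // unbias (assumed bias of 127)
    _mm_storeu_si128((__m128i *)dst, exp);
}
```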
With AVX512VBMI, vpermt2b would let us grab bytes from 2 input vectors. But it costs 2 uops on CannonLake, so it's only useful on hypothetical future CPUs if it gets cheaper. The bytes it grabs can be the top byte of a dword, so we could start with vpaddd of a vector to itself as a left shift by 1. But we're probably best off with an actual shift instruction, because the EVEX encoding of vpslld or vpsrld can take the data from memory with an immediate shift count, unlike the VEX encoding. So hopefully we get a single micro-fused load+shift uop to save front-end bandwidth.
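If vpermt2b ever becomes cheaper, a version built around it might look like the sketch below (the index layout, helper name, and 512-bit inputs are my assumptions; unbiasing is still left for a later vpaddb/vpsubb):

```c
#include <immintrin.h>
#include <stdint.h>

// AVX512VBMI sketch: 2x 512-bit vectors (32 floats) -> 32 exponent bytes.
static inline void exponents_32floats_vbmi(const float *src, uint8_t *dst)
{
    // Left shift by 1 puts the exponent in the top byte (bits 31:24) of each dword.
    __m512i a = _mm512_slli_epi32(_mm512_loadu_si512(src), 1);
    __m512i b = _mm512_slli_epi32(_mm512_loadu_si512(src + 16), 1);

    // vpermt2b index: byte 4*i+3 of the concatenated (a,b) pair, i.e. the top byte
    // of every dword from both inputs.  Bit 6 of each index selects a (0) or b (1).
    static const uint8_t idx_bytes[64] = {
          3,  7, 11, 15, 19, 23, 27, 31, 35, 39, 43, 47, 51, 55, 59, 63,
         67, 71, 75, 79, 83, 87, 91, 95, 99,103,107,111,115,119,123,127,
        // upper 32 result bytes are unused; only the low 256 bits get stored
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 };
    __m512i idx = _mm512_loadu_si512(idx_bytes);

    __m512i packed = _mm512_permutex2var_epi8(a, idx, b);   // vpermt2b (2 uops on CannonLake)
    _mm256_storeu_si256((__m256i *)dst, _mm512_castsi512_si256(packed));
}
```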
The other option is to shift + blend, resulting in byte-interleaved results that are more expensive to fix up, unless you don't mind that order.
And byte-granularity blending (without AVX512BW) requires vpblendvb, which is 2 uops. (And on Haswell it only runs on port 5, so it's potentially a huge bottleneck. On SKL it's 2 uops for any vector ALU port.)