Probably your best bet for vectorizing this is vpackssdw / vpackuswb, with vpermd as a lane-crossing fixup after the in-lane packs:
- _mm256_srli_epi32 to shift the exponent (and sign bit) to the bottom of each 32-bit element. A logical shift leaves a non-negative result regardless of the sign bit.
- Then pack pairs of vectors down to 16-bit with _mm256_packs_epi32 (signed input, signed saturation of output). The shifted values are at most 511, so signed saturation can't clip anything here.
- Then mask off the sign bit, leaving an 8-bit exponent. We wait until now so we can do 16x uint16_t elements per instruction instead of 8x uint32_t. Now you have 16-bit elements holding values that fit in uint8_t without overflowing.
- Then pack pairs of vectors down to 8-bit with _mm256_packus_epi16 (signed input, unsigned saturation of output). This actually matters: _mm256_packs_epi16 with its signed saturation would clip some valid values, because your data uses the full range of uint8_t.
- vpermd to shuffle the eight 32-bit chunks of that vector, each of which came from one 128-bit lane of one of the 4x 256-bit input vectors. It's exactly the same __m256i lanefix = _mm256_permutevar8x32_epi32(abcd, _mm256_setr_epi32(0,4, 1,5, 2,6, 3,7)); shuffle as in How to convert 32-bit float to 8-bit signed char?, which does the same pack after using FP->int conversion instead of a right shift to grab the exponent field. (A sketch of the whole pipeline follows this list.)
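Putting those steps together, here's a minimal sketch of one 32-float iteration, assuming AVX2 and unaligned pointers (the function name and signature are mine, not from the question). It leaves the exponents biased; see the update further down about the extra add.

```c
#include <immintrin.h>
#include <stdint.h>

// Sketch only: pack the biased 8-bit exponent fields of 32 floats into 32 bytes.
// (Unbiasing is not done here; see the update below.)
static inline void exponents_32floats_avx2(const float *src, uint8_t *dst)
{
    // 4x load + logical right shift: exponent (and sign bit) at the bottom of each dword
    __m256i a = _mm256_srli_epi32(_mm256_castps_si256(_mm256_loadu_ps(src +  0)), 23);
    __m256i b = _mm256_srli_epi32(_mm256_castps_si256(_mm256_loadu_ps(src +  8)), 23);
    __m256i c = _mm256_srli_epi32(_mm256_castps_si256(_mm256_loadu_ps(src + 16)), 23);
    __m256i d = _mm256_srli_epi32(_mm256_castps_si256(_mm256_loadu_ps(src + 24)), 23);

    // 32 -> 16 bit in-lane packs (vpackssdw); inputs are 0..511 so signed saturation is safe
    __m256i ab = _mm256_packs_epi32(a, b);
    __m256i cd = _mm256_packs_epi32(c, d);

    // Mask away the sign bit now: 16 elements per vpand instead of 8
    const __m256i expmask = _mm256_set1_epi16(0x00FF);
    ab = _mm256_and_si256(ab, expmask);
    cd = _mm256_and_si256(cd, expmask);

    // 16 -> 8 bit in-lane pack (vpackuswb); unsigned saturation keeps the full 0..255 range
    __m256i abcd = _mm256_packus_epi16(ab, cd);

    // vpermd: lane-crossing fixup of the in-lane packs
    __m256i lanefix = _mm256_permutevar8x32_epi32(abcd,
                          _mm256_setr_epi32(0,4, 1,5, 2,6, 3,7));
    _mm256_storeu_si256((__m256i *)dst, lanefix);
}
```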
Per result vector, you have 4x load+shift (vpsrld ymm, [mem] hopefully), 2x vpackssdw shuffles, 2x vpand masks, 1x vpackuswb, and 1x vpermd. That's 4 shuffles, so the best we can hope for on Intel HSW/SKL is 1 result vector per 4 clocks. (Ryzen has better shuffle throughput, except for vpermd, which is expensive.) But that should be achievable, so 32 bytes of input / 8 bytes of output per clock on average.

The 10 total vector ALU uops (including the micro-fused load+ALU) and the 1 store should be able to execute in that time. We have room for 16 total uops, including loop overhead, before the front-end becomes a worse bottleneck than shuffles.
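As a usage sketch (my own hypothetical wrapper, assuming the element count is a multiple of 32 and skipping any cleanup loop), the driver around the helper above would just be:

```c
#include <stddef.h>

// Continues from the AVX2 sketch above: 128 bytes of float input -> 32 exponent
// bytes per iteration.  n is assumed to be a multiple of 32; no cleanup loop shown.
void exponents_avx2(const float *src, uint8_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i += 32)
        exponents_32floats_avx2(src + i, dst + i);
}
```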
update: oops, I forgot to count unbiasing the exponent; that will take an extra add. But you can do that after packing down to 8-bit (and maybe optimize it into an XOR). I don't think we can optimize it away or fold it into something else, like into masking away the sign bit.
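For instance (assuming it's the standard bias of 127 that needs subtracting, and that modulo-256 wraparound on an all-ones exponent field is acceptable), the extra instruction after the 8-bit pack could be:

```c
// Unbias after the 8-bit pack (assumed bias of 127); vpsubb wraps modulo 256.
__m256i unbiased = _mm256_sub_epi8(lanefix, _mm256_set1_epi8(127));
```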
With AVX512BW, you could do a byte-granularity vpaddb to unbias, with zero-masking to zero the high byte of each pair. That would fold the unbiasing into the 16-bit masking.
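A sketch of that, replacing the two vpand instructions from the AVX2 version (the mask constant and the assumed bias of 127 are mine):

```c
// AVX512BW + AVX512VL: ab and cd are the vpackssdw results from the AVX2 sketch,
// so each 16-bit element holds the biased exponent in its low byte and the sign
// bit in bit 8.  A zero-masked vpaddb adds -127 to the low byte of each pair and
// zeroes the high byte in the same uop.
__m256i bias     = _mm256_set1_epi16(0x0081);                     // -127 in each low byte
__m256i ab_fixed = _mm256_maskz_add_epi8(0x55555555u, ab, bias);  // keep even bytes only
__m256i cd_fixed = _mm256_maskz_add_epi8(0x55555555u, cd, bias);
```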
AVX512F also has vpmovdb 32->8 bit truncation (without saturation), but only for single inputs. So you'd get one 64-bit or 128-bit result from one 256 or 512-bit input vector, with 1 shuffle + 1 add per input instead of 2+1 shuffles + 2 zero-masked vpaddb per input vector. (Both need the right shift per input vector to align the 8-bit exponent field with a byte boundary at the bottom of a dword.)
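A sketch of the vpmovdb route for one 512-bit input (the function name and bias of 127 are my assumptions); note that truncating to the low byte drops the sign bit for free after the right shift:

```c
#include <immintrin.h>
#include <stdint.h>

// AVX512F sketch: one 512-bit vector of 16 floats -> 16 exponent bytes.
static inline void exponents_16floats_avx512f(const float *src, uint8_t *dst)
{
    // EVEX vpsrld can take the data operand from memory with an immediate count,
    // so this load+shift can hopefully stay a single micro-fused uop.
    __m512i v = _mm512_srli_epi32(_mm512_loadu_si512(src), 23);

    __m128i exp = _mm512_cvtepi32_epi8(v);          // vpmovdb: truncate dword -> byte, drops the sign bit
    exp = _mm_sub_epi8(exp, _mm_set1_epi8(127));    // unbias (assumed bias of 127)
    _mm_storeu_si128((__m128i *)dst, exp);
}
```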
With AVX512VBMI, vpermt2b would let us grab bytes from 2 input vectors. But it costs 2 uops on CannonLake, so it's only useful on hypothetical future CPUs if it gets cheaper. The bytes it grabs can be the top byte of a dword, so we could start with vpaddd of a vector to itself as a left shift by 1. But we're probably best off with an actual shift instruction, because the EVEX encoding of vpslld or vpsrld can take the data from memory with an immediate shift count, unlike the VEX encoding. So hopefully we get a single micro-fused load+shift uop to save front-end bandwidth.
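If vpermt2b ever becomes cheaper, a version built around it might look like the sketch below (the index layout, helper name, and 512-bit inputs are my assumptions; unbiasing is still left for a later vpaddb/vpsubb):

```c
#include <immintrin.h>
#include <stdint.h>

// AVX512VBMI sketch: 2x 512-bit vectors (32 floats) -> 32 exponent bytes.
static inline void exponents_32floats_vbmi(const float *src, uint8_t *dst)
{
    // Left shift by 1 puts the exponent in the top byte (bits 31:24) of each dword.
    __m512i a = _mm512_slli_epi32(_mm512_loadu_si512(src), 1);
    __m512i b = _mm512_slli_epi32(_mm512_loadu_si512(src + 16), 1);

    // vpermt2b index: byte 4*i+3 of the concatenated (a,b) pair, i.e. the top byte
    // of every dword from both inputs.  Bit 6 of each index selects a (0) or b (1).
    static const uint8_t idx_bytes[64] = {
          3,  7, 11, 15, 19, 23, 27, 31, 35, 39, 43, 47, 51, 55, 59, 63,
         67, 71, 75, 79, 83, 87, 91, 95, 99,103,107,111,115,119,123,127,
        // upper 32 result bytes are unused; only the low 256 bits get stored
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 };
    __m512i idx = _mm512_loadu_si512(idx_bytes);

    __m512i packed = _mm512_permutex2var_epi8(a, idx, b);   // vpermt2b (2 uops on CannonLake)
    _mm256_storeu_si256((__m256i *)dst, _mm512_castsi512_si256(packed));
}
```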
The other option is to shift + blend, resulting in byte-interleaved results that are more expensive to fix up, unless you don't mind that order.
And byte-granularity blending (without AVX512BW) requires vpblendvb, which is 2 uops. (And on Haswell it only runs on port 5, so it's potentially a huge bottleneck. On SKL it's 2 uops for any vector ALU port.)