
If I have a __m256i vector containing 32 unsigned 8-bit integers, how can I most efficiently unpack and cast that so I get four __m256 vectors, each containing eight 32-bit float numbers?

I suppose that, once I have them in 32-bit signed integer form, I can cast them to floats via _mm256_cvtepi32_ps, so the question probably boils down to how I can most efficiently go from the 8-bit unsigned integer (epu8) representation to the signed 32-bit integer (epi32) representation.

There exists _mm256_cvtepu8_epi32(__m128i a) but that only seems to work on the lower (64-bit) half of a __m128i input, whereas I have a __m256i input.

Is there a better way than splitting my __m256i input into two __m128i vectors via _mm256_extracti128_si256(__m256i a, const int imm8), then shuffling the upper and lower (64-bit) halves of each (for a total of four __m128i vectors, each of which has a different 64-bit quarter of the initial __m256i vector in its bottom half), and then calling _mm256_cvtepu8_epi32(__m128i a) followed by _mm256_cvtepi32_ps(__m256i a) on each of them?

Seems pretty messy and I'm wondering if there's a better way. I'm entirely new to vector intrinsics so I'm surely missing something here.

EDIT for more context:

So the setup is that I have three pairs of arrays, R1, G1, B1 and R2, G2, B2, of uint8_t pixel values, and the computation to be done is the sum of channel-wise squared differences, i.e. square(R1 - R2) + square(G1 - G2) + square(B1 - B2). The differences are currently computed vectorised in uint8_t form as max(R1, R2) - min(R1, R2) (etc.), such that 32 uint8_t differences can be computed at a time in a single _mm256_sub_epi8. My question kicks in after I've obtained these differences R_diff, G_diff and B_diff and before squaring them, for which 8-bit integers are too small.

  • If that is exactly what you want, you likely can't do much better than those 3 shuffles, 4 `cvtepu8` and 4 `cvtepi32_ps`. If you give a bit of context (where does your input vector come from, and what do you do with the output vectors? Does the order of the output matter?), then there might be room for optimization. – chtz May 11 '23 at 17:25
  • A float is 4x wider than a `u8`, so you only need 4 vectors (of eight floats each), not 8 vectors. – Peter Cordes May 11 '23 at 17:26
  • Yes, the trick is just in how you shuffle. Does your data come from memory originally? If so you could just load it differently to set up for `vpmovzxbd`. If your data is already in a register after a computation, a mix of shuffle and store/reload is possible, e.g. `vextracti128` to a 16-byte stack buffer, then two `vpmovzxbd` reloads? Or a broadcast load to set up for different `vpshufb` shuffles. If you had AVX-512FP16 (Sapphire Rapids), there might be something to gain from unpacking to 16-bit integers for conversion, and then F16 to F32 convert+shuffle. But maybe not. – Peter Cordes May 11 '23 at 17:31
  • Are you doing a lot of these conversions in a loop over many vectors of data? If so, you can amortize the cost of loading some vector constants and get the high half unpacked with one `vpermq` to feed two in-lane `vpshufb` shuffles. (Or `vpmovzxbd` + `vpshufb`). So a total of maybe 1x `vpmovzxbd` + `vpermq` + `vpmovzxbd` + `vpshufb`, and one more `vextracti128` (perhaps to memory) + `vpmovzxbd` (from memory). This is assuming you want the output vectors to be in order; as chtz says it's cheaper to just mask and/or shift 4 different ways to get vectors of every 4th float. And cheap repack. – Peter Cordes May 11 '23 at 17:48
  • Related: [Loading 8 chars from memory into an \_\_m256 variable as packed single precision floats](https://stackoverflow.com/q/34279513) - loading data from memory, some discussion of shuffling a wider vector leading to bottlenecks on shuffle execution-unit throughput. And [SSE intrinsics: Convert 32-bit floats to UNSIGNED 8-bit integers](https://stackoverflow.com/q/29856006) (the reverse, packing i32 conversion results down to u8.) And [How to convert 32-bit float to 8-bit signed char? (4:1 packing of int32 to int8 \_\_m256i)](https://stackoverflow.com/q/51778721) for AVX packing. – Peter Cordes May 11 '23 at 17:56
  • So the setup is that I have three pairs of arrays, `R1`, `G1`, `B1` and `R2`, `G2`, `B2`, of `uint8_t` pixel values, and the computation to be done is the sum of channel-wise squared differences, i.e. `square(R1 - R2) + square(G1 - G2) + square(B1 - B2)`. The differences are currently computed vectorised in `uint8_t` form as `max(R1, R2) - min(R1, R2)` (etc.), such that 32 `uint8_t` differences can be computed in a single `_mm256_sub_epi8`. My question kicks in after I've obtained these differences `R_diff`, `G_diff` and `B_diff` and before squaring them, for which 8-bit integers are too small. – Mandelmus100 May 11 '23 at 22:05
  • Why would you want to convert to float for that? Widen to 16-bit for `pmaddwd` to do integer multiply and add pairs horizontally into 32-bit integer elements. Or if the range is limited enough, use `pmaddubsw` to do 8x8 => 16-bit signed x unsigned multiplies. https://www.felixcloutier.com/x86/pmaddubsw If necessary, deinterleave your data so you have an R next to an R and so on - planar data is easier to work with than packed. – Peter Cordes May 11 '23 at 22:48
  • To group data you want together, you might just blend instead of shuffling, since you're summing. Unfortunately we don't have 3-input blends until AVX512, so maybe shuffle within each input separately to group the Rs, the Gs, and the Bs or something. – Peter Cordes May 11 '23 at 23:32
  • Oh, three pairs of arrays, so your data is already planar. After max()-min(), probably just mask odd/even pairs with `_mm256_and_si256(red_diff, _mm256_set1_epi16(0x00ff))` and `_mm256_srli_epi16(red_diff, 8)`, then `_mm256_madd_epi16(red_odd, red_odd)` and so on, accumulating into 32-bit integer elements for `_mm256_add_epi32`. (If that might overflow over a huge array, then use an outer loop that converts to uint64_t or float32, every 2^16 vectors. The product of two 8-bit integers is less than 2^16, so your u32 elements have room for 2^16 sums.) – Peter Cordes May 12 '23 at 00:34

0 Answers