If I have a __m256i vector containing 32 unsigned 8-bit integers, how can I most efficiently unpack and cast that so I get four __m256 vectors, each containing eight 32-bit float numbers?
I suppose that, once I have them in 32-bit signed integer form, I can cast them to floats via _mm256_cvtepi32_ps, so the question probably boils down to how I can most efficiently go from the 8-bit unsigned integer (epu8) representation to the 32-bit signed integer (epi32) representation.
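For reference, the final cast step I have in mind is just this (a minimal sketch, assuming the eight values are already sitting in a __m256i as 32-bit signed integers; the helper name is made up):

```c
#include <immintrin.h>

/* Sketch: once eight values are available as 32-bit signed integers,
   a single intrinsic converts them to eight floats. */
static inline __m256 widen_to_float(__m256i v_epi32)
{
    return _mm256_cvtepi32_ps(v_epi32);
}
```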
There exists _mm256_cvtepu8_epi32(__m128i a), but that only seems to work on the lower (64-bit) half of a __m128i input, whereas I have a __m256i input.
Is there a better way than turning my __m256i input into two __m128i vectors via two calls to _mm256_extracti128_si256(__m256i a, const int imm8), then somehow swapping the upper and lower (64-bit) halves of each of those __m128i vectors (for a total of four __m128i vectors, each of which has a different 64-bit quarter of the initial __m256i vector in its bottom half), and then doing _mm256_cvtepu8_epi32(__m128i a) followed by _mm256_cvtepi32_ps(__m256i a) on each of them?
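For concreteness, the sequence I have in mind is roughly this (an untested sketch, assuming AVX2; the helper name and the byte-shift choice are mine):

```c
#include <immintrin.h>

/* Sketch of the approach described above: split the __m256i into two
   128-bit lanes, move each 64-bit quarter into the low half of a
   __m128i, widen the eight bytes to 32-bit integers, then convert
   each group of eight to float. */
static void u8x32_to_f32x8x4(__m256i in, __m256 out[4])
{
    __m128i lo = _mm256_extracti128_si256(in, 0);   /* bytes  0..15 */
    __m128i hi = _mm256_extracti128_si256(in, 1);   /* bytes 16..31 */

    out[0] = _mm256_cvtepi32_ps(_mm256_cvtepu8_epi32(lo));
    out[1] = _mm256_cvtepi32_ps(_mm256_cvtepu8_epi32(_mm_srli_si128(lo, 8)));
    out[2] = _mm256_cvtepi32_ps(_mm256_cvtepu8_epi32(hi));
    out[3] = _mm256_cvtepi32_ps(_mm256_cvtepu8_epi32(_mm_srli_si128(hi, 8)));
}
```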
That seems pretty messy, and I'm wondering if there's a better way. I'm entirely new to vector intrinsics, so I'm surely missing something here.
EDIT for more context:
So the setup is that I have three pairs of arrays, R1, G1, B1 and R2, G2, B2, of uint8_t pixel values, and the computation to be done is the sum of channel-wise squared differences, i.e. square(R1 - R2) + square(G1 - G2) + square(B1 - B2). The differences are currently computed in vectorised uint8_t form as max(R1, R2) - min(R1, R2) (etc.), such that 32 uint8_t differences can be computed at a time with a single _mm256_sub_epi8. My question kicks in after I've obtained these differences R_diff, G_diff and B_diff, and before squaring them, for which 8-bit integers are too small.
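To make the current stage concrete, the 8-bit difference step looks roughly like this (an untested sketch, assuming AVX2; the pointer and helper names are made up):

```c
#include <immintrin.h>
#include <stdint.h>

/* Sketch of the existing difference step: max/min of the unsigned bytes
   followed by a subtraction gives the absolute difference for all 32
   lanes at once, without any risk of 8-bit overflow. */
static inline __m256i abs_diff_u8(const uint8_t *a, const uint8_t *b)
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    return _mm256_sub_epi8(_mm256_max_epu8(va, vb), _mm256_min_epu8(va, vb));
}
```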