In SIMD, If I have a simple algorithm written for 128-bit vectors like:
__m128 add_128(__m128 a, __m128 b) {
return _mm_add_ps(a, b);
}
All I have to do to make this work for 256-bit vectors is change the width to 256-bit and it works no questions asked:
__m256 add_256(__m256 a, __m256 b) {
return _mm256_add_ps(a, b);
}
I have an algorithm that uses _mm_packs_epi32 and _mm_unpacklo_epi16 to pack/unpack data, but merely switching to the 256-bit variants does not produce correct outputs.
The 128-bit variants:
__m128i pack_128(__m128i a, __m128i b) {
return _mm_packs_epi32(a, b);
}
__m128i unpack_128(__m128i a, __m128i b) {
return _mm_unpacklo_epi16(a, b);
}
The 256-bit variants that do not produce expected outputs:
__m256i pack_256(__m256i a, __m256i b) {
return _mm256_packs_epi32(a, b);
}
__m256i unpack_256(__m256i a, __m256i b) {
return _mm256_unpacklo_epi16(a, b);
}
For my 256-bit versions, I ended up extracting my inputs in 2 128-bit vectors each with _mm256_extracti128_si256
and executing the 128-bit intrinsic on them, then sticking each part back up into one 256-bit vector with _mm256_set_m128
. This is what I expect for the 256-bit versions to do:
__m256i pack_256(__m256i a, __m256i b) {
__m128i alo = _mm256_extracti128_si256(a, 0);
__m128i ahi = _mm256_extracti128_si256(a, 1);
__m128i blo = _mm256_extracti128_si256(b, 0);
__m128i bhi = _mm256_extracti128_si256(b, 1);
// Do the exact same behavior, but twice on the full 256-bit vector.
__m128i reslo = _mm_packs_epi32(alo, blo);
__m128i reshi = _mm_packs_epi32(ahi, bi);
return _mm256_set_m128(reshi, reslo);
}
I'm sure there's a better way to do this but I don't understand this enough to figure it out. What is special about the packing instructions that their logic does not scale up as I expect?