
In SIMD, if I have a simple algorithm written for 128-bit vectors like:

__m128 add_128(__m128 a, __m128 b) {
    return _mm_add_ps(a, b);
}

All I have to do to make it work for 256-bit vectors is change the width to 256 bits, and it works, no questions asked:

__m256 add_256(__m256 a, __m256 b) {
    return _mm256_add_ps(a, b);
}

I have an algorithm that uses _mm_packs_epi32 and _mm_unpacklo_epi16 to pack/unpack data, but merely switching to the 256-bit variants does not produce correct outputs.

The 128-bit variants:

__m128i pack_128(__m128i a, __m128i b) {
    return _mm_packs_epi32(a, b);
}

__m128i unpack_128(__m128i a, __m128i b) {
    return _mm_unpacklo_epi16(a, b);
}

The 256-bit variants that do not produce expected outputs:

__m256i pack_256(__m256i a, __m256i b) {
    return _mm256_packs_epi32(a, b);
}

__m256i unpack_256(__m256i a, __m256i b) {
    return _mm256_unpacklo_epi16(a, b);
}

For my 256-bit versions, I ended up extracting each input into two 128-bit vectors with _mm256_extracti128_si256, executing the 128-bit intrinsic on them, and then sticking the two halves back together into one 256-bit vector with _mm256_set_m128i. This is what I expect the 256-bit versions to do:

__m256i pack_256(__m256i a, __m256i b) {

    // Split each 256-bit input into its two 128-bit halves.
    __m128i alo = _mm256_extracti128_si256(a, 0);
    __m128i ahi = _mm256_extracti128_si256(a, 1);
    __m128i blo = _mm256_extracti128_si256(b, 0);
    __m128i bhi = _mm256_extracti128_si256(b, 1);

    // Do the exact same 128-bit behavior, twice, to cover the full 256-bit vectors.
    __m128i reslo = _mm_packs_epi32(alo, blo);
    __m128i reshi = _mm_packs_epi32(ahi, bhi);

    return _mm256_set_m128i(reshi, reslo);

}
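
Following the same pattern, a sketch of the matching unpack version (assuming the scaled-up behavior I expect only reads the low 128-bit halves of the full inputs; the name unpack_256_split is just to avoid clashing with the variant above):

__m256i unpack_256_split(__m256i a, __m256i b) {

    // Only the low 128-bit halves contribute to a whole-register unpacklo.
    __m128i alo = _mm256_extracti128_si256(a, 0);
    __m128i blo = _mm256_extracti128_si256(b, 0);

    // Interleave them: the first four pairs land in the low 128 bits,
    // the next four pairs in the high 128 bits.
    __m128i reslo = _mm_unpacklo_epi16(alo, blo);
    __m128i reshi = _mm_unpackhi_epi16(alo, blo);

    return _mm256_set_m128i(reshi, reslo);

}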

I'm sure there's a better way to do this, but I don't understand it well enough to figure it out. What is special about the packing instructions that makes their logic not scale up as I expect?

aganm
  • AVX2 extended all existing instructions by making them do the same 128-bit operation twice in the two 128-bit lanes. This is inconvenient for a lot of shuffles. Very inconvenient for pack/unpack. Near-useless for `vpalignr`. It's not until AVX-512 that we get better shuffles (lane-crossing 2-input like `vpermt2d` and 1-input narrowing), but even then we don't get a saturating 2-input pack, only 1-input like `vpmovqd` and the U or S saturating variants: https://www.felixcloutier.com/x86/vpmovqd:vpmovsqd:vpmovusqd – Peter Cordes Jul 29 '23 at 10:12
  • One trick is to use `_mm256_packs_epi32` and then run a shuffle like `vpermq` to rearrange the results. It doesn't need 16-bit granularity because the elements you want are grouped into contiguous 64-bit chunks from narrowing each lane of the two input `__m256i` vectors. See [What is the inverse of "_mm256_cvtepi16_epi32"](https://stackoverflow.com/q/49721807). (Also related re AVX2 shuffles being dumb: [unexpected _mm256_shuffle_epi with __256i vectors](https://stackoverflow.com/q/46582438)) – Peter Cordes Jul 29 '23 at 10:14
  • @PeterCordes Oh my x86, it works! This makes total sense now that you told me why the instructions behave like that. Your advice to use `vpermq` worked perfectly: I put a `vpermq` right after packing with `_mm256_packs_epi32` and a `vpermq` right before unpacking with `_mm256_unpacklo_epi16`, and the code behaves like I expected it to (as sketched below). I will close this question since it seems to be a duplicate. Thanks a lot! – aganm Jul 29 '23 at 10:42
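
Below is a minimal sketch of the vpermq fixup described in the comments, assuming AVX2 and <immintrin.h>; the function names pack_256_fixed and unpack_256_fixed are only illustrative, and _MM_SHUFFLE(3, 1, 2, 0) is the vpermq control that swaps the middle two 64-bit chunks:

__m256i pack_256_fixed(__m256i a, __m256i b) {
    // _mm256_packs_epi32 packs each 128-bit lane separately, so the four
    // 64-bit chunks of the result are ordered [a_lo, b_lo, a_hi, b_hi].
    __m256i packed = _mm256_packs_epi32(a, b);
    // Reorder the chunks to [a_lo, a_hi, b_lo, b_hi], which is what a
    // whole-register pack would have produced.
    return _mm256_permute4x64_epi64(packed, _MM_SHUFFLE(3, 1, 2, 0));
}

__m256i unpack_256_fixed(__m256i a, __m256i b) {
    // Pre-shuffle each input so the low half of lane 0 holds 16-bit
    // elements 0-3 and the low half of lane 1 holds elements 4-7.
    __m256i ap = _mm256_permute4x64_epi64(a, _MM_SHUFFLE(3, 1, 2, 0));
    __m256i bp = _mm256_permute4x64_epi64(b, _MM_SHUFFLE(3, 1, 2, 0));
    // The in-lane interleave now matches a whole-register unpacklo of the
    // low 128-bit halves of a and b.
    return _mm256_unpacklo_epi16(ap, bp);
}

Each fixup costs one extra lane-crossing shuffle (vpermq) per pack/unpack, which is typically cheaper than the extract/insert round trip in the split version above.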

0 Answers