
I am looking for a way to permute the 1-byte and/or 2-byte values in an __m256i register using AVX2 instructions. The solution needs to be able to move values across 128-bit lanes.

I know that with AVX512 I could use _mm256_permutexvar_epi8 and _mm256_permutexvar_epi16, but I can't seem to find any generic AVX2 solution for when the values need to cross lanes. (If the values can stay within their lane, _mm256_shuffle_epi8 or _mm256_shufflehi_epi16 / _mm256_shufflelo_epi16 works.)

The permutation indices will be known at compile time.

Peter Cordes
Noah
  • You can't do it in one instruction, that's why `vpermb` requires AVX512VBMI. [Where is VPERMB in AVX2?](https://stackoverflow.com/q/37980209). If you need a fully-general thing that works for any runtime-variable vector, you'll have to emulate it with maybe 2x `_mm256_shuffle_epi8` and a blend, or something like that. (You'd have to lane-swap the input for one `vpshufb` with `vpermq` or something, so that's at least 3 shuffles). Otherwise hopefully you can do something more efficient. – Peter Cordes Oct 11 '20 at 08:06
  • Any improvements if the permutation vector is known at compile time (but could be anything)? – Noah Oct 11 '20 at 08:09
  • `_mm256_shufflehi_epi16` is `vpshufhw`, and only works for immediate constants. You can't emulate runtime-variable byte shuffles with it. If your shuffle is a compile-time constant, again you should be looking at doing something more clever. **Agner Fog's VCL has some template metaprogramming to try to find efficient ways to implement arbitrary shuffles.** https://github.com/vectorclass/version2 – Peter Cordes Oct 11 '20 at 08:09
  • I see. Should have checked that earlier. Thank you! – Noah Oct 11 '20 at 08:11
  • Since the permutation is known at compile time, if you're using GCC or clang you can use `__builtin_shuffle` (GCC) or `__builtin_shufflevector` (clang). They generally do a very good job selecting the best instructions. – nemequ Oct 12 '20 at 02:52
  • @nemequ this is great. Clang has this really optimized (gcc appears to be missing a few cases). – Noah Oct 12 '20 at 20:31
  • @Noah, if you want an abstraction that works on both compilers, feel free to steal https://github.com/simd-everywhere/simde/blob/master/simde/simde-common.h#L302-L326 (the file is MIT, but if you need it under a different license let me know; that macro is exclusively my fault so I can give you a license under other terms). – nemequ Oct 12 '20 at 20:53
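
The emulation described in the first comment (a lane swap, two `vpshufb`, and a blend) might look roughly like the sketch below. It is only an illustration, not code from the thread: the function name `permute_bytes_avx2` is hypothetical, every index is assumed to be in the range 0..31, and the `target("avx2")` attribute assumes GCC or clang (alternatively, compile with `-mavx2`).

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: dst[i] = src[idx[i]] for 32 bytes.
   Assumes every idx[i] is in 0..31. */
__attribute__((target("avx2")))
void permute_bytes_avx2(const uint8_t src[32], const uint8_t idx[32],
                        uint8_t dst[32])
{
    __m256i v = _mm256_loadu_si256((const __m256i *)src);
    __m256i i = _mm256_loadu_si256((const __m256i *)idx);

    /* vpermq: swap the two 128-bit lanes of the input. */
    __m256i swapped = _mm256_permute4x64_epi64(v, 0x4E);

    /* 0 for byte positions in the low lane, 16 for the high lane. */
    const __m256i lane_base = _mm256_setr_epi8(
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16);

    /* Bit 4 of (idx ^ lane_base) is set when the source byte lives in the
       other lane. Shifting left by 3 moves it to bit 7 of each byte, which
       is the only bit vpblendvb looks at. */
    __m256i cross = _mm256_slli_epi16(_mm256_xor_si256(i, lane_base), 3);

    /* vpshufb uses only the low 4 index bits (bit 7 is clear here), so the
       same index vector selects within a lane for both inputs. */
    __m256i same  = _mm256_shuffle_epi8(v, i);
    __m256i other = _mm256_shuffle_epi8(swapped, i);

    /* Take the byte from the lane-swapped shuffle where the source crosses. */
    __m256i r = _mm256_blendv_epi8(same, other, cross);
    _mm256_storeu_si256((__m256i *)dst, r);
}
```

With compile-time-constant indices, the index vector and the blend mask become constants, so only the `vpermq`, the two `vpshufb`, and the blend remain at run time. Specific permutations can often be done with fewer instructions than this general form.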

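The compiler-builtin approach from the later comments can be sketched as follows, assuming GCC or clang with their vector extensions. The type name `u8x32`, the function name `reverse_bytes_32`, and the choice of a full byte reversal (in which every byte crosses the 128-bit lane boundary) are illustrative assumptions, not something taken from the thread. When built with `-mavx2`, the compiler is free to pick whatever instruction sequence it finds best for the constant indices.

```c
#include <stdint.h>
#include <string.h>

/* 32 x uint8_t, the same size as an __m256i (GCC/clang vector extension). */
typedef uint8_t u8x32 __attribute__((vector_size(32)));

/* Hypothetical example permutation: reverse all 32 bytes, so every byte
   moves to the other 128-bit lane. The indices are compile-time constants. */
void reverse_bytes_32(const uint8_t src[32], uint8_t dst[32])
{
    u8x32 v, r;
    memcpy(&v, src, sizeof v);
#if defined(__clang__)
    /* clang: indices are passed as constant integer arguments. */
    r = __builtin_shufflevector(v, v,
        31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16,
        15, 14, 13, 12, 11, 10,  9,  8,  7,  6,  5,  4,  3,  2,  1,  0);
#else
    /* GCC: indices are passed as a vector with the same element layout. */
    const u8x32 idx = {
        31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16,
        15, 14, 13, 12, 11, 10,  9,  8,  7,  6,  5,  4,  3,  2,  1,  0};
    r = __builtin_shuffle(v, idx);
#endif
    memcpy(dst, &r, sizeof r);
}
```

The `memcpy` calls only keep the sketch independent of alignment and calling-convention details. In real code the vector type could be passed and returned directly.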