_mm256_permutexvar_epi8 and _mm256_permutexvar_epi16 AVX2 equivalents for compile-time-constant shuffles?

Question

I am looking for a way to permute the 1-byte and/or 2-byte values in an __m256i register using AVX2 instructions. The solution needs to be able to move values across 128-bit lanes.

I know that with AVX512 I could use _mm256_permutexvar_epi8 and _mm256_permutexvar_epi16, but I can't seem to find any generic solution with AVX2 for when the values need to cross lanes (if the values can stay within their lane, _mm256_shuffle_epi8 or _mm256_shufflehi_epi16 / _mm256_shufflelo_epi16 works).

The permutation indices will be known at compile time.

You can't do it in one instruction, that's why `vpermb` requires AVX512VBMI. [Where is VPERMB in AVX2?](https://stackoverflow.com/q/37980209). If you need a fully-general thing that works for any runtime-variable vector, you'll have to emulate it with maybe 2x `_mm256_shuffle_epi8` and a blend, or something like that. (You'd have to lane-swap the input for one `vpshufb` with `vpermq` or something, so that's at least 3 shuffles). Otherwise hopefully you can do something more efficient. — Peter Cordes, Oct 11 '20 at 08:06
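The emulation described in the comment above can be sketched as follows. This is one possible variant, not necessarily the exact sequence the commenter had in mind: it broadcasts each 128-bit half of the source to both lanes, runs `vpshufb` on each copy, and blends on bit 4 of the index. The names `permutexvar_epi8_avx2` and `permute_bytes_avx2` are hypothetical helpers introduced for illustration, and the `target("avx2")` attribute assumes GCC or clang.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper (not a real intrinsic): out[i] = src[idx[i] & 31]. */
static inline __attribute__((target("avx2")))
__m256i permutexvar_epi8_avx2(__m256i idx, __m256i src)
{
    /* Broadcast each 128-bit half of src into both lanes. */
    __m256i lo = _mm256_permute2x128_si256(src, src, 0x00); /* [src.lo, src.lo] */
    __m256i hi = _mm256_permute2x128_si256(src, src, 0x11); /* [src.hi, src.hi] */

    /* Keep only the low 5 index bits so bit 7 is clear (vpshufb zeroes
       a byte when bit 7 of its index is set). */
    idx = _mm256_and_si256(idx, _mm256_set1_epi8(0x1F));

    /* vpshufb uses the low 4 bits of each index within its own lane. */
    __m256i from_lo = _mm256_shuffle_epi8(lo, idx);
    __m256i from_hi = _mm256_shuffle_epi8(hi, idx);

    /* Move index bit 4 into bit 7 of each byte; vpblendvb selects on bit 7. */
    __m256i sel = _mm256_slli_epi16(idx, 3);
    return _mm256_blendv_epi8(from_lo, from_hi, sel);
}

/* Array-based wrapper so callers need not be compiled with AVX2 enabled. */
__attribute__((target("avx2")))
void permute_bytes_avx2(uint8_t out[32], const uint8_t src[32],
                        const uint8_t idx[32])
{
    __m256i s = _mm256_loadu_si256((const __m256i *)src);
    __m256i i = _mm256_loadu_si256((const __m256i *)idx);
    _mm256_storeu_si256((__m256i *)out, permutexvar_epi8_avx2(i, s));
}
```

That is two lane broadcasts, two byte shuffles, and a blend, plus the index masking and shift. For a compile-time-constant permutation the index-dependent work (the mask, the shift, and often one of the shuffles) can be precomputed or dropped, which is why a specialized sequence is usually cheaper.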
Any improvements if the permutation vector is known at compile time (but could be anything)? — Noah, Oct 11 '20 at 08:09
`_mm256_shufflehi_epi16` is `vpshufhw`, and only works for immediate constants. You can't emulate runtime-variable byte shuffles with it. If your shuffle is a compile-time constant, again you should be looking at doing something more clever. **Agner Fog's VCL has some template metaprogramming to try to find efficient ways to implement arbitrary shuffles.** https://github.com/vectorclass/version2 — Peter Cordes, Oct 11 '20 at 08:09
Since the permutation is known at compile time, if you're using GCC or clang you can use `__builtin_shuffle` (GCC) or `__builtin_shufflevector` (clang). They generally do a very good job selecting the best instructions. — nemequ, Oct 12 '20 at 02:52
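A minimal sketch of that suggestion, using the GCC/clang vector extensions rather than `__m256i` directly. The permutation (reversing all 32 bytes, which crosses the 128-bit lane boundary) and the names `v32u8` and `reverse_bytes_mem` are assumptions chosen for illustration; any constant index list could be substituted.

```c
#include <stdint.h>
#include <string.h>

/* 32 x uint8_t generic vector (GCC/clang vector extension). */
typedef uint8_t v32u8 __attribute__((vector_size(32)));

/* Hypothetical example: out[i] = in[31 - i], a cross-lane byte permutation. */
void reverse_bytes_mem(uint8_t out[32], const uint8_t in[32])
{
    v32u8 v;
    memcpy(&v, in, sizeof v);
#if defined(__clang__)
    /* clang: the indices are constant arguments selecting from concat(v, v). */
    v = __builtin_shufflevector(v, v,
        31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16,
        15, 14, 13, 12, 11, 10,  9,  8,  7,  6,  5,  4,  3,  2,  1,  0);
#else
    /* GCC: the indices are passed as an integer vector of the same shape. */
    const v32u8 idx = {
        31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16,
        15, 14, 13, 12, 11, 10,  9,  8,  7,  6,  5,  4,  3,  2,  1,  0 };
    v = __builtin_shuffle(v, idx);
#endif
    memcpy(out, &v, sizeof v);
}
```

When built with `-mavx2`, the compiler is free to pick whatever sequence it considers best for the constant indices; for a reversal like this one would expect something close to an in-lane `vpshufb` plus a lane swap, though the exact output depends on the compiler and version, so it is worth checking the generated assembly.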
@nemequ this is great. Clang has this really optimized (gcc appears to be missing a few cases). — Noah, Oct 12 '20 at 20:31
@Noah, if you want an abstraction that works on both compilers, feel free to steal https://github.com/simd-everywhere/simde/blob/master/simde/simde-common.h#L302-L326 (the file is MIT, but if you need it under a different license let me know; that macro is exclusively my fault so I can give you a license under other terms). — nemequ, Oct 12 '20 at 20:53

0 Answers