shuffle() function and SIMD code generation

Question

I've been thinking about ecatmur's constexpr swap() function and I believe it's a special case of a more generic shuffle() function:

template <std::size_t ...I, std::size_t ...J, typename T>
constexpr T shuffle(T const i, std::index_sequence<J...>) noexcept
{
  return ((T(std::uint8_t(i >> 8 * I)) << 8 * J) | ...);
}

I are the source byte indices and J are the destination byte indices: byte I of the input ends up as byte J of the result. There are many ways to implement shuffle() (I'll spare you the details), but in my experience the implementations don't induce gcc and clang to generate SIMD code equally well when shuffle() is invoked in a loop. Hence my question: is there a formulation of shuffle() that clang and gcc are more willing to SIMDify than the existing one, maybe using built-in functions or intrinsics? I am not aiming at a specific instruction set.
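For readers unfamiliar with the calling convention, here is a minimal sketch (assuming C++17) of how shuffle() might be invoked: the source indices I are passed explicitly and the destination indices J are deduced from the std::index_sequence argument. The sketch widens each extracted byte to T before the left shift, which the shift needs once T is wider than int. The names reverse_bytes and reverse_all are hypothetical and only illustrate the kind of loop the question has in mind.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>

// shuffle() as in the question: byte I of the input becomes byte J of the
// result. Each byte is widened to T before it is shifted into place.
template <std::size_t ...I, std::size_t ...J, typename T>
constexpr T shuffle(T const i, std::index_sequence<J...>) noexcept
{
  return ((T(std::uint8_t(i >> 8 * I)) << 8 * J) | ...);
}

// Hypothetical helper: reverse the four bytes of a 32-bit value.
constexpr std::uint32_t reverse_bytes(std::uint32_t const v) noexcept
{
  return shuffle<0, 1, 2, 3>(v, std::index_sequence<3, 2, 1, 0>{});
}

// Checked at compile time: 0x11223344 with its bytes reversed.
static_assert(reverse_bytes(0x11223344u) == 0x44332211u);

// Hypothetical loop of the kind a compiler may or may not auto-vectorize.
void reverse_all(std::uint32_t* const p, std::size_t const n) noexcept
{
  for (std::size_t k{}; k != n; ++k) p[k] = reverse_bytes(p[k]);
}
```

Whether a given compiler turns reverse_all into SIMD code depends on the compiler version and flags, so the generated assembly is worth inspecting rather than assuming.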

Modern clang is somewhat good at auto-vectorizing shuffles, compared to other compilers. At least with compile-time constant data-movement I think I've seen it succeed. — Peter Cordes, Nov 26 '20 at 12:36
It's very good, I'm not disputing that, but gcc isn't :) They both auto-vectorize this horror and clang does a better job. I'm just trying to see, if there's an optimal solution for both compilers. — user1095108, Nov 26 '20 at 12:37

score 2 · Answer 1 · answered Aug 06 '22 at 13:28

template <std::size_t ...I, std::size_t ...J, typename T>
constexpr T shuffle(T const i, std::index_sequence<J...>) noexcept
{
  return (((T{0xff} << 8 * J) & (I < J ? i << 8 * (J - I) : i >> 8 * (I - J))) | ...);
}

Here each term ANDs a constant mask with the result of a single shift of i. The terms do not depend on each other, which makes the expression better suited for vectorization.
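As a sanity check, the two formulations can be compared at compile time. The following is a minimal sketch, assuming C++17; the names shuffle_cast, shuffle_mask, reverse64_cast and reverse64_mask are hypothetical and exist only to keep the two variants apart. It shows that they agree on a 64-bit byte reversal, not that either one vectorizes better.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>

// Extract-and-place formulation: pull out byte I, widen it, move it to J.
template <std::size_t ...I, std::size_t ...J, typename T>
constexpr T shuffle_cast(T const i, std::index_sequence<J...>) noexcept
{
  return ((T(std::uint8_t(i >> 8 * I)) << 8 * J) | ...);
}

// Mask formulation: shift the whole value by the byte distance between
// I and J, then AND with a constant mask that selects byte J.
template <std::size_t ...I, std::size_t ...J, typename T>
constexpr T shuffle_mask(T const i, std::index_sequence<J...>) noexcept
{
  return (((T{0xff} << 8 * J) &
    (I < J ? i << 8 * (J - I) : i >> 8 * (I - J))) | ...);
}

// Hypothetical helpers: reverse all eight bytes with each formulation.
constexpr std::uint64_t reverse64_cast(std::uint64_t const v) noexcept
{
  return shuffle_cast<0, 1, 2, 3, 4, 5, 6, 7>(v,
    std::index_sequence<7, 6, 5, 4, 3, 2, 1, 0>{});
}

constexpr std::uint64_t reverse64_mask(std::uint64_t const v) noexcept
{
  return shuffle_mask<0, 1, 2, 3, 4, 5, 6, 7>(v,
    std::index_sequence<7, 6, 5, 4, 3, 2, 1, 0>{});
}

static_assert(reverse64_cast(0x0102030405060708ull) == 0x0807060504030201ull);
static_assert(reverse64_mask(0x0102030405060708ull) == 0x0807060504030201ull);
```

Only the selected branch of the conditional is evaluated, so the out-of-range shift count in the other branch is never executed, although some compilers may still warn about it.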
