I've been thinking about ecatmur's constexpr swap() function and I believe it's a special case of a more generic shuffle() function:

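#include <cstddef>
#include <cstdint>
#include <utility>

template <std::size_t ...I, std::size_t ...J, typename T>
constexpr T shuffle(T const i, std::index_sequence<J...>) noexcept
{
  // Widen the extracted byte to T first; a bare std::uint8_t promotes to int,
  // so shifting it by 32 bits or more would be undefined for a 64-bit T.
  return ((T(std::uint8_t(i >> 8 * I)) << 8 * J) | ...);
}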

I... are the source byte indices and J... the destination byte indices. There are many ways to implement shuffle() (I'll spare you the details), but in my experience they don't induce gcc and clang to generate SIMD code equally well when shuffle() is invoked in a loop. Hence my question: is there a formulation of shuffle() that clang and gcc are more willing to SIMDify than the one above, maybe using built-in functions or intrinsics? I am not aiming at a specific instruction set.
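To make the index convention concrete, here is a minimal sketch of a byte swap written as a call to shuffle(). The helper name bswap32 is hypothetical, and the cast of each extracted byte to T before the left shift is an adjustment made here so that the shift is carried out in T rather than in int; everything else follows the definition above.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>

template <std::size_t ...I, std::size_t ...J, typename T>
constexpr T shuffle(T const i, std::index_sequence<J...>) noexcept
{
  // Extract source byte I, widen it to T, then move it to destination byte J.
  return ((T(std::uint8_t(i >> 8 * I)) << 8 * J) | ...);
}

// Hypothetical helper: reverse the four bytes of a 32-bit value.
// Source bytes 0,1,2,3 are sent to destination bytes 3,2,1,0.
constexpr std::uint32_t bswap32(std::uint32_t const v) noexcept
{
  return shuffle<0, 1, 2, 3>(v, std::index_sequence<3, 2, 1, 0>{});
}

static_assert(bswap32(0x11223344u) == 0x44332211u);
```

The source indices are passed explicitly, while the destination indices are deduced from the std::index_sequence argument, so both packs must have the same length.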

Peter Cordes
user1095108
  • Modern clang is somewhat good at auto-vectorizing shuffles, compared to other compilers. At least with compile-time constant data-movement I think I've seen it succeed. – Peter Cordes Nov 26 '20 at 12:36
  • It's very good, I'm not disputing that, but gcc isn't :) They both auto-vectorize this horror, and clang does a better job. I'm just trying to see if there's an optimal solution for both compilers. – user1095108 Nov 26 '20 at 12:37

1 Answer

template <std::size_t ...I, std::size_t ...J, typename T>
constexpr T shuffle(T const i, std::index_sequence<J...>) noexcept
{
  // The operand of a fold expression must be parenthesized as a whole.
  return (((T{0xff} << 8 * J) & (I < J ? i << 8 * (J - I) : i >> 8 * (I - J))) | ...);
}

Each term now ANDs a compile-time constant mask with the result of a single shift of i. The mask and the shift do not depend on each other, and neither do the terms, which makes the expression better suited for vectorization.
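As a rough illustration of how this formulation might be used in a loop, the sketch below rotates the bytes of every element of an array. The names rotate_bytes and check are hypothetical, and nothing here guarantees that a given compiler actually vectorizes the loop; that still has to be verified by inspecting the generated code.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>

// Mask-and-shift formulation, parenthesized so it is a valid fold operand.
template <std::size_t ...I, std::size_t ...J, typename T>
constexpr T shuffle(T const i, std::index_sequence<J...>) noexcept
{
  return (((T{0xff} << 8 * J) &
    (I < J ? i << 8 * (J - I) : i >> 8 * (I - J))) | ...);
}

// Hypothetical loop: source bytes 0,1,2,3 go to destination bytes 1,2,3,0,
// i.e. every element is rotated left by one byte.
template <std::size_t N>
constexpr void rotate_bytes(std::uint32_t (&a)[N]) noexcept
{
  for (auto& v : a)
    v = shuffle<0, 1, 2, 3>(v, std::index_sequence<1, 2, 3, 0>{});
}

// Hypothetical compile-time check of the loop on two elements.
constexpr bool check() noexcept
{
  std::uint32_t a[]{0x11223344u, 0xAABBCCDDu};
  rotate_bytes(a);
  return a[0] == 0x22334411u && a[1] == 0xBBCCDDAAu;
}

static_assert(check());
```

Only the selected branch of the conditional is evaluated, so the wrapped-around shift count in the other branch is never used, although some compilers may still warn about it.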

user1095108