transpose of 64bit elements using only avx, not avx2

Question

I want to implement a 64-bit transpose operation using only avx, not avx2. It should do this:

// in  = Hh Hl Lh Ll
//        |   X   |
// out = Hh Lh Hl Ll

This is how it would look with avx2:

#define SIMD_INLINE inline __attribute__ ((always_inline))

static SIMD_INLINE __m256i
x_mm256_transpose4x64_epi64(__m256i a)
{
  return _mm256_permute4x64_epi64(a, _MM_SHUFFLE(3,1,2,0));
}

This is the most efficient workaround without avx2 I could come up with (using 3 avx instructions):

static SIMD_INLINE __m256i
x_mm256_transpose4x64_epi64(__m256i a)
{
  __m256d in, x1, x2;
  // in = Hh Hl Lh Ll
  in = _mm256_castsi256_pd(a);
  // only lower 4 bit are used
  // in = Hh Hl Lh Ll
  //       0  1  0  1  = (0,0,1,1)
  // x1 = Hl Hh Ll Lh
  x1 = _mm256_permute_pd(in, _MM_SHUFFLE(0,0,1,1));
  // all 8 bit are used
  // x1 = Hl Hh Ll Lh
  //       0  0  1  1
  // x2 = Ll Lh Hl Hh
  x2 = _mm256_permute2f128_pd(x1, x1, _MM_SHUFFLE(0,0,1,1));
  // only lower 4 bit are used
  // in = Hh Hl Lh Ll
  // x2 = Ll Lh Hl Hh
  //       0  1  1  0 = (0,0,1,2)
  // ret: Hh Lh Hl Ll
  return _mm256_castpd_si256(_mm256_blend_pd(in, x2, _MM_SHUFFLE(0,0,1,2)));
}

The problem is that most avx swizzle operations (e.g. unpack) are operating on 128-bit lanes and do not cross the lane boundary.

Can anyone produce a more efficient implementation? Thanks a lot!

[codereview.stackexchange.com](http://codereview.stackexchange.com) - in my opinion this is the more appropriate site for such questions. — Gluttton, Jun 14 '16 at 09:14
@Gluttton: I disagree - optimisation questions are perfectly on-topic here - codereview is more for working code that can be improved idiomatically or stylistically. — Paul R, Jun 14 '16 at 10:44
@Ralf: I doubt you'll improve on your current 3 instruction solution, given all the limitations of AVX, but maybe someone will prove me wrong. — Paul R, Jun 14 '16 at 10:45
@PaulR Non-specific optimization questions are on-topic on Code Review. However, there is a specific question in the title of this question, so I think it's a fine Stack Overflow question. — 200_success, Jun 14 '16 at 10:51
https://stackoverflow.com/questions/19516585/shifting-sse-avx-registers-32-bits-left-and-right-while-shifting-in-zeros — Z boson, Jun 15 '16 at 08:33

score 4 · Accepted Answer · answered Jun 14 '16 at 12:45

I think 3 instructions is the best you can do. _mm256_blend_pd is very cheap (like vblendps and vpblendd), running on 2 ports in SnB/IvB, and all 3 vector execution ports in Haswell and later. (i.e. as cheap as a vector XOR or AND.) The other two both need the shuffle port, and that's unavoidable.

You will have a bypass delay of 1 cycle on SnB-family CPUs when vblendpd forwards its data from the FP domain to an integer instruction. Although with AVX1, there aren't any 256b integer instructions to forward to.

(source: see Agner Fog's insn tables, linked from the x86 tag wiki. His Optimizing Assembly guide also has some nice tables of shuffles, but doesn't focus on the in-lane challenges of AVX/AVX2.)

This pattern is almost achievable with two instructions, but not quite.

vshufpd (_mm256_shuffle_pd) gives you an in-lane 2-source shuffle, but with limitations on the data movement. Like the original SSE2 version, each destination element can only come from a fixed source element. The 8-bit immediate has room to encode two selections from four source elements, but they kept the hardware simple and only used a 1 bit selector for each dest element. The 256b version does allow a different shuffle for each 128b lane, so 4 bits of the imm8 are significant for vpshufd ymm.

Anyway, since the upper lane need to take its high element from the original, but the low lane needs to take its high element from the perm128 vector, neither choice of src1, src2 ordering can do what we need.

vshufpd I think is a byte shorter to encode than vpermilpd imm8. The only use-case for the immediate forms of vpermilps / vpermilpd seems to be as a load-and-shuffle. (vshufpd only works as a full in-lane shuffle when both source operands are the same). IDK if vpermildp might use less energy or something, since it only has one source.

Of course, compilers can use whatever instructions they want to get the job done; they're allowed to optimize code using intrinsics the same way they optimize code using the + operator (which doesn't always compile to an add instruction). Clang actually does basically ignore attempts at instruction-choice using intrinsics, since it represents shuffles in its own internal format, and optimizes them.

transpose of 64bit elements using only avx, not avx2

1 Answers1