I want to implement a transpose of the four 64-bit elements of a 256-bit vector (i.e. swap the two middle elements) using only AVX, not AVX2. It should do this:
// in = Hh Hl Lh Ll
// | X |
// out = Hh Lh Hl Ll
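To make the intended result unambiguous, here is a scalar sketch of the same permutation. The function name `transpose4x64_ref` and the array convention (index 0 is the lowest 64-bit element `Ll`, index 3 is `Hh`) are assumptions made only for illustration.

```c
#include <stdint.h>

/* Hypothetical scalar reference, not part of the SIMD code.
 * Elements are stored lowest first:
 * in[0] = Ll, in[1] = Lh, in[2] = Hl, in[3] = Hh. */
static void transpose4x64_ref(const uint64_t in[4], uint64_t out[4])
{
    /* Read the two middle elements first so that in == out is safe. */
    uint64_t lh = in[1];
    uint64_t hl = in[2];

    out[0] = in[0]; /* Ll stays in place */
    out[1] = hl;    /* Hl moves down into the low 128-bit lane */
    out[2] = lh;    /* Lh moves up into the high 128-bit lane */
    out[3] = in[3]; /* Hh stays in place */
}
```

Written high to low, the input `Hh Hl Lh Ll` becomes `Hh Lh Hl Ll`, which matches the diagram above.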
This is how it would look with AVX2:
#define SIMD_INLINE inline __attribute__ ((always_inline))
static SIMD_INLINE __m256i
x_mm256_transpose4x64_epi64(__m256i a)
{
    return _mm256_permute4x64_epi64(a, _MM_SHUFFLE(3,1,2,0));
}
This is the most efficient workaround without AVX2 that I could come up with (it uses 3 AVX instructions):
static SIMD_INLINE __m256i
x_mm256_transpose4x64_epi64(__m256i a)
{
    __m256d in, x1, x2;
    // in = Hh Hl Lh Ll
    in = _mm256_castsi256_pd(a);
    // only the lower 4 bits of the immediate are used (one per element)
    // in = Hh Hl Lh Ll
    //      0  1  0  1  = (0,0,1,1)
    // x1 = Hl Hh Ll Lh
    x1 = _mm256_permute_pd(in, _MM_SHUFFLE(0,0,1,1));
    // bits 1:0 select the low lane, bits 5:4 select the high lane
    // (bits 3 and 7 zero a lane; bits 2 and 6 are ignored)
    // x1 = Hl Hh Ll Lh
    // (0,0,1,1): high lane <- x1 low lane, low lane <- x1 high lane
    // x2 = Ll Lh Hl Hh
    x2 = _mm256_permute2f128_pd(x1, x1, _MM_SHUFFLE(0,0,1,1));
    // only the lower 4 bits of the immediate are used (one per element)
    // in = Hh Hl Lh Ll
    // x2 = Ll Lh Hl Hh
    //      0  1  1  0  = (0,0,1,2)
    // ret: Hh Lh Hl Ll
    return _mm256_castpd_si256(_mm256_blend_pd(in, x2, _MM_SHUFFLE(0,0,1,2)));
}
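To double-check the immediates without AVX hardware, the three steps can be modelled in plain C. This is only a sketch: the `model_*` functions are hypothetical scalar stand-ins for `_mm256_permute_pd`, `_mm256_permute2f128_pd` and `_mm256_blend_pd`, written from my reading of the documented immediate encodings, and arrays hold elements lowest first (index 0 = `Ll`, index 3 = `Hh`).

```c
#include <stdint.h>

/* Model of _mm256_permute_pd: bit i of imm picks the low (0) or high (1)
 * element of the 128-bit lane that output element i belongs to. */
static void model_permute_pd(const uint64_t a[4], unsigned imm, uint64_t r[4])
{
    for (int i = 0; i < 4; i++)
        r[i] = a[(i & 2) + ((imm >> i) & 1)];
}

/* Model of _mm256_permute2f128_pd: bits 1:0 choose the source lane for the
 * low half, bits 5:4 for the high half (0/1 = a low/high, 2/3 = b low/high).
 * Bits 3 and 7 zero the respective half; bits 2 and 6 are ignored. */
static void model_permute2f128(const uint64_t a[4], const uint64_t b[4],
                               unsigned imm, uint64_t r[4])
{
    for (int half = 0; half < 2; half++) {
        unsigned ctl = (imm >> (4 * half)) & 0xF;
        const uint64_t *src = (ctl & 2) ? b : a;
        unsigned lane = ctl & 1;
        for (int j = 0; j < 2; j++)
            r[2 * half + j] = (ctl & 8) ? 0 : src[2 * lane + j];
    }
}

/* Model of _mm256_blend_pd: a set bit i in imm takes element i from b. */
static void model_blend_pd(const uint64_t a[4], const uint64_t b[4],
                           unsigned imm, uint64_t r[4])
{
    for (int i = 0; i < 4; i++)
        r[i] = ((imm >> i) & 1) ? b[i] : a[i];
}

/* The three-step sequence, with the immediates written as plain numbers. */
static void model_transpose(const uint64_t in[4], uint64_t out[4])
{
    uint64_t x1[4], x2[4];
    model_permute_pd(in, 0x5, x1);        /* swap elements within each lane */
    model_permute2f128(x1, x1, 0x01, x2); /* swap the two 128-bit lanes */
    model_blend_pd(in, x2, 0x6, out);     /* take the two middle elements from x2 */
}
```

`_MM_SHUFFLE(0,0,1,1)` evaluates to 0x05 and `_MM_SHUFFLE(0,0,1,2)` to 0x06. For the lane permute, 0x05 and 0x01 select the same lanes because bit 2 is ignored, so the model uses the shorter 0x01.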
The problem is that most AVX swizzle operations (e.g. unpack) operate within the two 128-bit lanes independently and do not cross the lane boundary.
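As an illustration of the lane restriction, here is a scalar sketch of how I understand `_mm256_unpacklo_pd` and `_mm256_unpackhi_pd` to behave. The `model_*` names are hypothetical, and arrays again hold elements lowest first, with elements 0-1 forming the low lane and elements 2-3 the high lane.

```c
#include <stdint.h>

/* Model of _mm256_unpacklo_pd: interleaves the low element of each lane. */
static void model_unpacklo_pd(const uint64_t a[4], const uint64_t b[4],
                              uint64_t r[4])
{
    r[0] = a[0]; r[1] = b[0]; /* low lane uses only low-lane inputs */
    r[2] = a[2]; r[3] = b[2]; /* high lane uses only high-lane inputs */
}

/* Model of _mm256_unpackhi_pd: interleaves the high element of each lane. */
static void model_unpackhi_pd(const uint64_t a[4], const uint64_t b[4],
                              uint64_t r[4])
{
    r[0] = a[1]; r[1] = b[1]; /* low lane uses only low-lane inputs */
    r[2] = a[3]; r[3] = b[3]; /* high lane uses only high-lane inputs */
}
```

The desired output needs `Hl` (a high-lane element) in the low lane and `Lh` in the high lane, which neither of these can produce from the input alone. That is why the workaround needs a lane-crossing step.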
Can anyone suggest a more efficient implementation? Thanks a lot!