
I've run into another problem while working with AVX code. I have 4 ymm registers that need to be split vertically into 4 other ymm registers

(i.e. ymm0(ABCD) -> ymm4(A...), ymm5(B...), ymm6(C...), ymm7(D...)).

Here is an example:

// a, b, c, d are __m256d wrapper structs with [] operators to access x, y, z, w
__m256d A = _mm256_setr_pd(a[0], b[0], c[0], d[0]);
__m256d B = _mm256_setr_pd(a[1], b[1], c[1], d[1]);
__m256d C = _mm256_setr_pd(a[2], b[2], c[2], d[2]);
__m256d D = _mm256_setr_pd(a[3], b[3], c[3], d[3]);
Paul R
James Nguyen
  • Load 4 contiguous vectors from a, b, c, d, then do a 4x4 transpose (which can be implemented quite efficiently - see [this question](http://stackoverflow.com/q/36167517/253056)). – Paul R Apr 06 '17 at 10:17
  • @PaulR Thanks for the link. I didn't know I was asking how to do a 4x4 transposition of a matrix. – James Nguyen Apr 06 '17 at 21:36
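
The approach suggested in the comments (load four contiguous vectors, then transpose) can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name `transpose4x4`, the row-major `double[16]` layout, and the `AVX_TARGET` macro are all hypothetical and not from the original posts. It uses `_mm256_unpacklo_pd` / `_mm256_unpackhi_pd` for the in-lane interleave, followed by `_mm256_permute2f128_pd` to recombine the 128-bit halves.

```cpp
#include <cassert>
#include <immintrin.h>

// Assumption: lets the sketch build with GCC/Clang even without -mavx.
// Other compilers need AVX enabled through their own flags.
#if defined(__GNUC__) && !defined(__AVX__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

// Hypothetical helper: transposes a row-major 4x4 matrix of doubles.
// src and dst each point to 16 doubles; no alignment is assumed.
AVX_TARGET void transpose4x4(const double* src, double* dst)
{
    assert(src != nullptr && dst != nullptr);

    // Load the four rows a, b, c, d.
    __m256d a = _mm256_loadu_pd(src + 0);
    __m256d b = _mm256_loadu_pd(src + 4);
    __m256d c = _mm256_loadu_pd(src + 8);
    __m256d d = _mm256_loadu_pd(src + 12);

    // Interleave pairs of rows within each 128-bit lane.
    __m256d t0 = _mm256_unpacklo_pd(a, b); // a0 b0 a2 b2
    __m256d t1 = _mm256_unpackhi_pd(a, b); // a1 b1 a3 b3
    __m256d t2 = _mm256_unpacklo_pd(c, d); // c0 d0 c2 d2
    __m256d t3 = _mm256_unpackhi_pd(c, d); // c1 d1 c3 d3

    // 0x20 joins the two low halves, 0x31 joins the two high halves.
    _mm256_storeu_pd(dst + 0,  _mm256_permute2f128_pd(t0, t2, 0x20)); // a0 b0 c0 d0
    _mm256_storeu_pd(dst + 4,  _mm256_permute2f128_pd(t1, t3, 0x20)); // a1 b1 c1 d1
    _mm256_storeu_pd(dst + 8,  _mm256_permute2f128_pd(t0, t2, 0x31)); // a2 b2 c2 d2
    _mm256_storeu_pd(dst + 12, _mm256_permute2f128_pd(t1, t3, 0x31)); // a3 b3 c3 d3
}
```

All four loads happen before the first store, so passing the same pointer for `src` and `dst` should also work for an in-place transpose.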

1 Answer


Just putting Paul's comment into an answer:

My question turned out to be about how to do a matrix transposition, which is easily done in AVX, as shown in the link he provided.

Here's my implementation for anyone who comes across this:

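#include <immintrin.h>

// Transposes the 4x4 matrix of doubles held in A[0..3] into T[0..3].
void Transpose(__m256d* A, __m256d* T)
{
    // Interleave elements of row pairs within each 128-bit lane.
    __m256d t0 = _mm256_shuffle_pd(A[0], A[1], 0x0); // a0 b0 a2 b2
    __m256d t1 = _mm256_shuffle_pd(A[0], A[1], 0xF); // a1 b1 a3 b3
    __m256d t2 = _mm256_shuffle_pd(A[2], A[3], 0x0); // c0 d0 c2 d2
    __m256d t3 = _mm256_shuffle_pd(A[2], A[3], 0xF); // c1 d1 c3 d3
    // 0x20 joins the two low 128-bit halves, 0x31 joins the two high halves.
    T[0] = _mm256_permute2f128_pd(t0, t2, 0x20); // a0 b0 c0 d0
    T[1] = _mm256_permute2f128_pd(t1, t3, 0x20); // a1 b1 c1 d1
    T[2] = _mm256_permute2f128_pd(t0, t2, 0x31); // a2 b2 c2 d2
    T[3] = _mm256_permute2f128_pd(t1, t3, 0x31); // a3 b3 c3 d3
}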

With full optimization, this function compiles to roughly half as many instructions as my previous attempt.
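
To check that the transpose really produces the same registers as the element-by-element `_mm256_setr_pd` version from the question, something like the following could be used. It is a minimal sketch: `transpose_matches_gather`, the `double[16]` input layout, and the `AVX_TARGET` macro are hypothetical names introduced here for illustration, and `Transpose` is a copy of the function above with the immediates written in hex.

```cpp
#include <cassert>
#include <immintrin.h>

// Assumption: lets the sketch build with GCC/Clang even without -mavx.
#if defined(__GNUC__) && !defined(__AVX__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

// Copy of the answer's Transpose (0x20 == 0b0100000, 0x31 == 0b0110001).
AVX_TARGET void Transpose(__m256d* A, __m256d* T)
{
    __m256d t0 = _mm256_shuffle_pd(A[0], A[1], 0x0);
    __m256d t1 = _mm256_shuffle_pd(A[0], A[1], 0xF);
    __m256d t2 = _mm256_shuffle_pd(A[2], A[3], 0x0);
    __m256d t3 = _mm256_shuffle_pd(A[2], A[3], 0xF);
    T[0] = _mm256_permute2f128_pd(t0, t2, 0x20);
    T[1] = _mm256_permute2f128_pd(t1, t3, 0x20);
    T[2] = _mm256_permute2f128_pd(t0, t2, 0x31);
    T[3] = _mm256_permute2f128_pd(t1, t3, 0x31);
}

// Hypothetical check: m points to 16 doubles holding the rows a, b, c, d.
// Returns true if Transpose agrees with the scalar gather from the question.
AVX_TARGET bool transpose_matches_gather(const double* m)
{
    assert(m != nullptr);

    __m256d rows[4], cols[4];
    for (int r = 0; r < 4; ++r)
        rows[r] = _mm256_loadu_pd(m + 4 * r);
    Transpose(rows, cols);

    for (int c = 0; c < 4; ++c) {
        // Reference column built the way the question does it.
        __m256d ref = _mm256_setr_pd(m[c], m[4 + c], m[8 + c], m[12 + c]);
        double got[4], want[4];
        _mm256_storeu_pd(got, cols[c]);
        _mm256_storeu_pd(want, ref);
        for (int i = 0; i < 4; ++i)
            if (got[i] != want[i]) return false;
    }
    return true;
}
```

Calling `transpose_matches_gather` on any 16 distinct values should return `true` if the shuffle and permute immediates are right; a wrong immediate shows up as a mismatch in at least one column.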

James Nguyen