18

I'm currently writing some code targeting Intel's forthcoming AVX-512 SIMD instructions, which support 512-bit operations.

Now assuming there's a matrix represented by 16 SIMD registers, each holding 16 32-bit integers (corresponding to a row), how can I transpose the matrix with purely SIMD instructions?

There are already solutions for transposing 4x4 or 8x8 matrices with SSE and AVX2, respectively, but I couldn't figure out how to extend them to 16x16 with AVX-512.

Any ideas?

lei_z
  • Often the fastest way to do something is to do nothing instead - basically, give each matrix a "transposed" flag and just invert that flag. Of course this means that you need to check the "transposed" flag and swap column index and row index in any other code that might deal with transposed matrices. E.g. if you have a function to add 2 matrices you might end up with 3 cases (neither transposed, one transposed, both transposed) where the result of the addition is always a matrix that's not transposed. – Brendan Apr 12 '15 at 07:31
  • Out of curiosity, could you explain why you are interested in a 16x16 transpose? E.g. Is this for a kernel for a larger transpose? Do the reads/writes matter to you or is this generated data? – Z boson Apr 12 '15 at 18:00
  • @Zboson This is part of an encryption algorithm we're trying to optimize with AVX512. In fact we can use the gather instruction to transpose the matrix when loading from memory. But we managed to do this with SSE/AVX2 when there're no gather/scatter instructions, so I'm just curious how we can do the same thing with AVX512, i.e., in-register transposing. – lei_z Apr 13 '15 at 00:31
  • @lei.april, okay, in that case my solution does the in-register transposing so when AVX512 comes out you can compare it to the gather/scatter performance. It would be great if the gather/scatter performance was better but I would not count on it. – Z boson Apr 13 '15 at 07:20
  • 4
    @Zboson Some rough latency/throughput numbers are out for KNL. As expected, gather/scatter is still slow. 2 elements/cycle load, 1/cycle store. So 8 cycles/float-gather, and 16 cycles/float-scatter. IOW, the gather/scatter instructions are still breaking into separate uops for each element and going into their appropriate ports. It's just more efficient than in previous generations where they had a ton of other extra uops. – Mysticial Aug 17 '16 at 16:24
  • I didn't see a way to do better than this a year ago. And it looks like Intel doesn't have anything better yet. Since fundamentally, gather/scatter really are sets of smaller independent operations. I'm not entirely sure how GPU caches can handle massive parallel loads from hundreds of threads simultaneously, but they do seem to have problems with scatters. – Mysticial Aug 17 '16 at 16:26
  • 2
    @Mysticial the HPC group at work gave me an account on their Knights Landing card with AVX512. I tried my code and it worked first try. That's good to know. I have not done any performance tests yet. I got the account as of about 30 minutes ago. – Z boson Dec 19 '16 at 14:34
  • @Zboson Lucky you. I'm unlikely to get any AVX512 hardware until Skylake Purley. – Mysticial Dec 19 '16 at 15:39
  • @lei.april I just added a new answer to this question based on tests with real AVX512 hardware. Do you have AVX512 hardware yourself to test this? If so, what results did you get? I get that gather is still slower than my method. – Z boson Dec 21 '16 at 12:51

2 Answers

25

For two-operand SIMD instructions you can show that the number of operations necessary to transpose an nxn matrix is n*log_2(n), whereas using scalar operations it's O(n^2). In fact, later I'll show that the number of read and write operations using scalar registers is 2*n*(n-1). Below is a table showing the number of operations to transpose 4x4, 8x8, 16x16, and 32x32 matrices using SSE, AVX, AVX512, and AVX1024, compared to the scalar operations:

n            4(SSE)          8(AVX)    16(AVX512)    32(AVX1024)  
SIMD ops          8              24           64            160
SIMD +r/w ops    16              40           96            224     
Scalar r/w ops   24             112          480           1984

where SIMD +r/w ops includes the read and write operations (n*log_2(n) + 2*n).
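
These counts follow directly from the formulas; as a quick sanity check, here is a minimal sketch (separate from the transpose code itself) that reproduces the table:

#include <stdio.h>

// Reproduce the table above from the formulas:
// SIMD ops = n*log2(n), SIMD +r/w ops = n*log2(n) + 2*n, scalar r/w ops = 2*n*(n-1).
int main(void) {
    for (int n = 4; n <= 32; n *= 2) {
        int lg = 0;
        for (int m = n; m > 1; m >>= 1) lg++;   // log2(n) for a power of two
        printf("n=%2d: SIMD ops %4d, SIMD +r/w ops %4d, scalar r/w ops %5d\n",
               n, n * lg, n * lg + 2 * n, 2 * n * (n - 1));
    }
    return 0;
}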

The reason the SIMD transpose can be done in n*log_2(n) operations is that the algorithm is:

permute n 32-bit lanes
permute n 64-bit lanes
...
permute n (simd_width/2)-bit lanes

For example, for 4x4 there are 4 rows and therefore you have to permute 32-bit lanes 4 times and then 64-bit lanes 4 times. For 16x16 you have to permute 32-bit lanes, 64-bit lanes, 128-bit lanes, and finally 256-bit lanes, 16 times each.

I already showed that 8x8 can be done with 24 operations with AVX. So the question is how to do this for 16x16 using AVX512 in 64 operations? The general algorithm is:

interleave 32-bit lanes using 
    8x _mm512_unpacklo_epi32
    8x _mm512_unpackhi_epi32
interleave 64-bit lanes using
    8x _mm512_unpacklo_epi64 
    8x _mm512_unpackhi_epi64 
permute 128-bit lanes using
   16x _mm512_shuffle_i32x4
permute 256-bit lanes again using
   16x _mm512_shuffle_i32x4

Here is untested code doing this:

    //given __m512i r0, r1, r2, r3, r4, r5, r6, r7, r8, r9, ra, rb, rc, rd, re, rf;
    __m512i t0, t1, t2, t3, t4, t5, t6, t7, t8, t9, ta, tb, tc, td, te, tf;

    t0 = _mm512_unpacklo_epi32(r0,r1); //   0  16   1  17   4  20   5  21   8  24   9  25  12  28  13  29 
    t1 = _mm512_unpackhi_epi32(r0,r1); //   2  18   3  19   6  22   7  23  10  26  11  27  14  30  15  31
    t2 = _mm512_unpacklo_epi32(r2,r3); //  32  48  33  49 ...
    t3 = _mm512_unpackhi_epi32(r2,r3); //  34  50  35  51 ...
    t4 = _mm512_unpacklo_epi32(r4,r5); //  64  80  65  81 ...  
    t5 = _mm512_unpackhi_epi32(r4,r5); //  66  82  67  83 ...
    t6 = _mm512_unpacklo_epi32(r6,r7); //  96 112  97 113 ...
    t7 = _mm512_unpackhi_epi32(r6,r7); //  98 114  99 115 ...
    t8 = _mm512_unpacklo_epi32(r8,r9); // 128 ...
    t9 = _mm512_unpackhi_epi32(r8,r9); // 130 ...
    ta = _mm512_unpacklo_epi32(ra,rb); // 160 ...
    tb = _mm512_unpackhi_epi32(ra,rb); // 162 ...
    tc = _mm512_unpacklo_epi32(rc,rd); // 192 ...
    td = _mm512_unpackhi_epi32(rc,rd); // 194 ...
    te = _mm512_unpacklo_epi32(re,rf); // 224 ...
    tf = _mm512_unpackhi_epi32(re,rf); // 226 ...

    r0 = _mm512_unpacklo_epi64(t0,t2); //   0  16  32  48 ...
    r1 = _mm512_unpackhi_epi64(t0,t2); //   1  17  33  49 ...
    r2 = _mm512_unpacklo_epi64(t1,t3); //   2  18  34  50 ...
    r3 = _mm512_unpackhi_epi64(t1,t3); //   3  19  35  51 ...
    r4 = _mm512_unpacklo_epi64(t4,t6); //  64  80  96 112 ...
    r5 = _mm512_unpackhi_epi64(t4,t6); //  65  81  97 113 ...
    r6 = _mm512_unpacklo_epi64(t5,t7); //  66  82  98 114 ...
    r7 = _mm512_unpackhi_epi64(t5,t7); //  67  83  99 115 ...
    r8 = _mm512_unpacklo_epi64(t8,ta); // 128 144 160 176 ...
    r9 = _mm512_unpackhi_epi64(t8,ta); // 129 145 161 177 ...
    ra = _mm512_unpacklo_epi64(t9,tb); // 130 146 162 178 ...
    rb = _mm512_unpackhi_epi64(t9,tb); // 131 147 163 179 ...
    rc = _mm512_unpacklo_epi64(tc,te); // 192 208 224 240 ...
    rd = _mm512_unpackhi_epi64(tc,te); // 193 209 225 241 ...
    re = _mm512_unpacklo_epi64(td,tf); // 194 210 226 242 ...
    rf = _mm512_unpackhi_epi64(td,tf); // 195 211 227 243 ...

    t0 = _mm512_shuffle_i32x4(r0, r4, 0x88); //   0  16  32  48   8  24  40  56  64  80  96  112 ...
    t1 = _mm512_shuffle_i32x4(r1, r5, 0x88); //   1  17  33  49 ...
    t2 = _mm512_shuffle_i32x4(r2, r6, 0x88); //   2  18  34  50 ...
    t3 = _mm512_shuffle_i32x4(r3, r7, 0x88); //   3  19  35  51 ...
    t4 = _mm512_shuffle_i32x4(r0, r4, 0xdd); //   4  20  36  52 ...
    t5 = _mm512_shuffle_i32x4(r1, r5, 0xdd); //   5  21  37  53 ...
    t6 = _mm512_shuffle_i32x4(r2, r6, 0xdd); //   6  22  38  54 ...
    t7 = _mm512_shuffle_i32x4(r3, r7, 0xdd); //   7  23  39  55 ...
    t8 = _mm512_shuffle_i32x4(r8, rc, 0x88); // 128 144 160 176 ...
    t9 = _mm512_shuffle_i32x4(r9, rd, 0x88); // 129 145 161 177 ...
    ta = _mm512_shuffle_i32x4(ra, re, 0x88); // 130 146 162 178 ...
    tb = _mm512_shuffle_i32x4(rb, rf, 0x88); // 131 147 163 179 ...
    tc = _mm512_shuffle_i32x4(r8, rc, 0xdd); // 132 148 164 180 ...
    td = _mm512_shuffle_i32x4(r9, rd, 0xdd); // 133 149 165 181 ...
    te = _mm512_shuffle_i32x4(ra, re, 0xdd); // 134 150 166 182 ...
    tf = _mm512_shuffle_i32x4(rb, rf, 0xdd); // 135 151 167 183 ...

    r0 = _mm512_shuffle_i32x4(t0, t8, 0x88); //   0  16  32  48  64  80  96 112 ... 240
    r1 = _mm512_shuffle_i32x4(t1, t9, 0x88); //   1  17  33  49  65  81  97 113 ... 241
    r2 = _mm512_shuffle_i32x4(t2, ta, 0x88); //   2  18  34  50  66  82  98 114 ... 242
    r3 = _mm512_shuffle_i32x4(t3, tb, 0x88); //   3  19  35  51  67  83  99 115 ... 243
    r4 = _mm512_shuffle_i32x4(t4, tc, 0x88); //   4 ...
    r5 = _mm512_shuffle_i32x4(t5, td, 0x88); //   5 ...
    r6 = _mm512_shuffle_i32x4(t6, te, 0x88); //   6 ...
    r7 = _mm512_shuffle_i32x4(t7, tf, 0x88); //   7 ...
    r8 = _mm512_shuffle_i32x4(t0, t8, 0xdd); //   8 ...
    r9 = _mm512_shuffle_i32x4(t1, t9, 0xdd); //   9 ...
    ra = _mm512_shuffle_i32x4(t2, ta, 0xdd); //  10 ...
    rb = _mm512_shuffle_i32x4(t3, tb, 0xdd); //  11 ...
    rc = _mm512_shuffle_i32x4(t4, tc, 0xdd); //  12 ...
    rd = _mm512_shuffle_i32x4(t5, td, 0xdd); //  13 ...
    re = _mm512_shuffle_i32x4(t6, te, 0xdd); //  14 ...
    rf = _mm512_shuffle_i32x4(t7, tf, 0xdd); //  15  31  47  63  79  95 111 127 ... 255

I got the idea for using _mm512_shuffle_i32x4 by looking at transposing a 4x4 matrix with _mm_shuffle_ps (which is what MSVC uses in _MM_TRANSPOSE4_PS, but GCC and ICC do not).

__m128 tmp0 ,tmp1, tmp2, tmp3;
tmp0 = _mm_shuffle_ps(row0, row1, 0x88); // 0 2 4 6
tmp1 = _mm_shuffle_ps(row0, row1, 0xdd); // 1 3 5 7
tmp2 = _mm_shuffle_ps(row2, row3, 0x88); // 8 a c e
tmp3 = _mm_shuffle_ps(row2, row3, 0xdd); // 9 b d f

row0 = _mm_shuffle_ps(tmp0, tmp2, 0x88); // 0 4 8 c 
row1 = _mm_shuffle_ps(tmp1, tmp3, 0x88); // 1 5 9 d
row2 = _mm_shuffle_ps(tmp0, tmp2, 0xdd); // 2 6 a e 
row3 = _mm_shuffle_ps(tmp1, tmp3, 0xdd); // 3 7 b f

The same idea applies to _mm512_shuffle_i32x4, but now the lanes are 128-bit instead of 32-bit and there are 16 rows instead of 4.
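
For reference, here is a tiny standalone snippet (just a notational aside) showing how those two control bytes decode with _MM_SHUFFLE:

#include <stdio.h>
#include <xmmintrin.h>   // _MM_SHUFFLE

// 0x88 selects lanes {0,2} of the first source followed by {0,2} of the
// second; 0xdd selects lanes {1,3} of each. The same encoding applies to
// _mm_shuffle_ps (32-bit lanes) and _mm512_shuffle_i32x4 (128-bit lanes).
int main(void) {
    printf("0x%02x 0x%02x\n",
           (unsigned)_MM_SHUFFLE(2, 0, 2, 0),   // 0x88
           (unsigned)_MM_SHUFFLE(3, 1, 3, 1));  // 0xdd
    return 0;
}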

Finally, to compare with scalar operations, I modified Example 9.5a from Agner Fog's Optimizing C++ manual:

#define SIZE 16
void transpose(int a[SIZE][SIZE]) { // function to transpose matrix
    // define a macro to swap two array elements:
    #define swapd(x,y) {temp=x; x=y; y=temp;}
    int r, c; int temp;
    for (r = 1; r < SIZE; r++) {
        for (c = 0; c < r; c++) {
            swapd(a[r][c], a[c][r]);
        }
    }
}

This does n*(n-1)/2 swaps (because the diagonal does not need to be swapped). The swaps in the generated assembly for 16x16 look like:

mov     r8d, DWORD PTR [rax+68]
mov     r9d, DWORD PTR [rdx+68]
mov     DWORD PTR [rax+68], r9d
mov     DWORD PTR [rdx+68], r8d

Each swap is two reads and two writes, so the number of read/write operations using the scalar registers is 4*(n*(n-1)/2) = 2*n*(n-1).

Z boson
  • 2
    +1, as ugly as this is, it's probably still going to be faster than using 16 gather-loads. – Mysticial Apr 12 '15 at 10:51
  • @Mysticial, [is it true that only xeon and workstation Skylake processors will have AVX512](http://www.kitguru.net/components/cpu/anton-shilov/intel-skylake-processors-for-pcs-will-not-support-avx-512-instructions/)? If this is the case then what the #@$! is the point of Skylake??? This is very disappointing news if it's true. What makes Skylake a "tock" without AVX512? – Z boson Jun 01 '15 at 13:05
  • Yeah, I didn't realize it was that bad until the recent leak about Purley. It seems that it's going to be Knights Landing in Q1-2016 and Skylake Xeon with AVX512 in (late?) 2017. Intel processors have typically been divided into the notebook/low-end desktop (socket 115x) and server/high-end desktop (socket 2011-x) lines. It seems that AVX512 for Skylake will only be on the server/high-end desktop line for Skylake. That's potentially later than Cannonlake for notebook/low-end desktop. – Mysticial Jun 01 '15 at 13:30
  • Of course, I'm making these guesses based on the recent leaks as well as my (limited) knowledge of Intel's product lines. So I could definitely be wrong. There's a "Xeon Skylake" probably for socket 1151 coming out in Q3 2015. But it's probably just a glorified desktop processor so I'm not confident that it would have AVX512. – Mysticial Jun 01 '15 at 13:39
  • @Mysticial, okay, I think I understand. The "tock" in Skylake must be for the IGP. But the IGP does not appear to have a tick-tock cycle. It improves significantly with each generation. I really don't like this trend of more and more of Intel consumer processors being a IGP (the Silicon of my Haswell processor is more than 50% GPU). GPUs are closed "source" and are not typically programmed directly but through device drivers. Not only that, but double precision sucks. This is what happens when AMD does not offer competition. – Z boson Jun 01 '15 at 16:09
  • From a user-visible feature stand-point, Skylake isn't much. Architecturally, it's definitely a huge redesign to the pipeline to make room for AVX512 - which they aren't enabling yet. AVX512 isn't as simple as widening the execution units. Stuff like mask registers and 3-input dependency uops need to be designed into the pipeline. – Mysticial Jun 01 '15 at 17:47
  • As for why they aren't enabling AVX512 yet, I have several guesses: 1) There isn't enough silicon area on desktop 14nm. 2) It takes so much area that it affects yields. 3) Intel may be intentionally delaying to give AMD time to catch up or be ripped apart by anti-trust. – Mysticial Jun 01 '15 at 17:50
  • @Zboson Skylake client is a new uarch, which is what "tock" means (http://www.intel.com/content/www/us/en/silicon-innovations/intel-tick-tock-model-general.html). There is a significant boost in GPU performance (http://cdn.arstechnica.net/wp-content/uploads/2015/08/02.png), which is far more important in the client space than AVX-512. AVX-512 is primarily of interest to HPC folks, who are going to buy Xeon or Xeon Phi processors. – Jeff Hammond Dec 23 '15 at 00:14
  • @Jeff, AVX512 interests me because I have written a real time ray tracer (not just a ray caster) in OpenCL and it runs at least five times faster on my six year old GTX 580 than with every Intel processor I have used so far. – Z boson Dec 23 '15 at 07:51
  • Just for fun, someone should do a 64x64 byte-wide transpose. It's doable with AVX512-VBMI for sure. Maybe also with AVX512-BW. Not that I've tried, nor do I have a use for it. – Mysticial Aug 16 '16 at 21:36
  • @Mysticial, who has AVX512 hardware? Is available yet? – Z boson Aug 17 '16 at 09:09
  • @Mysticial, you might find [this](http://tomforsyth1000.github.io/blog.wiki.html#%5B%5BWhy%20didn%27t%20Larrabee%20fail%3F%5D%5D) interesting. – Z boson Aug 17 '16 at 12:48
  • I saw that article already. Yeah, interesting. KNL is available, but AFAICT, only through the Ninja developer platform for consumers. I don't see it anywhere else yet other than for universities and such. – Mysticial Aug 17 '16 at 16:15
  • @Mysticial I noticed you updated the AVX512 tag. You might want to mention that there will be some deep learning instruction extensions for AVX512 as well, `AVX512_4VNNIW` and `AVX512_4FMAPS`: http://lemire.me/blog/2016/10/14/intel-will-add-deep-learning-instructions-to-its-processors/. I just took a job in machine learning/AI and Big Data. – Z boson Oct 18 '16 at 11:35
  • @Zboson Woah wtf. This is new to me. And are they insane? Their mouth is now 3 generations ahead of their action. – Mysticial Oct 18 '16 at 14:57
  • Looks like they haven't released any details of the actual instructions yet. – Mysticial Oct 18 '16 at 15:01
  • 2
    Btw, Knights Landing permute/shuffles that pull from two vectors instead of one have half the throughput. I don't have the hardware to test it, but I think it may be faster to use some alternative such as: `_mm512_unpacklo_epi64(a, b) -> _mm512_mask_permutex_epi64(a, 0xaa, b, 177)` or `_mm512_shuffle_i64x2(a, b, 68) -> _mm512_inserti64x4(a, _mm512_castsi512_si256(b), 1)` – Mysticial Jan 20 '17 at 04:37
  • That first example in particular has the drawback of requiring a mask and being destructive, so you'll need a reg-reg move for every pair of them. But it generalizes to a lot of different shuffle combinations. It's possible to build an 8x8 or 16x16 transposes with only single-wide permute/shuffles which have 1/cycle throughput on KNL. The potential improvement is about 25% since the work drops from a ratio of 4 to 3 provided you have all the mask and permute vectors pre-loaded. – Mysticial Jan 20 '17 at 04:40
  • @Mysticial, I did something similar in my last update [here](http://stackoverflow.com/a/41262731/2542702) at the end but I used `permutexvar_` instead of `permutex_`. I will try your suggestions if I have time this week. – Z boson Jan 23 '17 at 10:44
  • How do your formulas scale when the row size is greater than the vector width? E.g., an 8x8 transpose but only using AVX2, not AVX-512. Or large transposes where even any future instruction set isn't going to be wide enough to hold a row (I'm not holding my breath for AVX-65536)? – BeeOnRope Oct 18 '19 at 18:25
6

I recently got access to Xeon Phi Knights Landing hardware, which has AVX512. Specifically, the hardware I am using is an Intel(R) Xeon Phi(TM) CPU 7250 @ 1.40GHz (http://ark.intel.com/products/94035/Intel-Xeon-Phi-Processor-7250-16GB-1_40-GHz-68-core). This is not an auxiliary card; the Xeon Phi is the main computer.

I compared the AVX512 gather instructions to my method here, https://stackoverflow.com/a/29587984/2542702, and it appears that gather is still slower. My code in that answer worked on the first try with no errors.

I have not written intrinsics in about three months or thought much about optimization in that time, so maybe my test is not robust enough. There is certainly some overhead, but nevertheless I feel confident that the results clearly show that gather is slower in this case.

I only tested with ICC 17.0.0, because the currently installed OS is CentOS 7.2 with Linux kernel 3.10 and GCC 4.8.5, and GCC 4.8 does not support AVX512. I may persuade the HPC group at my work to upgrade.

I looked at the assembly to make sure it was generating AVX512 instructions but I have not analyzed it carefully.

//icc -O3 -xCOMMON-AVX512 tran.c -fopenmp
#include <stdio.h>
#include <x86intrin.h>
#include <omp.h>    

void tran(int* mat, int* matT) {
    int i,j;

    __m512i t0, t1, t2, t3, t4, t5, t6, t7, t8, t9, ta, tb, tc, td, te, tf;
    __m512i r0, r1, r2, r3, r4, r5, r6, r7, r8, r9, ra, rb, rc, rd, re, rf;

    r0 = _mm512_load_epi32(&mat[ 0*16]);
    r1 = _mm512_load_epi32(&mat[ 1*16]);
    r2 = _mm512_load_epi32(&mat[ 2*16]);
    r3 = _mm512_load_epi32(&mat[ 3*16]);
    r4 = _mm512_load_epi32(&mat[ 4*16]);
    r5 = _mm512_load_epi32(&mat[ 5*16]);
    r6 = _mm512_load_epi32(&mat[ 6*16]);
    r7 = _mm512_load_epi32(&mat[ 7*16]);
    r8 = _mm512_load_epi32(&mat[ 8*16]);
    r9 = _mm512_load_epi32(&mat[ 9*16]);
    ra = _mm512_load_epi32(&mat[10*16]);
    rb = _mm512_load_epi32(&mat[11*16]);
    rc = _mm512_load_epi32(&mat[12*16]);
    rd = _mm512_load_epi32(&mat[13*16]);
    re = _mm512_load_epi32(&mat[14*16]);
    rf = _mm512_load_epi32(&mat[15*16]);

    t0 = _mm512_unpacklo_epi32(r0,r1); //   0  16   1  17   4  20   5  21   8  24   9  25  12  28  13  29 
    t1 = _mm512_unpackhi_epi32(r0,r1); //   2  18   3  19   6  22   7  23  10  26  11  27  14  30  15  31
    t2 = _mm512_unpacklo_epi32(r2,r3); //  32  48  33  49 ...
    t3 = _mm512_unpackhi_epi32(r2,r3); //  34  50  35  51 ...
    t4 = _mm512_unpacklo_epi32(r4,r5); //  64  80  65  81 ...  
    t5 = _mm512_unpackhi_epi32(r4,r5); //  66  82  67  83 ...
    t6 = _mm512_unpacklo_epi32(r6,r7); //  96 112  97 113 ...
    t7 = _mm512_unpackhi_epi32(r6,r7); //  98 114  99 115 ...
    t8 = _mm512_unpacklo_epi32(r8,r9); // 128 ...
    t9 = _mm512_unpackhi_epi32(r8,r9); // 130 ...
    ta = _mm512_unpacklo_epi32(ra,rb); // 160 ...
    tb = _mm512_unpackhi_epi32(ra,rb); // 162 ...
    tc = _mm512_unpacklo_epi32(rc,rd); // 196 ...
    td = _mm512_unpackhi_epi32(rc,rd); // 198 ...
    te = _mm512_unpacklo_epi32(re,rf); // 228 ...
    tf = _mm512_unpackhi_epi32(re,rf); // 230 ...

    r0 = _mm512_unpacklo_epi64(t0,t2); //   0  16  32  48 ...
    r1 = _mm512_unpackhi_epi64(t0,t2); //   1  17  33  49 ...
    r2 = _mm512_unpacklo_epi64(t1,t3); //   2  18  34  49 ...
    r3 = _mm512_unpackhi_epi64(t1,t3); //   3  19  35  51 ...
    r4 = _mm512_unpacklo_epi64(t4,t6); //  64  80  96 112 ...  
    r5 = _mm512_unpackhi_epi64(t4,t6); //  65  81  97 114 ...
    r6 = _mm512_unpacklo_epi64(t5,t7); //  66  82  98 113 ...
    r7 = _mm512_unpackhi_epi64(t5,t7); //  67  83  99 115 ...
    r8 = _mm512_unpacklo_epi64(t8,ta); // 128 144 160 176 ...  
    r9 = _mm512_unpackhi_epi64(t8,ta); // 129 145 161 178 ...
    ra = _mm512_unpacklo_epi64(t9,tb); // 130 146 162 177 ... 
    rb = _mm512_unpackhi_epi64(t9,tb); // 131 147 163 179 ...
    rc = _mm512_unpacklo_epi64(tc,te); // 192 208 228 240 ... 
    rd = _mm512_unpackhi_epi64(tc,te); // 193 209 229 241 ...
    re = _mm512_unpacklo_epi64(td,tf); // 194 210 230 242 ...
    rf = _mm512_unpackhi_epi64(td,tf); // 195 211 231 243 ...

    t0 = _mm512_shuffle_i32x4(r0, r4, 0x88); //   0  16  32  48   8  24  40  56  64  80  96  112 ...
    t1 = _mm512_shuffle_i32x4(r1, r5, 0x88); //   1  17  33  49 ...
    t2 = _mm512_shuffle_i32x4(r2, r6, 0x88); //   2  18  34  50 ...
    t3 = _mm512_shuffle_i32x4(r3, r7, 0x88); //   3  19  35  51 ...
    t4 = _mm512_shuffle_i32x4(r0, r4, 0xdd); //   4  20  36  52 ...
    t5 = _mm512_shuffle_i32x4(r1, r5, 0xdd); //   5  21  37  53 ...
    t6 = _mm512_shuffle_i32x4(r2, r6, 0xdd); //   6  22  38  54 ...
    t7 = _mm512_shuffle_i32x4(r3, r7, 0xdd); //   7  23  39  55 ...
    t8 = _mm512_shuffle_i32x4(r8, rc, 0x88); // 128 144 160 176 ...
    t9 = _mm512_shuffle_i32x4(r9, rd, 0x88); // 129 145 161 177 ...
    ta = _mm512_shuffle_i32x4(ra, re, 0x88); // 130 146 162 178 ...
    tb = _mm512_shuffle_i32x4(rb, rf, 0x88); // 131 147 163 179 ...
    tc = _mm512_shuffle_i32x4(r8, rc, 0xdd); // 132 148 164 180 ...
    td = _mm512_shuffle_i32x4(r9, rd, 0xdd); // 133 149 165 181 ...
    te = _mm512_shuffle_i32x4(ra, re, 0xdd); // 134 150 166 182 ...
    tf = _mm512_shuffle_i32x4(rb, rf, 0xdd); // 135 151 167 183 ...

    r0 = _mm512_shuffle_i32x4(t0, t8, 0x88); //   0  16  32  48  64  80  96 112 ... 240
    r1 = _mm512_shuffle_i32x4(t1, t9, 0x88); //   1  17  33  49  66  81  97 113 ... 241
    r2 = _mm512_shuffle_i32x4(t2, ta, 0x88); //   2  18  34  50  67  82  98 114 ... 242
    r3 = _mm512_shuffle_i32x4(t3, tb, 0x88); //   3  19  35  51  68  83  99 115 ... 243
    r4 = _mm512_shuffle_i32x4(t4, tc, 0x88); //   4 ...
    r5 = _mm512_shuffle_i32x4(t5, td, 0x88); //   5 ...
    r6 = _mm512_shuffle_i32x4(t6, te, 0x88); //   6 ...
    r7 = _mm512_shuffle_i32x4(t7, tf, 0x88); //   7 ...
    r8 = _mm512_shuffle_i32x4(t0, t8, 0xdd); //   8 ...
    r9 = _mm512_shuffle_i32x4(t1, t9, 0xdd); //   9 ...
    ra = _mm512_shuffle_i32x4(t2, ta, 0xdd); //  10 ...
    rb = _mm512_shuffle_i32x4(t3, tb, 0xdd); //  11 ...
    rc = _mm512_shuffle_i32x4(t4, tc, 0xdd); //  12 ...
    rd = _mm512_shuffle_i32x4(t5, td, 0xdd); //  13 ...
    re = _mm512_shuffle_i32x4(t6, te, 0xdd); //  14 ...
    rf = _mm512_shuffle_i32x4(t7, tf, 0xdd); //  15  31  47  63  79  96 111 127 ... 255

    _mm512_store_epi32(&matT[ 0*16], r0);
    _mm512_store_epi32(&matT[ 1*16], r1);
    _mm512_store_epi32(&matT[ 2*16], r2);
    _mm512_store_epi32(&matT[ 3*16], r3);
    _mm512_store_epi32(&matT[ 4*16], r4);
    _mm512_store_epi32(&matT[ 5*16], r5);
    _mm512_store_epi32(&matT[ 6*16], r6);
    _mm512_store_epi32(&matT[ 7*16], r7);
    _mm512_store_epi32(&matT[ 8*16], r8);
    _mm512_store_epi32(&matT[ 9*16], r9);
    _mm512_store_epi32(&matT[10*16], ra);
    _mm512_store_epi32(&matT[11*16], rb);
    _mm512_store_epi32(&matT[12*16], rc);
    _mm512_store_epi32(&matT[13*16], rd);
    _mm512_store_epi32(&matT[14*16], re);
    _mm512_store_epi32(&matT[15*16], rf);
}

void gather(int *mat, int *matT) {
    int i,j;
    int index[16] __attribute__((aligned(64)));

    __m512i vindex;

    for(i=0; i<16; i++) index[i] = 16*i;
    for(i=0; i<256; i++) mat[i] = i;
    vindex = _mm512_load_epi32(index);
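    // The gather in the loop below uses base &mat[i] with element indices
    // {0, 16, ..., 240} (scale 4 bytes), i.e. it reads column i of mat,
    // which is then stored as row i of matT.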

    for(i=0; i<16; i++) 
    _mm512_store_epi32(&matT[16*i], _mm512_i32gather_epi32(vindex, &mat[i], 4));
}

int verify(int *mat) {
    int i,j;
    int error = 0;
    for(i=0; i<16; i++) {
      for(j=0; j<16; j++) {
        if(mat[j*16+i] != i*16+j) error++;
      }
    }
    return error;
}

void print_mat(int *mat) {
    int i,j;
    for(i=0; i<16; i++) {
      for(j=0; j<16; j++) printf("%2X ", mat[i*16+j]);
      puts("");
    }
    puts("");
}

int main(void) {
    int i,j, rep;
    int mat[256] __attribute__((aligned(64)));
    int matT[256] __attribute__((aligned(64)));
    double dtime;

    rep = 10000000;
    for(i=0; i<256; i++) mat[i] = i;
    print_mat(mat);

    gather(mat, matT);
    for(i=0; i<256; i++) mat[i] = i;
    dtime = -omp_get_wtime();
    for(i=0; i<rep; i++) gather(mat, matT);
    dtime += omp_get_wtime();
    printf("errors %d\n", verify(matT));
    printf("dtime %f\n", dtime);
    print_mat(matT);

    tran(mat,matT);
    dtime = -omp_get_wtime();
    for(i=0; i<rep; i++) tran(mat, matT);
    dtime += omp_get_wtime();
    printf("errors %d\n", verify(matT));
    printf("dtime %f\n", dtime);
    print_mat(matT);
}

The gather function in this case takes 1.5 s and the tran function 1.15 s. If anyone sees an error or has any suggestions for my test please let me know. I am only beginning to get experience with AVX512 and Knights Landing.


I tried to remove some of the overhead and succeeded; nevertheless, gather still appears to be slower:

#include <stdio.h>
#include <x86intrin.h>
#include <omp.h>   

void tran(int* mat, int* matT, int rep) {
    int i;

    __m512i t0, t1, t2, t3, t4, t5, t6, t7, t8, t9, ta, tb, tc, td, te, tf;
    __m512i r0, r1, r2, r3, r4, r5, r6, r7, r8, r9, ra, rb, rc, rd, re, rf;

    for(i=0; i<rep; i++) {

    r0 = _mm512_load_epi32(&mat[ 0*16]);
    r1 = _mm512_load_epi32(&mat[ 1*16]);
    r2 = _mm512_load_epi32(&mat[ 2*16]);
    r3 = _mm512_load_epi32(&mat[ 3*16]);
    r4 = _mm512_load_epi32(&mat[ 4*16]);
    r5 = _mm512_load_epi32(&mat[ 5*16]);
    r6 = _mm512_load_epi32(&mat[ 6*16]);
    r7 = _mm512_load_epi32(&mat[ 7*16]);
    r8 = _mm512_load_epi32(&mat[ 8*16]);
    r9 = _mm512_load_epi32(&mat[ 9*16]);
    ra = _mm512_load_epi32(&mat[10*16]);
    rb = _mm512_load_epi32(&mat[11*16]);
    rc = _mm512_load_epi32(&mat[12*16]);
    rd = _mm512_load_epi32(&mat[13*16]);
    re = _mm512_load_epi32(&mat[14*16]);
    rf = _mm512_load_epi32(&mat[15*16]);

    t0 = _mm512_unpacklo_epi32(r0,r1); //   0  16   1  17   4  20   5  21   8  24   9  25  12  28  13  29 
    t1 = _mm512_unpackhi_epi32(r0,r1); //   2  18   3  19   6  22   7  23  10  26  11  27  14  30  15  31
    t2 = _mm512_unpacklo_epi32(r2,r3); //  32  48  33  49 ...
    t3 = _mm512_unpackhi_epi32(r2,r3); //  34  50  35  51 ...
    t4 = _mm512_unpacklo_epi32(r4,r5); //  64  80  65  81 ...  
    t5 = _mm512_unpackhi_epi32(r4,r5); //  66  82  67  83 ...
    t6 = _mm512_unpacklo_epi32(r6,r7); //  96 112  97 113 ...
    t7 = _mm512_unpackhi_epi32(r6,r7); //  98 114  99 115 ...
    t8 = _mm512_unpacklo_epi32(r8,r9); // 128 ...
    t9 = _mm512_unpackhi_epi32(r8,r9); // 130 ...
    ta = _mm512_unpacklo_epi32(ra,rb); // 160 ...
    tb = _mm512_unpackhi_epi32(ra,rb); // 162 ...
    tc = _mm512_unpacklo_epi32(rc,rd); // 196 ...
    td = _mm512_unpackhi_epi32(rc,rd); // 198 ...
    te = _mm512_unpacklo_epi32(re,rf); // 228 ...
    tf = _mm512_unpackhi_epi32(re,rf); // 230 ...

    r0 = _mm512_unpacklo_epi64(t0,t2); //   0  16  32  48 ...
    r1 = _mm512_unpackhi_epi64(t0,t2); //   1  17  33  49 ...
    r2 = _mm512_unpacklo_epi64(t1,t3); //   2  18  34  49 ...
    r3 = _mm512_unpackhi_epi64(t1,t3); //   3  19  35  51 ...
    r4 = _mm512_unpacklo_epi64(t4,t6); //  64  80  96 112 ...  
    r5 = _mm512_unpackhi_epi64(t4,t6); //  65  81  97 114 ...
    r6 = _mm512_unpacklo_epi64(t5,t7); //  66  82  98 113 ...
    r7 = _mm512_unpackhi_epi64(t5,t7); //  67  83  99 115 ...
    r8 = _mm512_unpacklo_epi64(t8,ta); // 128 144 160 176 ...  
    r9 = _mm512_unpackhi_epi64(t8,ta); // 129 145 161 178 ...
    ra = _mm512_unpacklo_epi64(t9,tb); // 130 146 162 177 ... 
    rb = _mm512_unpackhi_epi64(t9,tb); // 131 147 163 179 ...
    rc = _mm512_unpacklo_epi64(tc,te); // 192 208 228 240 ... 
    rd = _mm512_unpackhi_epi64(tc,te); // 193 209 229 241 ...
    re = _mm512_unpacklo_epi64(td,tf); // 194 210 230 242 ...
    rf = _mm512_unpackhi_epi64(td,tf); // 195 211 231 243 ...

    t0 = _mm512_shuffle_i32x4(r0, r4, 0x88); //   0  16  32  48   8  24  40  56  64  80  96  112 ...
    t1 = _mm512_shuffle_i32x4(r1, r5, 0x88); //   1  17  33  49 ...
    t2 = _mm512_shuffle_i32x4(r2, r6, 0x88); //   2  18  34  50 ...
    t3 = _mm512_shuffle_i32x4(r3, r7, 0x88); //   3  19  35  51 ...
    t4 = _mm512_shuffle_i32x4(r0, r4, 0xdd); //   4  20  36  52 ...
    t5 = _mm512_shuffle_i32x4(r1, r5, 0xdd); //   5  21  37  53 ...
    t6 = _mm512_shuffle_i32x4(r2, r6, 0xdd); //   6  22  38  54 ...
    t7 = _mm512_shuffle_i32x4(r3, r7, 0xdd); //   7  23  39  55 ...
    t8 = _mm512_shuffle_i32x4(r8, rc, 0x88); // 128 144 160 176 ...
    t9 = _mm512_shuffle_i32x4(r9, rd, 0x88); // 129 145 161 177 ...
    ta = _mm512_shuffle_i32x4(ra, re, 0x88); // 130 146 162 178 ...
    tb = _mm512_shuffle_i32x4(rb, rf, 0x88); // 131 147 163 179 ...
    tc = _mm512_shuffle_i32x4(r8, rc, 0xdd); // 132 148 164 180 ...
    td = _mm512_shuffle_i32x4(r9, rd, 0xdd); // 133 149 165 181 ...
    te = _mm512_shuffle_i32x4(ra, re, 0xdd); // 134 150 166 182 ...
    tf = _mm512_shuffle_i32x4(rb, rf, 0xdd); // 135 151 167 183 ...

    r0 = _mm512_shuffle_i32x4(t0, t8, 0x88); //   0  16  32  48  64  80  96 112 ... 240
    r1 = _mm512_shuffle_i32x4(t1, t9, 0x88); //   1  17  33  49  66  81  97 113 ... 241
    r2 = _mm512_shuffle_i32x4(t2, ta, 0x88); //   2  18  34  50  67  82  98 114 ... 242
    r3 = _mm512_shuffle_i32x4(t3, tb, 0x88); //   3  19  35  51  68  83  99 115 ... 243
    r4 = _mm512_shuffle_i32x4(t4, tc, 0x88); //   4 ...
    r5 = _mm512_shuffle_i32x4(t5, td, 0x88); //   5 ...
    r6 = _mm512_shuffle_i32x4(t6, te, 0x88); //   6 ...
    r7 = _mm512_shuffle_i32x4(t7, tf, 0x88); //   7 ...
    r8 = _mm512_shuffle_i32x4(t0, t8, 0xdd); //   8 ...
    r9 = _mm512_shuffle_i32x4(t1, t9, 0xdd); //   9 ...
    ra = _mm512_shuffle_i32x4(t2, ta, 0xdd); //  10 ...
    rb = _mm512_shuffle_i32x4(t3, tb, 0xdd); //  11 ...
    rc = _mm512_shuffle_i32x4(t4, tc, 0xdd); //  12 ...
    rd = _mm512_shuffle_i32x4(t5, td, 0xdd); //  13 ...
    re = _mm512_shuffle_i32x4(t6, te, 0xdd); //  14 ...
    rf = _mm512_shuffle_i32x4(t7, tf, 0xdd); //  15  31  47  63  79  96 111 127 ... 255

    _mm512_store_epi32(&matT[ 0*16], r0);
    _mm512_store_epi32(&matT[ 1*16], r1);
    _mm512_store_epi32(&matT[ 2*16], r2);
    _mm512_store_epi32(&matT[ 3*16], r3);
    _mm512_store_epi32(&matT[ 4*16], r4);
    _mm512_store_epi32(&matT[ 5*16], r5);
    _mm512_store_epi32(&matT[ 6*16], r6);
    _mm512_store_epi32(&matT[ 7*16], r7);
    _mm512_store_epi32(&matT[ 8*16], r8);
    _mm512_store_epi32(&matT[ 9*16], r9);
    _mm512_store_epi32(&matT[10*16], ra);
    _mm512_store_epi32(&matT[11*16], rb);
    _mm512_store_epi32(&matT[12*16], rc);
    _mm512_store_epi32(&matT[13*16], rd);
    _mm512_store_epi32(&matT[14*16], re);
    _mm512_store_epi32(&matT[15*16], rf);   
    }
}

void gather(int *mat, int *matT, int rep) {
    int i,j;
    int index[16] __attribute__((aligned(64)));

    __m512i vindex;

    for(i=0; i<16; i++) index[i] = 16*i;
    for(i=0; i<256; i++) mat[i] = i;
    vindex = _mm512_load_epi32(index);

    for(i=0; i<rep; i++) {
        _mm512_store_epi32(&matT[ 0*16], _mm512_i32gather_epi32(vindex, &mat[ 0], 4));
        _mm512_store_epi32(&matT[ 1*16], _mm512_i32gather_epi32(vindex, &mat[ 1], 4));
        _mm512_store_epi32(&matT[ 2*16], _mm512_i32gather_epi32(vindex, &mat[ 2], 4));
        _mm512_store_epi32(&matT[ 3*16], _mm512_i32gather_epi32(vindex, &mat[ 3], 4));
        _mm512_store_epi32(&matT[ 4*16], _mm512_i32gather_epi32(vindex, &mat[ 4], 4));
        _mm512_store_epi32(&matT[ 5*16], _mm512_i32gather_epi32(vindex, &mat[ 5], 4));
        _mm512_store_epi32(&matT[ 6*16], _mm512_i32gather_epi32(vindex, &mat[ 6], 4));
        _mm512_store_epi32(&matT[ 7*16], _mm512_i32gather_epi32(vindex, &mat[ 7], 4));
        _mm512_store_epi32(&matT[ 8*16], _mm512_i32gather_epi32(vindex, &mat[ 8], 4));
        _mm512_store_epi32(&matT[ 9*16], _mm512_i32gather_epi32(vindex, &mat[ 9], 4));
        _mm512_store_epi32(&matT[10*16], _mm512_i32gather_epi32(vindex, &mat[10], 4));
        _mm512_store_epi32(&matT[11*16], _mm512_i32gather_epi32(vindex, &mat[11], 4));
        _mm512_store_epi32(&matT[12*16], _mm512_i32gather_epi32(vindex, &mat[12], 4));
        _mm512_store_epi32(&matT[13*16], _mm512_i32gather_epi32(vindex, &mat[13], 4));
        _mm512_store_epi32(&matT[14*16], _mm512_i32gather_epi32(vindex, &mat[14], 4));
        _mm512_store_epi32(&matT[15*16], _mm512_i32gather_epi32(vindex, &mat[15], 4));
    }
}

int verify(int *mat) {
    int i,j;
    int error = 0;
    for(i=0; i<16; i++) {
      for(j=0; j<16; j++) {
        if(mat[j*16+i] != i*16+j) error++;
      }
    }
    return error;
}

void print_mat(int *mat) {
    int i,j;
    for(i=0; i<16; i++) {
      for(j=0; j<16; j++) printf("%2X ", mat[i*16+j]);
      puts("");
    }
    puts("");
}

int main(void) {
    int i,j, rep;
    int mat[256] __attribute__((aligned(64)));
    int matT[256] __attribute__((aligned(64)));
    double dtime;

    rep = 10000000;
    for(i=0; i<256; i++) mat[i] = i;
    print_mat(mat);

    gather(mat, matT,1);
    for(i=0; i<256; i++) mat[i] = i;
    dtime = -omp_get_wtime();
    gather(mat, matT, rep);
    dtime += omp_get_wtime();
    printf("errors %d\n", verify(matT));
    printf("dtime %f\n", dtime);
    print_mat(matT);

    tran(mat,matT,1);
    dtime = -omp_get_wtime();
    tran(mat, matT, rep);
    dtime += omp_get_wtime();
    printf("errors %d\n", verify(matT));
    printf("dtime %f\n", dtime);
    print_mat(matT);
}

The gather function took 1.13 s and the tran function 0.8 s.


According to Agner Fog's microarchitecture manual, shuffle and permute instructions have poor performance on KNL. The shuffle and unpack instructions used in my original answer, https://stackoverflow.com/a/29587984/2542702, have a reciprocal throughput of 2. I managed to greatly improve the performance by using vpermq instead, which has a reciprocal throughput of 1. In addition, I improved the first quarter of the transpose using vinserti64x4 (see tran_new2 below). The tran function now takes 0.8 s and the tran_new2 function 0.46 s.
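
To see why a masked one-source permute can stand in for a two-source unpack or shuffle, here is a minimal standalone check of the identity behind the substitution (any AVX-512F machine will do; the variable names are only for illustration):

#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <x86intrin.h>

// vpermq with an index vector and a merge mask (reciprocal throughput 1 on
// KNL) reproduces the two-source unpack (reciprocal throughput 2 on KNL).
int main(void) {
    int64_t a[8], b[8], u[8], p[8];
    int64_t idx[8] = {1, 0, 3, 2, 5, 4, 7, 6};
    for (int i = 0; i < 8; i++) { a[i] = i; b[i] = 100 + i; }

    __m512i va   = _mm512_loadu_si512(a);
    __m512i vb   = _mm512_loadu_si512(b);
    __m512i vidx = _mm512_loadu_si512(idx);

    _mm512_storeu_si512(u, _mm512_unpacklo_epi64(va, vb));
    _mm512_storeu_si512(p, _mm512_mask_permutexvar_epi64(va, 0xaa, vidx, vb));

    printf("%s\n", memcmp(u, p, sizeof u) ? "differ" : "identical");
    return 0;
}

tran_new2 below applies the same idea at 128-bit, 64-bit, and 32-bit granularity, with the 256-bit step folded into the loads.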

void tran_new2(int* mat, int* matT, int rep) {
  __m512i t0, t1, t2, t3, t4, t5, t6, t7, t8, t9, ta, tb, tc, td, te, tf;
  __m512i r0, r1, r2, r3, r4, r5, r6, r7, r8, r9, ra, rb, rc, rd, re, rf;

  int mask;
  int64_t idx1[8] __attribute__((aligned(64))) = {2, 3, 0, 1, 6, 7, 4, 5}; 
  int64_t idx2[8] __attribute__((aligned(64))) = {1, 0, 3, 2, 5, 4, 7, 6}; 
  int32_t idx3[16] __attribute__((aligned(64))) = {1, 0, 3, 2, 5 ,4 ,7 ,6 ,9 ,8 , 11, 10, 13, 12 ,15, 14};
  __m512i vidx1 = _mm512_load_epi64(idx1);
  __m512i vidx2 = _mm512_load_epi64(idx2);
  __m512i vidx3 = _mm512_load_epi32(idx3);

  int i;

  for(i=0; i<rep; i++) {
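  // The 16 half-row loads below pair row i with row i+8 in a single zmm
  // register, so the 256-bit step of the transpose is done by the loads;
  // every permute that follows stays within a 256-bit half.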

  t0 = _mm512_inserti64x4(_mm512_castsi256_si512(_mm256_load_si256((__m256i*)&mat[ 0*16+0])), _mm256_load_si256((__m256i*)&mat[ 8*16+0]), 1);
  t1 = _mm512_inserti64x4(_mm512_castsi256_si512(_mm256_load_si256((__m256i*)&mat[ 1*16+0])), _mm256_load_si256((__m256i*)&mat[ 9*16+0]), 1);
  t2 = _mm512_inserti64x4(_mm512_castsi256_si512(_mm256_load_si256((__m256i*)&mat[ 2*16+0])), _mm256_load_si256((__m256i*)&mat[10*16+0]), 1);
  t3 = _mm512_inserti64x4(_mm512_castsi256_si512(_mm256_load_si256((__m256i*)&mat[ 3*16+0])), _mm256_load_si256((__m256i*)&mat[11*16+0]), 1);
  t4 = _mm512_inserti64x4(_mm512_castsi256_si512(_mm256_load_si256((__m256i*)&mat[ 4*16+0])), _mm256_load_si256((__m256i*)&mat[12*16+0]), 1);
  t5 = _mm512_inserti64x4(_mm512_castsi256_si512(_mm256_load_si256((__m256i*)&mat[ 5*16+0])), _mm256_load_si256((__m256i*)&mat[13*16+0]), 1);
  t6 = _mm512_inserti64x4(_mm512_castsi256_si512(_mm256_load_si256((__m256i*)&mat[ 6*16+0])), _mm256_load_si256((__m256i*)&mat[14*16+0]), 1);
  t7 = _mm512_inserti64x4(_mm512_castsi256_si512(_mm256_load_si256((__m256i*)&mat[ 7*16+0])), _mm256_load_si256((__m256i*)&mat[15*16+0]), 1);

  t8 = _mm512_inserti64x4(_mm512_castsi256_si512(_mm256_load_si256((__m256i*)&mat[ 0*16+8])), _mm256_load_si256((__m256i*)&mat[ 8*16+8]), 1);
  t9 = _mm512_inserti64x4(_mm512_castsi256_si512(_mm256_load_si256((__m256i*)&mat[ 1*16+8])), _mm256_load_si256((__m256i*)&mat[ 9*16+8]), 1);
  ta = _mm512_inserti64x4(_mm512_castsi256_si512(_mm256_load_si256((__m256i*)&mat[ 2*16+8])), _mm256_load_si256((__m256i*)&mat[10*16+8]), 1);
  tb = _mm512_inserti64x4(_mm512_castsi256_si512(_mm256_load_si256((__m256i*)&mat[ 3*16+8])), _mm256_load_si256((__m256i*)&mat[11*16+8]), 1);
  tc = _mm512_inserti64x4(_mm512_castsi256_si512(_mm256_load_si256((__m256i*)&mat[ 4*16+8])), _mm256_load_si256((__m256i*)&mat[12*16+8]), 1);
  td = _mm512_inserti64x4(_mm512_castsi256_si512(_mm256_load_si256((__m256i*)&mat[ 5*16+8])), _mm256_load_si256((__m256i*)&mat[13*16+8]), 1);
  te = _mm512_inserti64x4(_mm512_castsi256_si512(_mm256_load_si256((__m256i*)&mat[ 6*16+8])), _mm256_load_si256((__m256i*)&mat[14*16+8]), 1);
  tf = _mm512_inserti64x4(_mm512_castsi256_si512(_mm256_load_si256((__m256i*)&mat[ 7*16+8])), _mm256_load_si256((__m256i*)&mat[15*16+8]), 1);

  mask= 0xcc;
  r0 = _mm512_mask_permutexvar_epi64(t0, (__mmask8)mask, vidx1, t4);
  r1 = _mm512_mask_permutexvar_epi64(t1, (__mmask8)mask, vidx1, t5);
  r2 = _mm512_mask_permutexvar_epi64(t2, (__mmask8)mask, vidx1, t6);
  r3 = _mm512_mask_permutexvar_epi64(t3, (__mmask8)mask, vidx1, t7);
  r8 = _mm512_mask_permutexvar_epi64(t8, (__mmask8)mask, vidx1, tc);
  r9 = _mm512_mask_permutexvar_epi64(t9, (__mmask8)mask, vidx1, td);
  ra = _mm512_mask_permutexvar_epi64(ta, (__mmask8)mask, vidx1, te);
  rb = _mm512_mask_permutexvar_epi64(tb, (__mmask8)mask, vidx1, tf);

  mask= 0x33;
  r4 = _mm512_mask_permutexvar_epi64(t4, (__mmask8)mask, vidx1, t0);
  r5 = _mm512_mask_permutexvar_epi64(t5, (__mmask8)mask, vidx1, t1);
  r6 = _mm512_mask_permutexvar_epi64(t6, (__mmask8)mask, vidx1, t2);
  r7 = _mm512_mask_permutexvar_epi64(t7, (__mmask8)mask, vidx1, t3);
  rc = _mm512_mask_permutexvar_epi64(tc, (__mmask8)mask, vidx1, t8);
  rd = _mm512_mask_permutexvar_epi64(td, (__mmask8)mask, vidx1, t9);
  re = _mm512_mask_permutexvar_epi64(te, (__mmask8)mask, vidx1, ta);
  rf = _mm512_mask_permutexvar_epi64(tf, (__mmask8)mask, vidx1, tb);

  mask = 0xaa;
  t0 = _mm512_mask_permutexvar_epi64(r0, (__mmask8)mask, vidx2, r2);
  t1 = _mm512_mask_permutexvar_epi64(r1, (__mmask8)mask, vidx2, r3);
  t4 = _mm512_mask_permutexvar_epi64(r4, (__mmask8)mask, vidx2, r6);
  t5 = _mm512_mask_permutexvar_epi64(r5, (__mmask8)mask, vidx2, r7);
  t8 = _mm512_mask_permutexvar_epi64(r8, (__mmask8)mask, vidx2, ra);
  t9 = _mm512_mask_permutexvar_epi64(r9, (__mmask8)mask, vidx2, rb);
  tc = _mm512_mask_permutexvar_epi64(rc, (__mmask8)mask, vidx2, re);
  td = _mm512_mask_permutexvar_epi64(rd, (__mmask8)mask, vidx2, rf);

  mask = 0x55;
  t2 = _mm512_mask_permutexvar_epi64(r2, (__mmask8)mask, vidx2, r0);
  t3 = _mm512_mask_permutexvar_epi64(r3, (__mmask8)mask, vidx2, r1);
  t6 = _mm512_mask_permutexvar_epi64(r6, (__mmask8)mask, vidx2, r4);
  t7 = _mm512_mask_permutexvar_epi64(r7, (__mmask8)mask, vidx2, r5);
  ta = _mm512_mask_permutexvar_epi64(ra, (__mmask8)mask, vidx2, r8);
  tb = _mm512_mask_permutexvar_epi64(rb, (__mmask8)mask, vidx2, r9);
  te = _mm512_mask_permutexvar_epi64(re, (__mmask8)mask, vidx2, rc);
  tf = _mm512_mask_permutexvar_epi64(rf, (__mmask8)mask, vidx2, rd);

  mask = 0xaaaa;
  r0 = _mm512_mask_permutexvar_epi32(t0, (__mmask16)mask, vidx3, t1);
  r2 = _mm512_mask_permutexvar_epi32(t2, (__mmask16)mask, vidx3, t3);
  r4 = _mm512_mask_permutexvar_epi32(t4, (__mmask16)mask, vidx3, t5);
  r6 = _mm512_mask_permutexvar_epi32(t6, (__mmask16)mask, vidx3, t7);
  r8 = _mm512_mask_permutexvar_epi32(t8, (__mmask16)mask, vidx3, t9);
  ra = _mm512_mask_permutexvar_epi32(ta, (__mmask16)mask, vidx3, tb);
  rc = _mm512_mask_permutexvar_epi32(tc, (__mmask16)mask, vidx3, td);
  re = _mm512_mask_permutexvar_epi32(te, (__mmask16)mask, vidx3, tf);    

  mask = 0x5555;
  r1 = _mm512_mask_permutexvar_epi32(t1, (__mmask16)mask, vidx3, t0);
  r3 = _mm512_mask_permutexvar_epi32(t3, (__mmask16)mask, vidx3, t2);
  r5 = _mm512_mask_permutexvar_epi32(t5, (__mmask16)mask, vidx3, t4);
  r7 = _mm512_mask_permutexvar_epi32(t7, (__mmask16)mask, vidx3, t6);
  r9 = _mm512_mask_permutexvar_epi32(t9, (__mmask16)mask, vidx3, t8);  
  rb = _mm512_mask_permutexvar_epi32(tb, (__mmask16)mask, vidx3, ta);  
  rd = _mm512_mask_permutexvar_epi32(td, (__mmask16)mask, vidx3, tc);
  rf = _mm512_mask_permutexvar_epi32(tf, (__mmask16)mask, vidx3, te);

  _mm512_store_epi32(&matT[ 0*16], r0);
  _mm512_store_epi32(&matT[ 1*16], r1);
  _mm512_store_epi32(&matT[ 2*16], r2);
  _mm512_store_epi32(&matT[ 3*16], r3);
  _mm512_store_epi32(&matT[ 4*16], r4);
  _mm512_store_epi32(&matT[ 5*16], r5);
  _mm512_store_epi32(&matT[ 6*16], r6);
  _mm512_store_epi32(&matT[ 7*16], r7);
  _mm512_store_epi32(&matT[ 8*16], r8);
  _mm512_store_epi32(&matT[ 9*16], r9);
  _mm512_store_epi32(&matT[10*16], ra);
  _mm512_store_epi32(&matT[11*16], rb);
  _mm512_store_epi32(&matT[12*16], rc);
  _mm512_store_epi32(&matT[13*16], rd);
  _mm512_store_epi32(&matT[14*16], re);
  _mm512_store_epi32(&matT[15*16], rf);
  int* tmp = mat;
  mat = matT;
  matT = tmp;
  }
}
Z boson
  • 2
    Nice! In your previous answer you write that an 8x8 transpose +r/w uses 40 instructions. That is: 8 loads, 24 shuffles on execution port 5 and 8 stores. In Intel's document 64-ia-32-architectures-optimization-manual, paragraph 11.11.2 they replace 8 of these shuffles by 8 `vinsertf128` instructions with a memory operand. This leads to less port 5 pressure: 16 instructions on port 5. In fact the massive L1 bandwidth is used to reduce the bottleneck on port 5. The result is a faster algorithm. Do you think that you can use a similar idea here to speed up the 16x16 transpose? – wim Dec 21 '16 at 13:49
  • Not enough space is my previous comment: The document link is [here](http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf) – wim Dec 21 '16 at 13:54
  • 1
    @wim thank you so much for the link! I looked at it quickly. When I created the 8x8 answer I did not think about port pressure just number of instructions. I will have to look into this and get back to you. – Z boson Dec 21 '16 at 14:07
  • 1
    @wim: Nice idea. But based on Agner Fog's tables, I think KNL's `vinsert` with a memory source still needs the shuffle unit. It's based on Silvermont, very different from Haswell. Agner Fog's tables don't list a port for `vinsertf128` or the AVX512 variants of it, but like Haswell there appears to be only one shuffle unit. It's on FP0. `vinsertf32x4 z,z,m128/m256` are both one-per-clock throughput, not one per 0.5c like loads, so they might still be using the shuffle unit. Broadcasts are handled entirely by the load port, so `vbroadcastf64x4 z,m256` has one per 0.5c throughput. – Peter Cordes Dec 21 '16 at 20:40
  • By comparison, SKL's `vinsertf128 y,y,m128,i` is 2 uops: load and p015. It has one per 0.5c throughput. The register-source version is 1 uop, but only runs on port 5. Maybe the memory version uses a broadcast-load uop so the ALU uop can be a blend instead of a shuffle. Large-granularity blends can run on every ALU port. This makes it better than a micro-fused single-uop shuffle for most cases. – Peter Cordes Dec 21 '16 at 20:44
  • 1
    @PeterCordes Indeed, there is no port 5 on KNL. Shuffles go to the FP0 unit. From Agner's manual it is not clear which resources `vinsertf64x4` uses. But at least we can emulate KNL's `vinsertf64x4` by a `vbroadcastf64x4` load from memory plus a `vblendmpd`, which both have a throughput of one per 0.5c according to Agner Fog. `vblendmpd` runs on FP0 or FP1. So, as far as I can see (I am quite unfamiliar with KNL, I just started reading Agner's instruction tables on KNL), in two cycles we can do 2 shuffles on FP0 and an emulated `vinsertf64x4` on the memory port and on FP1. – wim Dec 21 '16 at 22:29
  • 1
    @wim: oh right, good point that we can broadcast and blend separately. Presumably they didn't implement `vinsert` that way internally because even 2-uop instructions are microcoded on KNL and have horrible front-end throughput. (one per 7 clocks). All I know about KNL micro-optimization comes from Agner Fog's microarch guide + tables, so we're in the same boat on this one. – Peter Cordes Dec 21 '16 at 22:34
  • @PeterCordes That's right! By the way, I just noticed that the throughput of vshufi32x4 is only one instruction per two cycles instead of my previous assumption of one per cycle – wim Dec 21 '16 at 22:41
  • @wim: yeah, it looks like some shuffles aren't fully pipelined. According to Agner's docs, the frontend will probably still be the main bottleneck in a lot of cases (up to two 1-uop instructions per clock from up to 16 bytes of code. 7c to decode multi-uop instructions, and some other possible stalls). That makes it reasonable to not go nuts with the transistor budget on building super-fast shuffle units in KNL. Decoding can hardly keep the OOO core fed if there are any non-vector-ALU instructions mixed in (pure loads, pure stores, or integer stuff). – Peter Cordes Dec 21 '16 at 22:53
  • 1
    So IDK if emulating a 1-uop `vinsert` with a 2-instruction sequence is worth it. If this code is bottlenecked on the front-end, then probably not. But if it isn't, then it could win. (And this might not be frontend bottlenecked, if it's mostly 1-uop shuffles and some of them aren't fully pipelined). The out-of-order window is *much* smaller than on SKL, so hopefully the compiler does a good job at scheduling those loads, e.g. mixing in some shuffles before emitting all 16 loads. – Peter Cordes Dec 21 '16 at 23:00
  • @PeterCordes, I think it may be more accurate to say KNL is based on Airmont. I am not 100% sure but I think Airmont is more than just a die shrink of Silvermont (22nm to 14nm). Agner says that "Knights Landing has a better out-of-order pipeline than ... the Silvermont that cannot execute floating point and vector instructions out of order". Maybe Airmont also has this feature? – Z boson Dec 22 '16 at 09:56
  • @PeterCordes, according to [this link](https://en.wikichip.org/wiki/intel/microarchitectures/airmont) it's just a die shrink and a GPU upgrade. – Z boson Dec 22 '16 at 09:59
  • 1
    @PeterCordes, okay the more I read about KNL I realized that it's its own arch. E.g. it has no loop buffer unlike Silvermont. So it's most similar to Silvermont but nevertheless has important differences. – Z boson Dec 22 '16 at 10:05
  • Interesting that they'd drop the loop buffer, I hadn't noticed the Silvermont had one. I wonder if this was a trade off of giving up power savings to save die area for more cores per chip? Maybe they figure most vector loops will be too big? Or maybe the uop format had to change to support AVX512, and making the loop buf support that was inconvenient? Or maybe it would have sucked with 4-way SMT? Sounds like it would be *really* nice to have a loop buf for the non-hyperthread case, though, since frontend bottlenecks are probably common with loop overhead + loads and stores. – Peter Cordes Dec 22 '16 at 10:27
  • @PeterCordes If we neglect the loads and stores, then there are 64 shuffles which take at least: (64 shuffles * 2 cycles/shuffle on FP0 * 10e6 repeats) / 1.6e9 Hz = 0.8 seconds (turbo boost cpu freq. = 1.6GHz). Accidentally(?) `tran` also takes 0.8 seconds, which indicates that there might be a shuffle bottleneck on FP0. All other instructions are probably executed on other execution ports. – wim Dec 22 '16 at 14:26
  • As far as I can see, all instructions involved here are 1-uop instructions. The frontend can do at most two 1-uop instructions per clock. In the compiled code (function `tran` in the Godbolt compiler explorer, icc 17 -O3), I counted 97 1-uop instructions per loop (some memory addresses are read twice by vpunpck and not by any vmovups). The total number of cycles for `tran` is 0.8 sec. * 1.6 GHz = 1.28e9 cycles. During these cycles about 97*1e7 instructions are executed. 97e7/1.28e9 = 0.76 instructions per cycle, which is much smaller than the frontend limit of 2. – wim Dec 22 '16 at 14:27
  • 2
    So, the frontend is less likely to be the bottleneck here. Shuffles are relatively expensive on KNL. I still think that it might be possible to speedup `tran` a bit by replacing (e.g.) 16 shuffles (out of 64) by 16 `vinsertf64x4` or, if that doesn't work, by 16 `vbroadcastf64x4`+ 16 `vblendmpd`. – wim Dec 22 '16 at 14:28
  • 1
    @wim, I will try and implement something with `vinsertf64x4` and `vbroadcastf64x4 + vblendmpd` tomorrow. – Z boson Dec 22 '16 at 14:44
  • That would certainly be interesting, if you have time for such testing. At least the discussion above forces us to dig into KNL documents, which learns us something about KNL. :-) – wim Dec 22 '16 at 15:51
  • 1
    I found a version that is dramatically faster using `_mm512_mask_permutexvar_epi64 (and epi32)` which has a throughput of 1 (instead of 2 with the shuffles). I also use `_mm512_mask_broadcast_i64x4` on the first eight lines which has a throughput of 0.5. I will post code later after a few more tests. I was not sure it would be faster because I need a vector index (actually three) and I am already using 32 AVX512 registers which means some spilling but nevertheless it's much faster. – Z boson Dec 26 '16 at 20:02
  • Well, we happen to have a Xeon Phi 7210 in our lab. I just quickly tested your first snippet of code using the same compiler flags you specified. The result is 1.78s vs 1.37s. – lei_z Dec 27 '16 at 03:25
  • @lei.april that sounds right I think. The AVX frequency of the 7210 is 1.1 GHz and the AVX frequency of the 7250 is 1.5 GHz so the 7210 is about 36% slower than the 7250 for AVX code http://www.intel.com/content/www/us/en/processors/xeon/xeon-phi-processor-overview.html – Z boson Dec 27 '16 at 08:37
  • @lei.april I have a more optimized version of the 16x16 transpose for KNL which takes about 0.5 s compared to 0.8 s with the original code. – Z boson Dec 27 '16 at 08:39
  • @wim, I forgot to tag your name in my comment above so you may have missed it. I'll post the code update soon. Changing subjects. I went through example 11-19 in the Intel manual transposing a 8x8 matrix. They use more instructions than me. They use 12 shuffles which they convert to 4 shuffles and eight blends. I use only 8 shuffles. I highly doubt eight blends and 4 shuffles is any better than 8 shuffles. Intel apparently did not realize this could be done in 24 operations. – Z boson Dec 27 '16 at 09:34
  • @wim, I updated my 8x8 transpose answer based on the examples in the Intel Manual http://stackoverflow.com/a/25627536/2542702 – Z boson Dec 27 '16 at 12:11
  • 1
    @wim, okay I updated this answer using `vinsertf64x4`. – Z boson Dec 27 '16 at 13:17
  • @Zboson This is a great update! It also surprises me that `_mm512_mask_permutexvar_epi64` is so much faster than the shuffles. From 0.8s to 0.46s is quite substantial. – wim Dec 28 '16 at 09:29
  • Oh I just realized that this comment thread had covered almost the same thing I just posted under the first answer. lol – Mysticial Jan 20 '17 at 04:45
  • To expand on my comments under the other answer, you do not need permute vectors for anything that stays within a 256-bit half. (thanks to `_mm512_mask_shuffle_epi32()` and `_mm512_mask_permutex_epi64`) And since you're doing the 256-wide part using loads, you actually don't need any permute vectors at all. Also some of the mask constants can be merged. For example, `0x3333` can be used to cover both 128-bit and 64-bit granularities if you do the 128-bit one using `_mm512_mask_permutex_epi64()` and the 64-bit one using `_mm512_mask_shuffle_epi32`. – Mysticial Jan 20 '17 at 05:01
  • Is `tran_new2` still the fastest method? Also, how about when you have already loaded the matrix into 16 zmm registers instead of from a memory location? I see that `tran_new2` is essentially the same as what Intel uses in https://github.com/intel/intel-ipsec-mb/blob/main/lib/include/transpose_avx512.asm#L204 which also relies on loading in this same layout. How about when the matrix is already filled in registers with the layout ri = {a_{i,15}, a_{i,14},..., a_{i,1}, a_{i,0}}? – potuz Dec 31 '21 at 15:58