
I have an 8x16 uint32_t matrix already loaded into 8 zmm registers. They have the layout

zmmi = {ai_15, ai_14, ..., ai_1, ai_0}

where i goes from 0 to 7 and the ai_j are 32-bit integers for each j from 0 to 15. I am looking for a fast way of transforming the layout of these registers. What I want is not strictly a transpose of the matrix: the end result should be a buffer in memory containing the 128 = 8 x 16 integers ai_j in the following layout

a0_0, a1_0, a2_0,..., a7_0, a0_1, a1_1, ..., a7_1,...,a0_15, a1_15, ...,a7_15
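
In plain C terms, the layout I am after is out[8*j + i] = ai_j. A scalar reference, just to pin down the target (a and out are placeholder names for this sketch):

#include <stdint.h>

/* Scalar reference for the target layout only; a and out are placeholder names. */
static void reference_layout(const uint32_t a[8][16], uint32_t out[128])
{
    for (int j = 0; j < 16; j++)        /* position within a register        */
        for (int i = 0; i < 8; i++)     /* register number, zmm0 ... zmm7    */
            out[8 * j + i] = a[i][j];
}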

I am aware of the 16x16 transpositions here and in Intel's implementation, but both rely on a trick of loading the matrix from memory in a precise layout to avoid 128-bit lane crossings. I cannot change the fact that I already have the matrix loaded in that particular way, in 8 registers, as described above.

I can do this "transposition" in registers with 8 vshufps (throughput 1, latency 1 on Xeon KNL) and 16 vpermi2d (throughput 1, latency 3), together with the loads of their index vectors. A quick napkin computation convinced me that this is the fastest way of doing it in registers before storing to memory, and also that it is faster than trading some of the vpermi2d for scatter stores. The handwaving argument is the following: I start with registers containing 1 letter and 16 indices (the macro below clarifies this notation) and I want to end up with registers containing 8 letters and 2 indices. Each vshufps or vpermi2d can at best halve the number of indices and double the number of letters, but I am forced to cross 128-bit lanes at least twice to separate contiguous indices; thus I am forced to have two rounds of non-simple shuffles.

I am hoping someone here can point me to a better way of doing this, or confirm that this is alright. Here's my macro in GNU assembler. The 8 arguments r0--r7 contain the 8 zmm registers above, the 4 arguments t0--t3 are temporary clobbered zmm registers (it may be possible with fewer, but I don't care about that), and m0 and m1 are preloaded masks:

.LPSHUFFLE_TRANSPOSE_MASK3:
    .long   0x00000000, 0x00000002, 0x00000010, 0x00000012
    .long   0x00000001, 0x00000003, 0x00000011, 0x00000013
    .long   0x00000004, 0x00000006, 0x00000014, 0x00000016
    .long   0x00000005, 0x00000007, 0x00000015, 0x00000017
                
.LPSHUFFLE_TRANSPOSE_MASK4:
    .long   0x00000008, 0x0000000a, 0x00000018, 0x0000001a
    .long   0x00000009, 0x0000000b, 0x00000019, 0x0000001b
    .long   0x0000000c, 0x0000000e, 0x0000001c, 0x0000001e
    .long   0x0000000d, 0x0000000f, 0x0000001d, 0x0000001f

# Input
#
# r0 = {a15 a14 a13 a12 a11 a10 a09 a08 a07 a06 a05 a04 a03 a02 a01 a00}
# r1 = {b15 b14 b13 b12 b11 b10 b09 b08 b07 b06 b05 b04 b03 b02 b01 b00}
# r2 = {c15 c14 c13 c12 c11 c10 c09 c08 c07 c06 c05 c04 c03 c02 c01 c00}
# r3 = {d15 d14 d13 d12 d11 d10 d09 d08 d07 d06 d05 d04 d03 d02 d01 d00}
# r4 = {e15 e14 e13 e12 e11 e10 e09 e08 e07 e06 e05 e04 e03 e02 e01 e00}
# r5 = {f15 f14 f13 f12 f11 f10 f09 f08 f07 f06 f05 f04 f03 f02 f01 f00}
# r6 = {g15 g14 g13 g12 g11 g10 g09 g08 g07 g06 g05 g04 g03 g02 g01 g00}
# r7 = {h15 h14 h13 h12 h11 h10 h09 h08 h07 h06 h05 h04 h03 h02 h01 h00}
# 
# OUTPUT:
# 
# r0 = {h01 g01 f01 e01 d01 c01 b01 a01 h00 g00 f00 e00 d00 c00 b00 a00}
# r1 = {h03 g03 f03 e03 d03 c03 b03 a03 h02 g02 f02 e02 d02 c02 b02 a02}
# r2 = {h05 g05 f05 e05 d05 c05 b05 a05 h04 g04 f04 e04 d04 c04 b04 a04}
# r3 = {h07 g07 f07 e07 d07 c07 b07 a07 h06 g06 f06 e06 d06 c06 b06 a06}
# r4 = {h09 g09 f09 e09 d09 c09 b09 a09 h08 g08 f08 e08 d08 c08 b08 a08}
# r5 = {h11 g11 f11 e11 d11 c11 b11 a11 h10 g10 f10 e10 d10 c10 b10 a10}
# r6 = {h13 g13 f13 e13 d13 c13 b13 a13 h12 g12 f12 e12 d12 c12 b12 a12}
# r7 = {h15 g15 f15 e15 d15 c15 b15 a15 h14 g14 f14 e14 d14 c14 b14 a14}
#
# m0 and m1 come already loaded with .LPSHUFFLE_TRANSPOSE_MASK3 and
# .LPSHUFFLE_TRANSPOSE_MASK4
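# (MASK3 gathers the low 8 dwords of each of the two vpermi2d sources, with
# indices 0x10..0x17 selecting the second source; MASK4 gathers the high 8
# dwords. The pairs are laid out so that the vshufps round afterwards only
# combines elements that already sit in the same 128-bit lane.)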
.macro TRANSPOSE_8x16_U32 r0, r1, r2, r3, r4, r5, r6, r7,\
            t0, t1, t2, t3, m0, m1

    # Permutations: 2 letters, 8 indices
    vmovdqa32   \t0, \m0
    vmovdqa32   \t1, \m0
    vpermi2d    \t0, \r0, \r4   // t0 = {e7 e5 a7 a5 e6 e4 a6 a4 e3 e1 a3 a1 e2 e0 a2 a0}
    vpermi2d    \t1, \r1, \r5   // t1 = {f7 f5 b7 b5 f6 f4 b6 b4 f3 f1 b3 b1 f2 f0 b2 b0}
    vmovdqa32   \t2, \m1
    vmovdqa32   \t3, \m1
    vpermi2d    \t2, \r0, \r4   // t2 = {e15 e13 a15 a13 e14 e12 a14 a12 e11 e9 a11 a9 e10 e8 a10 a8}
    vpermi2d    \t3, \r1, \r5   // t3 = {f15 f13 b15 b13 f14 f12 b14 b12 f11 f9 b11 b9 f10 f8 b10 b8}

    vmovdqa32   \r0, \m0
    vmovdqa32   \r1, \m0
    vpermi2d    \r0, \r2, \r6   // r0 = {g7 g5 c7 c5 g6 g4 c6 c4 g3 g1 c3 c1 g2 g0 c2 c0}
    vpermi2d    \r1, \r3, \r7   // r1 = {h7 h5 d7 d5 h6 h4 d6 d4 h3 h1 d3 d1 h2 h0 d2 d0}

    vmovdqa32   \r4, \m1
    vmovdqa32   \r5, \m1
    vpermi2d    \r4, \r2, \r6   // r4 = {g15 g13 c15 c13 g14 g12 c14 c12 g11 g9 c11 c9 g10 g8 c10 c8}
    vpermi2d    \r5, \r3, \r7   // r5 = {h15 h13 d15 d13 h14 h12 d14 d12 h11 h9 d11 d9 h10 h8 d10 d8}

    # Simple shuffles: 4 letters, 4 indices
    vshufps     \r6, \t0, \t1, 0x88 // r6 = {f5 b5 e5 a5 f4 b4 e4 a4 f1 b1 e1 a1 f0 b0 e0 a0}
    vshufps     \r7, \t0, \t1, 0xDD // r7 = {f7 b7 e7 a7 f6 b6 e6 a6 f3 b3 e3 a3 f2 b2 e2 a2}
    vshufps     \t1, \t2, \t3, 0x88 // t1 = {f13 b13 e13 a13 f12 b12 e12 a12 f9 b9 e9 a9 f8 b8 e8 a8}
    vshufps     \t0, \t2, \t3, 0xDD // t0 = {f15 b15 e15 a15 f14 b14 e14 a14 f11 b11 e11 a11 f10 b10 e10 a10}
    vshufps     \t2, \r4, \r5, 0x88 // t2 = {h13 d13 g13 c13 h12 d12 g12 c12 h9 d9 g9 c9 h8 d8 g8 c8}
    vshufps     \t3, \r4, \r5, 0xDD // t3 = {h15 d15 g15 c15 h14 d14 g14 c14 h11 d11 g11 c11 h10 d10 g10 c10}
    vshufps     \r4, \r0, \r1, 0x88 // r4 = {h5 d5 g5 c5 h4 d4 g4 c4 h1 d1 g1 c1 h0 d0 g0 c0}
    vshufps     \r5, \r0, \r1, 0xDD // r5 = {h7 d7 g7 c7 h6 d6 g6 c6 h3 d3 g3 c3 h2 d2 g2 c2}

    # Final permutations: 2 letters, 8 indices
    vmovdqa32   \r0, \m0
    vmovdqa32   \r1, \m0 
    vpermi2d    \r0, \r6, \r4   // r0 = {h1 g1 f1 e1 d1 c1 b1 a1 h0 g0 f0 e0 d0 c0 b0 a0}
    vpermi2d    \r1, \r7, \r5   // r1 = {h3 g3 f3 e3 d3 c3 b3 a3 h2 g2 f2 e2 d2 c2 b2 a2}
    vmovdqa32   \r2, \m1
    vmovdqa32   \r3, \m1 
    vpermi2d    \r2, \r6, \r4   // r2 = {h5 g5 f5 e5 d5 c5 b5 a5 h4 g4 f4 e4 d4 c4 b4 a4}
    vpermi2d    \r3, \r7, \r5   // r3 = {h7 g7 f7 e7 d7 c7 b7 a7 h6 g6 f6 e6 d6 c6 b6 a6}


    vmovdqa32   \r4, \m0
    vmovdqa32   \r5, \m0 
    vpermi2d    \r4, \t1, \t2   // r4 = {h9 g9 f9 e9 d9 c9 b9 a9 h8 g8 f8 e8 d8 c8 b8 a8}
    vpermi2d    \r5, \t0, \t3   // r5 = {h11 g11 f11 e11 d11 c11 b11 a11 h10 g10 f10 e10 d10 c10 b10 a10}
    vmovdqa32   \r6, \m1
    vmovdqa32   \r7, \m1 
    vpermi2d    \r6, \t1, \t2   // r6 = {h13 g13 f13 e13 d13 c13 b13 a13 h12 g12 f12 e12 d12 c12 b12 a12}
    vpermi2d    \r7, \t0, \t3   // r7 = {h15 g15 f15 e15 d15 c15 b15 a15 h14 g14 f14 e14 d14 c14 b14 a14}
.endm   
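
For reference, roughly the same three-round network written with C intrinsics (an untested sketch meant only to mirror the macro above; the function and variable names are mine, and the eight final stores to the output buffer are folded in):

#include <immintrin.h>
#include <stdint.h>

/* Untested intrinsics sketch mirroring the three rounds of the macro above,
 * with the eight final stores folded in. Names are mine, not from the macro. */
static void transpose_8x16_u32(__m512i r0, __m512i r1, __m512i r2, __m512i r3,
                               __m512i r4, __m512i r5, __m512i r6, __m512i r7,
                               uint32_t *out)
{
    const __m512i m0 = _mm512_setr_epi32(0x00, 0x02, 0x10, 0x12, 0x01, 0x03, 0x11, 0x13,
                                         0x04, 0x06, 0x14, 0x16, 0x05, 0x07, 0x15, 0x17);
    const __m512i m1 = _mm512_setr_epi32(0x08, 0x0a, 0x18, 0x1a, 0x09, 0x0b, 0x19, 0x1b,
                                         0x0c, 0x0e, 0x1c, 0x1e, 0x0d, 0x0f, 0x1d, 0x1f);

    /* Round 1 (cross-lane vpermi2d): 2 letters, 8 indices per register. */
    __m512i ae_lo = _mm512_permutex2var_epi32(r0, m0, r4);  /* a,e indices 0..7  */
    __m512i bf_lo = _mm512_permutex2var_epi32(r1, m0, r5);  /* b,f indices 0..7  */
    __m512i ae_hi = _mm512_permutex2var_epi32(r0, m1, r4);  /* a,e indices 8..15 */
    __m512i bf_hi = _mm512_permutex2var_epi32(r1, m1, r5);  /* b,f indices 8..15 */
    __m512i cg_lo = _mm512_permutex2var_epi32(r2, m0, r6);  /* c,g indices 0..7  */
    __m512i dh_lo = _mm512_permutex2var_epi32(r3, m0, r7);  /* d,h indices 0..7  */
    __m512i cg_hi = _mm512_permutex2var_epi32(r2, m1, r6);  /* c,g indices 8..15 */
    __m512i dh_hi = _mm512_permutex2var_epi32(r3, m1, r7);  /* d,h indices 8..15 */

    /* Round 2 (in-lane vshufps): 4 letters, 4 indices per register. */
#define SHUF(a, b, imm) _mm512_castps_si512(_mm512_shuffle_ps( \
        _mm512_castsi512_ps(a), _mm512_castsi512_ps(b), (imm)))
    __m512i abef_0145      = SHUF(ae_lo, bf_lo, 0x88);  /* indices 0,1,4,5     */
    __m512i abef_2367      = SHUF(ae_lo, bf_lo, 0xDD);  /* indices 2,3,6,7     */
    __m512i abef_89_1213   = SHUF(ae_hi, bf_hi, 0x88);  /* indices 8,9,12,13   */
    __m512i abef_1011_1415 = SHUF(ae_hi, bf_hi, 0xDD);  /* indices 10,11,14,15 */
    __m512i cdgh_0145      = SHUF(cg_lo, dh_lo, 0x88);
    __m512i cdgh_2367      = SHUF(cg_lo, dh_lo, 0xDD);
    __m512i cdgh_89_1213   = SHUF(cg_hi, dh_hi, 0x88);
    __m512i cdgh_1011_1415 = SHUF(cg_hi, dh_hi, 0xDD);
#undef SHUF

    /* Round 3 (cross-lane vpermi2d): 8 letters, 2 indices per register. */
    __m512i out01   = _mm512_permutex2var_epi32(abef_0145,      m0, cdgh_0145);      /* idx 0,1   */
    __m512i out23   = _mm512_permutex2var_epi32(abef_2367,      m0, cdgh_2367);      /* idx 2,3   */
    __m512i out45   = _mm512_permutex2var_epi32(abef_0145,      m1, cdgh_0145);      /* idx 4,5   */
    __m512i out67   = _mm512_permutex2var_epi32(abef_2367,      m1, cdgh_2367);      /* idx 6,7   */
    __m512i out89   = _mm512_permutex2var_epi32(abef_89_1213,   m0, cdgh_89_1213);   /* idx 8,9   */
    __m512i out1011 = _mm512_permutex2var_epi32(abef_1011_1415, m0, cdgh_1011_1415); /* idx 10,11 */
    __m512i out1213 = _mm512_permutex2var_epi32(abef_89_1213,   m1, cdgh_89_1213);   /* idx 12,13 */
    __m512i out1415 = _mm512_permutex2var_epi32(abef_1011_1415, m1, cdgh_1011_1415); /* idx 14,15 */

    _mm512_storeu_si512(out +   0, out01);
    _mm512_storeu_si512(out +  16, out23);
    _mm512_storeu_si512(out +  32, out45);
    _mm512_storeu_si512(out +  48, out67);
    _mm512_storeu_si512(out +  64, out89);
    _mm512_storeu_si512(out +  80, out1011);
    _mm512_storeu_si512(out +  96, out1213);
    _mm512_storeu_si512(out + 112, out1415);
}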
Right, you definitely don't want scatter stores. If the data starts in memory, it's worth considering loading in 256-bit halves with `vmovups ymm` / `vinsertf32x8 zmm, [mem], 1`. CPUs other than KNL have 2/clock load throughput. (Ice Lake and later have 2/clock stores, but can only commit to cache at 2/clock when a pair of consecutive stores both write the same cache line. 2/clock loads are more flexible and are a thing even on Skylake-avx512 CPUs.) – Peter Cordes Jan 02 '22 at 20:58
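
To illustrate the load pattern the comment suggests, a minimal intrinsics sketch (rowA and rowB are placeholder pointers, not a concrete layout for this problem; vinserti64x4 is used here as the baseline AVX512F integer counterpart of the vinsertf32x8 mentioned above):

#include <immintrin.h>
#include <stdint.h>

/* Build one zmm from two separately loaded 256-bit halves. */
static inline __m512i load_zmm_from_two_halves(const uint32_t *rowA, const uint32_t *rowB)
{
    __m256i lo = _mm256_loadu_si256((const __m256i *)rowA);       /* vmovdqu ymm          */
    __m256i hi = _mm256_loadu_si256((const __m256i *)rowB);
    return _mm512_inserti64x4(_mm512_castsi256_si512(lo), hi, 1); /* vinserti64x4 zmm,..,1 */
}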
