Targeting AVX2, what is the fastest way to transpose an 8x8 matrix containing 64-bit integers (or doubles)?

I searched through this site and found several ways of doing an 8x8 transpose, but mostly for 32-bit floats. So I'm mainly asking because, first, I'm not sure whether the principles that made those algorithms fast readily translate to 64-bit elements, and second, AVX2 apparently has only 16 vector registers, so just loading all the values would take up all the registers.

One way of doing it would be to call _MM_TRANSPOSE4_PD on each of the four 4x4 blocks (a 2x2 grid of blocks), but I was wondering whether this is optimal:

  #define _MM_TRANSPOSE4_PD(row0,row1,row2,row3)                \
        {                                                       \
            __m256d tmp3, tmp2, tmp1, tmp0;                     \
                                                                \
            tmp0 = _mm256_shuffle_pd((row0),(row1), 0x0);       \
            tmp2 = _mm256_shuffle_pd((row0),(row1), 0xF);       \
            tmp1 = _mm256_shuffle_pd((row2),(row3), 0x0);       \
            tmp3 = _mm256_shuffle_pd((row2),(row3), 0xF);       \
                                                                \
            (row0) = _mm256_permute2f128_pd(tmp0, tmp1, 0x20);  \
            (row1) = _mm256_permute2f128_pd(tmp2, tmp3, 0x20);  \
            (row2) = _mm256_permute2f128_pd(tmp0, tmp1, 0x31);  \
            (row3) = _mm256_permute2f128_pd(tmp2, tmp3, 0x31);  \
        }
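
That is, something like this (untested sketch; it assumes a row-major layout and a destination buffer separate from the source, and just swaps the off-diagonal 4x4 blocks):

#include <immintrin.h>

// Untested sketch: transpose each 4x4 block with the macro above and
// store block (bi,bj) at block position (bj,bi). Assumes dst != src.
void transpose8x8( double* dst, const double* src )
{
    for( int bi = 0; bi < 2; bi++ )
        for( int bj = 0; bj < 2; bj++ )
        {
            const double* s = src + bi * 4 * 8 + bj * 4;
            __m256d r0 = _mm256_loadu_pd( s );
            __m256d r1 = _mm256_loadu_pd( s + 8 );
            __m256d r2 = _mm256_loadu_pd( s + 16 );
            __m256d r3 = _mm256_loadu_pd( s + 24 );
            _MM_TRANSPOSE4_PD( r0, r1, r2, r3 );
            double* d = dst + bj * 4 * 8 + bi * 4;
            _mm256_storeu_pd( d, r0 );
            _mm256_storeu_pd( d + 8, r1 );
            _mm256_storeu_pd( d + 16, r2 );
            _mm256_storeu_pd( d + 24, r3 );
        }
}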

Still assuming AVX2, is transposing double[8][8] and int64_t[8][8] largely the same, in principle?

PS: And just out of curiosity, having AVX-512 would change things substantially, correct?

Ecir Hana
    Do your inputs come from memory or registers? Do you need the output in memory or registers, or just process them? (You may not need to explicitly transpose your data at all). – chtz Mar 23 '21 at 15:52
  • Depending on what you want, this could be a duplicate: https://stackoverflow.com/questions/58454741/simd-transpose-when-row-size-is-greater-than-vector-width – chtz Mar 23 '21 at 15:55
  • @chtz They come from memory and should go to memory. Sorry, forgot to add that. – Ecir Hana Mar 23 '21 at 16:03

2 Answers

After some thought and discussion in the comments, I think this is the most efficient version, at least when the source and destination data are in RAM. It does not require AVX2; AVX1 is enough.

The main idea: modern CPUs can do twice as many load micro-ops as stores, and on many CPUs loading into the upper half of a vector with vinsertf128 has the same cost as a regular 16-byte load. Compared to your macro, this version no longer needs those relatively expensive (3 cycles of latency on most CPUs) vperm2f128 shuffles.

struct Matrix4x4
{
    __m256d r0, r1, r2, r3;
};

inline void loadTransposed( Matrix4x4& mat, const double* rsi, size_t stride = 8 )
{
    // Load top half of the matrix into low half of 4 registers
    __m256d t0 = _mm256_castpd128_pd256( _mm_loadu_pd( rsi ) );     // 00, 01
    __m256d t1 = _mm256_castpd128_pd256( _mm_loadu_pd( rsi + 2 ) ); // 02, 03
    rsi += stride;
    __m256d t2 = _mm256_castpd128_pd256( _mm_loadu_pd( rsi ) );     // 10, 11
    __m256d t3 = _mm256_castpd128_pd256( _mm_loadu_pd( rsi + 2 ) ); // 12, 13
    rsi += stride;
    // Load bottom half of the matrix into high half of these registers
    t0 = _mm256_insertf128_pd( t0, _mm_loadu_pd( rsi ), 1 );    // 00, 01, 20, 21
    t1 = _mm256_insertf128_pd( t1, _mm_loadu_pd( rsi + 2 ), 1 );// 02, 03, 22, 23
    rsi += stride;
    t2 = _mm256_insertf128_pd( t2, _mm_loadu_pd( rsi ), 1 );    // 10, 11, 30, 31
    t3 = _mm256_insertf128_pd( t3, _mm_loadu_pd( rsi + 2 ), 1 );// 12, 13, 32, 33

    // Transpose 2x2 blocks in registers.
    // Due to the tricky way we loaded stuff, that's enough to transpose the complete 4x4 matrix.
    mat.r0 = _mm256_unpacklo_pd( t0, t2 ); // 00, 10, 20, 30
    mat.r1 = _mm256_unpackhi_pd( t0, t2 ); // 01, 11, 21, 31
    mat.r2 = _mm256_unpacklo_pd( t1, t3 ); // 02, 12, 22, 32
    mat.r3 = _mm256_unpackhi_pd( t1, t3 ); // 03, 13, 23, 33
}

inline void store( const Matrix4x4& mat, double* rdi, size_t stride = 8 )
{
    _mm256_storeu_pd( rdi, mat.r0 );
    _mm256_storeu_pd( rdi + stride, mat.r1 );
    _mm256_storeu_pd( rdi + stride * 2, mat.r2 );
    _mm256_storeu_pd( rdi + stride * 3, mat.r3 );
}

// Transpose 8x8 matrix of double values
void transpose8x8( double* rdi, const double* rsi )
{
    Matrix4x4 block;
    // Top-left corner
    loadTransposed( block, rsi );
    store( block, rdi );

#if 1
    // Using another instance of the block to support in-place transpose
    Matrix4x4 block2;
    loadTransposed( block, rsi + 4 );       // top right block
    loadTransposed( block2, rsi + 8 * 4 ); // bottom left block

    store( block2, rdi + 4 );
    store( block, rdi + 8 * 4 );
#else
    // Flip the #if if you can guarantee ( rsi != rdi )
    // Performance is about the same, but this version uses 4 fewer vector registers,
    // slightly more efficient when some registers need to be backed up / restored.
    assert( rsi != rdi );
    loadTransposed( block, rsi + 4 );
    store( block, rdi + 8 * 4 );

    loadTransposed( block, rsi + 8 * 4 );
    store( block, rdi + 4 );
#endif
    // Bottom-right corner
    loadTransposed( block, rsi + 8 * 4 + 4 );
    store( block, rdi + 8 * 4 + 4 );
}

For completeness, here’s a version which uses code very similar to your macro: it does half as many loads, the same number of stores, and more shuffles. I have not benchmarked it, but I would expect it to be slightly slower.

struct Matrix4x4
{
    __m256d r0, r1, r2, r3;
};

inline void load( Matrix4x4& mat, const double* rsi, size_t stride = 8 )
{
    mat.r0 = _mm256_loadu_pd( rsi );
    mat.r1 = _mm256_loadu_pd( rsi + stride );
    mat.r2 = _mm256_loadu_pd( rsi + stride * 2 );
    mat.r3 = _mm256_loadu_pd( rsi + stride * 3 );
}

inline void store( const Matrix4x4& mat, double* rdi, size_t stride = 8 )
{
    _mm256_storeu_pd( rdi, mat.r0 );
    _mm256_storeu_pd( rdi + stride, mat.r1 );
    _mm256_storeu_pd( rdi + stride * 2, mat.r2 );
    _mm256_storeu_pd( rdi + stride * 3, mat.r3 );
}

inline void transpose( Matrix4x4& m4 )
{
    // These unpack instructions transpose lanes within 2x2 blocks of the matrix
    const __m256d t0 = _mm256_unpacklo_pd( m4.r0, m4.r1 );
    const __m256d t1 = _mm256_unpacklo_pd( m4.r2, m4.r3 );
    const __m256d t2 = _mm256_unpackhi_pd( m4.r0, m4.r1 );
    const __m256d t3 = _mm256_unpackhi_pd( m4.r2, m4.r3 );
    // Produce the transposed matrix by combining these blocks
    m4.r0 = _mm256_permute2f128_pd( t0, t1, 0x20 );
    m4.r1 = _mm256_permute2f128_pd( t2, t3, 0x20 );
    m4.r2 = _mm256_permute2f128_pd( t0, t1, 0x31 );
    m4.r3 = _mm256_permute2f128_pd( t2, t3, 0x31 );
}

// Transpose 8x8 matrix with double values
void transpose8x8( double* rdi, const double* rsi )
{
    Matrix4x4 block;
    // Top-left corner
    load( block, rsi );
    transpose( block );
    store( block, rdi );

    // Using another instance of the block to support in-place transpose, with very small overhead
    Matrix4x4 block2;
    load( block, rsi + 4 );     // top right block
    load( block2, rsi + 8 * 4 ); // bottom left block

    transpose( block2 );
    store( block2, rdi + 4 );
    transpose( block );
    store( block, rdi + 8 * 4 );

    // Bottom-right corner
    load( block, rsi + 8 * 4 + 4 );
    transpose( block );
    store( block, rdi + 8 * 4 + 4 );
}
Soonts
  • Can you load with `vmovdqu` / `vinsertf128` to set up for 256-bit stores, instead of 256-bit loads and split stores? Most modern CPUs can do 2 loads per clock but only 1 store per clock (Ice Lake is 2 each). On Zen 2, `vinsertf128 y,mem` is single-uop for the front-end. (On Intel it's 2, and needs an ALU uop in the back end as well as the load, but that uop can run on any vector ALU port, probably like a broadcast-load feeding a blend). – Peter Cordes Mar 24 '21 at 03:06
  • @PeterCordes Good idea, added another version. – Soonts Mar 24 '21 at 03:38
  • Why not show the probably-better version first, and move the old version down? Your answer does seem to agree with me that that's the better version, so IMO it should be at the top, or just after the 4x4 way like the other 8x2 way, where future readers will see those 2 options first. Then you can show the more-stores way and talk about why it's probably worse. Or just remove the code and mention that doing 2x as many stores is worse, except maybe when your destination can't be aligned by 32. – Peter Cordes Mar 24 '21 at 04:06
  • @PeterCordes Have not tested and it’s hard to do (depends on CPU and surrounding code), but yeah, I tend to agree. Updated. – Soonts Mar 24 '21 at 04:53
  • Thanks! One more question about the loads/stores: if one wanted to transpose even larger matrix using this 8x8 function, is it better to iterate through the matrix left-to-right then top-bottom or top-bottom first and then left-to-right? IOW, does it matter whether I load in strides or store in strides, so to say? – Ecir Hana Mar 24 '21 at 07:21
  • @PeterCordes If you don't mind, originally you made a comment about AVX512, what was it about? That it is possible to do one 4x4 transpose using just single instruction? – Ecir Hana Mar 24 '21 at 07:23
  • @EcirHana It’s generally better to do sequential reads + random writes than the opposite. If you wanna transpose large matrix using that `loadTransposed` function for 4x4 blocks, read the blocks left to right in the inner loop, top to bottom in the outer loop. – Soonts Mar 24 '21 at 09:54
  • @Soonts: really? For a small burst, yeah the store buffer can do additional decoupling from execution to avoid stalling (vs. just the ROB / RS for loads), but I wouldn't say *generally*. If there's time for a cache line to be evicted between the first and second times you touch it, read is cheaper than write. A write needs to fill the cache line first (RFO = read for ownership), and eventually write-back the dirty data. – Peter Cordes Mar 24 '21 at 11:37
  • @EcirHana: In my initial comment, I was mixing up 8x 8-byte (64 bytes = one AVX-512 vector) with 8x8 * 8 bytes (256 bytes = 4 AVX-512 vectors). That's why I deleted it. But yeah, for a 4x4 matrix, 128 bytes fits in 2 vectors. Two `vpermt2ps` shuffles can do a lane-crossing interleave that selects the elements you want from the source data, given the right control vectors. https://www.felixcloutier.com/x86/vpermt2w:vpermt2d:vpermt2q:vpermt2ps:vpermt2pd / `_mm512_permutex2var_pd`. So it's one shuffle per output vector, with both loads and stores being full width. – Peter Cordes Mar 24 '21 at 11:40
  • @PeterCordes I understand why random writes are slower than sequential ones, it's just that random reads are so bad. Unless cached, a pipeline stall is very likely to happen. When I messed with various RAM-bound code, I don't remember a single time when trading random reads for random writes wasn't a good idea. – Soonts Mar 24 '21 at 20:31
  • Interesting, I hadn't played with any real experiments. Perhaps if the overall system was bound on bandwidth, not latency of individual cores, things would be different. In this case the area that you're scattering into is small, so you're unlikely to have a line evicted between stores, in which case yeah, you want sequential reads to kick off L1d prefetching ASAP, and let the store buffer absorb the writes. Were the other cases you played with also involving small areas, or were some more widely scattered? – Peter Cordes Mar 24 '21 at 21:20

For small matrices where more than 1 row can fit in a single SIMD vector, AVX-512 has very nice 2-input lane-crossing shuffles with 32-bit or 64-bit granularity, with a vector control. (Unlike _mm512_unpacklo_pd which is basically 4 separate 128-bit shuffles.)

A 4x4 double matrix is "only" 128 bytes, two ZMM __m512d vectors, so you only need two vpermt2pd (_mm512_permutex2var_pd) shuffles to produce both output vectors: one shuffle per output vector, with both loads and stores being full width. You do need control-vector constants, though.
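
A minimal sketch of that idea, assuming a row-major 4x4 double matrix in memory (untested; the function name and index constants are mine):

#include <immintrin.h>

// Untested sketch: 4x4 transpose of doubles with two full-width ZMM loads,
// two vpermt2pd shuffles and two full-width stores. Assumes dst != src.
void transpose4x4_avx512( double* dst, const double* src )
{
    __m512d v0 = _mm512_loadu_pd( src );     // rows 0,1: 00 01 02 03 10 11 12 13
    __m512d v1 = _mm512_loadu_pd( src + 8 ); // rows 2,3: 20 21 22 23 30 31 32 33

    // Each 64-bit index selects an element; indices 8..15 come from the second source.
    const __m512i idx0 = _mm512_setr_epi64( 0, 4, 8, 12, 1, 5, 9, 13 );
    const __m512i idx1 = _mm512_setr_epi64( 2, 6, 10, 14, 3, 7, 11, 15 );

    __m512d out0 = _mm512_permutex2var_pd( v0, idx0, v1 ); // 00 10 20 30 01 11 21 31
    __m512d out1 = _mm512_permutex2var_pd( v0, idx1, v1 ); // 02 12 22 32 03 13 23 33

    _mm512_storeu_pd( dst, out0 );
    _mm512_storeu_pd( dst + 8, out1 );
}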

Using 512-bit vector instructions has some downsides (clock speed and execution port throughput), but if your program can spend a lot of time in code that uses 512-bit vectors, there's probably a significant throughput gain from throwing around more data with each instruction, and having more powerful shuffles.

With 256-bit vectors, vpermt2pd ymm would probably not be useful for a 4x4, because for each __m256d output row, each of the 4 elements you want comes from a different input row. So one 2-input shuffle can't produce the output you want.

I think lane-crossing shuffles with less than 128-bit granularity aren't useful unless your matrix is small enough to fit multiple rows in one SIMD vector. See How to transpose a 16x16 matrix using SIMD instructions? for some algorithmic complexity reasoning about 32-bit elements: an 8x8 transpose of 32-bit elements with AVX1 is about the same as an 8x8 of 64-bit elements with AVX-512, where each SIMD vector holds exactly one whole row.

So there's no need for vector constants, just immediate shuffles of 128-bit chunks, plus unpacklo/hi.
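
For illustration, here's an untested sketch of that in-register approach for an 8x8 of doubles (my own function and variable names). It uses only vunpcklpd/vunpckhpd plus the immediate-controlled vshuff64x2 (_mm512_shuffle_f64x2), 24 shuffles total, with no shuffle-control vectors:

#include <immintrin.h>

// Untested sketch: 8x8 transpose of doubles with AVX-512, one whole row per ZMM.
// Only in-lane unpacks and immediate-controlled 128-bit-lane shuffles.
void transpose8x8_avx512( double* dst, const double* src )
{
    __m512d r0 = _mm512_loadu_pd( src + 0 * 8 );
    __m512d r1 = _mm512_loadu_pd( src + 1 * 8 );
    __m512d r2 = _mm512_loadu_pd( src + 2 * 8 );
    __m512d r3 = _mm512_loadu_pd( src + 3 * 8 );
    __m512d r4 = _mm512_loadu_pd( src + 4 * 8 );
    __m512d r5 = _mm512_loadu_pd( src + 5 * 8 );
    __m512d r6 = _mm512_loadu_pd( src + 6 * 8 );
    __m512d r7 = _mm512_loadu_pd( src + 7 * 8 );

    // Interleave pairs of rows within each 128-bit lane.
    __m512d t0 = _mm512_unpacklo_pd( r0, r1 ); // 00 10 | 02 12 | 04 14 | 06 16
    __m512d t1 = _mm512_unpackhi_pd( r0, r1 ); // 01 11 | 03 13 | 05 15 | 07 17
    __m512d t2 = _mm512_unpacklo_pd( r2, r3 );
    __m512d t3 = _mm512_unpackhi_pd( r2, r3 );
    __m512d t4 = _mm512_unpacklo_pd( r4, r5 );
    __m512d t5 = _mm512_unpackhi_pd( r4, r5 );
    __m512d t6 = _mm512_unpacklo_pd( r6, r7 );
    __m512d t7 = _mm512_unpackhi_pd( r6, r7 );

    // Gather matching 128-bit lanes from pairs of the above
    // (imm 0x88 picks lanes 0,2 of each source; 0xDD picks lanes 1,3).
    __m512d u0 = _mm512_shuffle_f64x2( t0, t2, 0x88 ); // 00 10 | 04 14 | 20 30 | 24 34
    __m512d u1 = _mm512_shuffle_f64x2( t0, t2, 0xDD ); // 02 12 | 06 16 | 22 32 | 26 36
    __m512d u2 = _mm512_shuffle_f64x2( t1, t3, 0x88 );
    __m512d u3 = _mm512_shuffle_f64x2( t1, t3, 0xDD );
    __m512d u4 = _mm512_shuffle_f64x2( t4, t6, 0x88 );
    __m512d u5 = _mm512_shuffle_f64x2( t4, t6, 0xDD );
    __m512d u6 = _mm512_shuffle_f64x2( t5, t7, 0x88 );
    __m512d u7 = _mm512_shuffle_f64x2( t5, t7, 0xDD );

    // Final 128-bit-lane shuffles produce whole transposed rows.
    _mm512_storeu_pd( dst + 0 * 8, _mm512_shuffle_f64x2( u0, u4, 0x88 ) ); // column 0
    _mm512_storeu_pd( dst + 1 * 8, _mm512_shuffle_f64x2( u2, u6, 0x88 ) ); // column 1
    _mm512_storeu_pd( dst + 2 * 8, _mm512_shuffle_f64x2( u1, u5, 0x88 ) ); // column 2
    _mm512_storeu_pd( dst + 3 * 8, _mm512_shuffle_f64x2( u3, u7, 0x88 ) ); // column 3
    _mm512_storeu_pd( dst + 4 * 8, _mm512_shuffle_f64x2( u0, u4, 0xDD ) ); // column 4
    _mm512_storeu_pd( dst + 5 * 8, _mm512_shuffle_f64x2( u2, u6, 0xDD ) ); // column 5
    _mm512_storeu_pd( dst + 6 * 8, _mm512_shuffle_f64x2( u1, u5, 0xDD ) ); // column 6
    _mm512_storeu_pd( dst + 7 * 8, _mm512_shuffle_f64x2( u3, u7, 0xDD ) ); // column 7
}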


Transposing an 8x8 with 512-bit vectors (8 doubles) would have the same problem: each output row of 8 doubles needs 1 double from each of 8 input vectors. So ultimately I think you want a similar strategy to Soonts' AVX answer, starting with _mm512_insertf64x4(v, load, 1) as the first step to get the first half of 2 input rows into one vector.
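
That first step might look something like this (untested fragment; rsi and t0 are placeholder names, analogous to the vinsertf128 loads in the AVX version):

// Hypothetical first step: left halves of rows 0 and 1 packed into one ZMM,
// via a 256-bit load plus vinsertf64x4.
__m512d t0 = _mm512_castpd256_pd512( _mm256_loadu_pd( rsi ) );  // 00 01 02 03
t0 = _mm512_insertf64x4( t0, _mm256_loadu_pd( rsi + 8 ), 1 );   // 00 01 02 03 10 11 12 13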

(If you care about KNL / Xeon Phi, @ZBoson's other answer on How to transpose a 16x16 matrix using SIMD instructions? shows some interesting ideas using merge-masking with 1-input shuffles like vpermpd or vpermq, instead of 2-input shuffles like vunpcklpd or vpermt2pd)

Using wider vectors means fewer loads and stores, and maybe even fewer total shuffles because each one combines more data. But you also have more shuffling work to do, to get all 8 elements of a row into one vector, instead of just loading and storing to different places in chunks half the size of a row. It's not obvious which is better; I'll update this answer if I get around to actually writing the code.

Note that Ice Lake (first consumer CPU with AVX-512) can do 2 loads and 2 stores per clock. It has better shuffle throughput than Skylake-X for some shuffles, but not for any that are useful for this or Soonts' answer. (All of vperm2f128, vunpcklpd and vpermt2pd only run on port 5, for the ymm and zmm versions. https://uops.info/. vinsertf64x4 zmm, mem, 1 is 2 uops for the front-end, and needs a load port and a uop for p0/p5. (Not p1 because it's a 512-bit uop, and see also SIMD instructions lowering CPU frequency).)

Peter Cordes