3

You can find a lot of good answers for transposing a matrix that falls within the natural size of the SIMD instruction set, in particular where the size of one row is no more than the vector width. Examples would be a 4x4 float transpose in SSE, or a 4x4 double or 8x8 float transpose in AVX/AVX2 (double everything again for AVX-512).

However, what are the options when the matrix is larger than that? E.g., a 16x16 float matrix using AVX2? Can SIMD shuffles be used at all to speed things up, or is a gather + sequential write the only way?

BeeOnRope

2 Answers

3

If all your matrix dimensions are a multiple of your packet-size you can do the operation block-wise and swap the blocks as needed. Example for 4x4 double matrix using SSE2:

// transpose vectors i0 and i1 and store the result to addresses r0 and r1
void transpose2x2(double *r0, double* r1, __m128d i0, __m128d i1)
{
    __m128d t0 = _mm_unpacklo_pd(i0,i1);
    __m128d t1 = _mm_unpackhi_pd(i0,i1);
    _mm_storeu_pd(r0, t0);
    _mm_storeu_pd(r1, t1);
}


void transpose(double mat[4][4])
{
    // transpose [00]-block in-place
    transpose2x2(mat[0]+0, mat[1]+0,_mm_loadu_pd(mat[0]+0),_mm_loadu_pd(mat[1]+0));

    // load [20]-block
    __m128d t20 = _mm_loadu_pd(mat[2]+0), t30 = _mm_loadu_pd(mat[3]+0);
    // transpose [02]-block and store it to [20] position
    transpose2x2(mat[2]+0,mat[3]+0, _mm_loadu_pd(mat[0]+2),_mm_loadu_pd(mat[1]+2));
    // transpose temp-block and store it to [02] position
    transpose2x2(mat[0]+2,mat[1]+2, t20, t30);

    // transpose [22]-block in-place
    transpose2x2(mat[2]+2, mat[3]+2,_mm_loadu_pd(mat[2]+2),_mm_loadu_pd(mat[3]+2));
}

This should be relatively easy to extend to other square matrices, other scalar types and other architectures. Matrices whose dimensions are not a multiple of the packet size are more complicated (if they are large enough, it will probably be worth it to do most of the work with vectorization and just do the last rows/columns manually).
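
For instance, a minimal sketch of how the 2x2 kernel above might be looped over a square NxN matrix of doubles (N even, row-major, contiguous); the function name transpose_inplace and the loop structure are illustrative only, not taken from a tested library:

#include <emmintrin.h>
#include <stddef.h>

// Sketch: in-place transpose of an n x n row-major double matrix (n even),
// reusing the transpose2x2 kernel from above. Diagonal 2x2 blocks are
// transposed in place; each off-diagonal pair of blocks is swapped via a
// small register temporary, exactly as in the 4x4 version.
void transpose_inplace(double *mat, size_t n)
{
    for (size_t i = 0; i < n; i += 2) {
        // diagonal block [i][i]
        transpose2x2(mat + i*n + i, mat + (i+1)*n + i,
                     _mm_loadu_pd(mat + i*n + i),
                     _mm_loadu_pd(mat + (i+1)*n + i));
        for (size_t j = i + 2; j < n; j += 2) {
            // keep a copy of the lower block [j][i]
            __m128d lo0 = _mm_loadu_pd(mat + j*n + i);
            __m128d lo1 = _mm_loadu_pd(mat + (j+1)*n + i);
            // transpose the upper block [i][j] into the lower position
            transpose2x2(mat + j*n + i, mat + (j+1)*n + i,
                         _mm_loadu_pd(mat + i*n + j),
                         _mm_loadu_pd(mat + (i+1)*n + j));
            // transpose the saved lower block into the upper position
            transpose2x2(mat + i*n + j, mat + (i+1)*n + j, lo0, lo1);
        }
    }
}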

For some sizes, e.g., 3x4 or 3x8 matrices, there are special algorithms [1] -- if you have a 1003x1003 matrix, you could exploit that for the last rows/columns (and there are probably algorithms for other odd sizes as well).

With some effort you could also write this for rectangular matrices (some thought has to be given to how to avoid caching more than one block at a time, but it is possible).
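
The out-of-place rectangular case is the simpler one, since blocks do not need to trade places. A hedged sketch along the same lines, again assuming even dimensions and reusing transpose2x2 (the name transpose_copy is mine):

// Sketch: out-of-place transpose of a rows x cols row-major double matrix
// (both dimensions even) into a cols x rows destination, one 2x2 block at a
// time. Same includes as above.
void transpose_copy(double *dst, const double *src, size_t rows, size_t cols)
{
    for (size_t i = 0; i < rows; i += 2)
        for (size_t j = 0; j < cols; j += 2)
            transpose2x2(dst + j*rows + i, dst + (j+1)*rows + i,
                         _mm_loadu_pd(src + i*cols + j),
                         _mm_loadu_pd(src + (i+1)*cols + j));
}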

Godbolt demo: https://godbolt.org/z/tVk_Bc

[1] https://software.intel.com/en-us/articles/3d-vector-normalization-using-256-bit-intel-advanced-vector-extensions-intel-avx

chtz
  • Thanks! Is it fair to say that for large matrices (say 100x10,000) this will still give a significant speedup over scalar or gather-based load-store? – BeeOnRope Oct 18 '19 at 18:21
  • 2
    I guess the relative speedup should be independent of the size (as long as effects like caching don't come into play -- a scalar algorithm with good cache locality will likely beat a badly implemented blockwise transpose; see the tiled-loop sketch below). But to be sure, you need to benchmark this (the rectangular case will probably be a bit more complicated than I first thought, especially if you don't know the sizes at compile time). – chtz Oct 18 '19 at 19:21
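
To illustrate the cache-locality point from the comments above, here is a minimal scalar loop-tiling sketch (out of place; the tile size of 32 is only a guess to be tuned by benchmarking, and the function name is illustrative):

#include <stddef.h>

#define TILE 32  // illustrative tile size; tune by benchmarking

// Sketch: scalar out-of-place transpose with loop tiling. Processing the
// matrix in small square tiles keeps both reads and writes within a
// cache-friendly footprint, which matters once whole rows no longer fit in
// cache (e.g. the 100x10,000 case mentioned above).
void transpose_tiled(double *dst, const double *src, size_t rows, size_t cols)
{
    for (size_t ii = 0; ii < rows; ii += TILE)
        for (size_t jj = 0; jj < cols; jj += TILE) {
            size_t imax = ii + TILE < rows ? ii + TILE : rows;
            size_t jmax = jj + TILE < cols ? jj + TILE : cols;
            for (size_t i = ii; i < imax; ++i)
                for (size_t j = jj; j < jmax; ++j)
                    dst[j*rows + i] = src[i*cols + j];
        }
}
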
-1

Perhaps one could use the Fortran TRANSPOSE intrinsic with ISO_C_BINDING and link that with C as a subroutine or function call.

TRANSPOSE is fairly well optimised in Fortran.

And mixed-language skills are sometimes useful to have in general. I have even linked F90 with Go.

Holmz
  • 2
    Ok, but what asm does it end up running that makes it go fast, on any given Fortran implementation? If we knew that, we could also implement it in C with intrinsics or in hand-written asm. – Peter Cordes Oct 19 '19 at 01:10
  • 2
    gfortran seems to be optimizing `TRANSPOSE` into efficient SIMD code, if the matrix is square and the size is a multiple of four/eight (known at compile-time). For every other case I looked at, I only found it generating a trivial nested loop: https://godbolt.org/z/Uiv1iR -- also I did not manage to have it generate an in-place transpose without allocating a temporary (but that might be due to my limited fortran skills ...) – chtz Oct 20 '19 at 21:40
  • It is not uncommon to use features in a language that are easy to use. Mixed-language programming may not be to everyone's liking, but it is fairly common. – Holmz Oct 24 '19 at 12:09
  • For a small matrix, like 4x4 doubles, the overhead of calling another function instead of just inlining a few SIMD instructions is significant. – Peter Cordes Mar 23 '21 at 16:43