Matrix transpose with AVX-512 encounters memory address issues

Question

I'm trying to use avx-512 to do matrix transpose. But the matrix still can’t be transposed. When I try to tranpose a 16*16 matrix, the code will run i=0 and j=0~15. I think the problem is related to matrix element address (such as &temp_re[j * rowA + i], &matA_re[i * colA + j]). I hope someone can tell me how to modify the matrix element address, so the function can transpose any size of matrix.

Code:

#define AVX 16
//--------------------
#include <immintrin.h>
// C
#include <complex.h>
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>
#include <math.h>
void matrix_transpose_avx2(float *matA_re, float *matA_im, int rowA, int colA) 
{
    float *temp_re = (float *)malloc(rowA * colA * sizeof(float));
    float *temp_im = (float *)malloc(rowA * colA * sizeof(float));
    memcpy(temp_re, matA_re, (rowA * colA * sizeof(float)));
    memcpy(temp_im, matA_im, (rowA * colA * sizeof(float)));
    
    int remaining = rowA % 16;  // Elements that can't be processed in AVX512 vectors 
    int row_limit = rowA - remaining;

    __m512 re_vec, im_vec;

    for (int i = 0; i < (row_limit/AVX) ; i++)
    {
        for (int j = 0; j < colA; ++j)
        {
            re_vec = _mm512_loadu_ps(&temp_re[j * rowA + i * AVX]);
            im_vec = _mm512_loadu_ps(&temp_im[j * rowA + i * AVX]);
            _mm512_storeu_ps(&matA_re[i * AVX * colA + j], re_vec);
            _mm512_storeu_ps(&matA_im[i * AVX * colA + j], im_vec);
            for (int k = 0; k < AVX; k++) 
            {
              printf("matA_re[%d] = %.2f, ", i + j * AVX + k, matA_re[i + j * AVX + k]);
              printf("matA_im[%d] = %.2f\n", i + j * AVX + k, matA_im[i + j * AVX + k]);
            }
        }
    }

    // Process remaining elements
    for (int i = row_limit; i < rowA; ++i)
    {
        for (int j = 0; j < colA; ++j)
        {
            matA_re[i * colA + j] = temp_re[j * rowA + i];
            matA_im[i * colA + j] = temp_im[j * rowA + i];
        }
    }
    

    free(temp_re);
    free(temp_im);
}

I really need Proficient coder's help. Thanks in advance.

It looks like you haven't understood how to use SIMD operations that load/store multiple floats at once since this looks like it'll still access out of bounds (`j * rowA + i * AVX` with `j=colA-1` and `i=max` will access from `j * rowA + i * AVX + 0..15`, which can be past the end of the array if `colA` is small). And you don't swap within 64-byte chunks you copy. Some example matrix transposes with narrower vectors include [SIMD transpose when row size is greater than vector width](https://stackoverflow.com/q/58454741) for a 4x4 with 2-element vectors — Peter Cordes, Aug 17 '23 at 02:56
And [How to transpose a 16x16 matrix using SIMD instructions?](https://stackoverflow.com/q/29519222) for 16x16 with 32-bit elements using AVX-512. This is the building block you should be using for wider matrices (or an FP-shuffle `__m512` version of it instead of `__m512i`), if their dimensions are >= 16x16. Swap whole chunks and transpose within those chunks. — Peter Cordes, Aug 17 '23 at 02:56
Or for example google for `simd matrix transpose` found https://github.com/yatsukha/sse-matrix-transpose/blob/3be77d89ece930e9795610e558912f964eae1539/matrix_util.hpp#L19 which transposes arbitrary-size `float` matrices using SSE2 by loading 4 vector of 4 floats each and doing a 4x4 transpose (same idea as with that 16x16 building-block Iinked, but smaller). Note that both row and column loop bounds need to be rounded down to the vector width, and cleanup loops are needed in multiple places (end of each row, and for the last few rows. For AVX-512, maybe use 128-bit vector cleanup then scalar). — Peter Cordes, Aug 17 '23 at 03:06
You appear to hoping someone will just write the code for you. SO is not a free code writing service. Matrix transpose is fast compared to matrix multiply so there's often little point trying to optimise the transpose step anyway. I suggest you start with the SSE example I showed you and deal with all the edge cases. There's a good chance it will be fast enough for whatever is your use case. If this is an exercise, then you need to figure out how to do it yourself. We can help, but we're not here to do it for you for free. Check out LinkedIn or somewhere if you want to hire someone. — Simon Goater, Aug 17 '23 at 10:25
@PeterCordes Corde Thank you for your suggestion, but my goal is to transpose any size of matrix, so _mm512_shuffle_ps( ) is probably not suitable for me to use in my code above the _mm512_storeu_ps( ) and _mm512_loadu_ps( ) . Do you have any other recommended AVX-512 instructions that may be helpful for matrix transpose? — rcheni, Aug 17 '23 at 11:20
`_mm256_shuffle_ps` and `_mm_shuffle_ps` are also useful for blocks too small for a 16x16, i.e. vectorizing the "cleanup" loops that handle parts of matrices left over after doing what you can with 16x16 blocks. Probably also `vpermt2ps` for a lane-crossing shuffle that selects any elements you want from 2 input vectors, and/or maybe `vpermps` to shuffle one vector arbitrarily, if you really want to vectorize more of the cleanup code. — Peter Cordes, Aug 17 '23 at 15:47
@PeterCordes Thank you for your suggestion, I will take your comment into consideration. Thank you for all your assistance. — rcheni, Aug 18 '23 at 08:35

Matrix transpose with AVX-512 encounters memory address issues

0 Answers0

Linked

Related