I am working on an in-place matrix-transpose implementation with the following idea: let the indices take only values that correspond to upper-triangular entries (since the diagonal entries stay constant under the transformation) and exchange the values A[i,j] and A[j,i] within one thread via a temporary variable.

I am having trouble assigning the correct next index pair A[i_next,j_next] to the thread currently working on A[i,j] (this corresponds to the last two lines in the code snippet). How do I have to modify these two lines so that each thread accesses only the correct matrix entries?

Example: for N = 5 and total_threads = 3:

This is a 5x5 matrix where each entry holds the thread_id of the thread that shall access it. The entries belonging to the first thread (the 1s) also show which indices I want row_idx and col_idx to assume for that specific thread.

       Col0  Col1  Col2  Col3  Col4
Row0     /     1     2     3     1
Row1     /     /     2     3     1
Row2     /     /     /     2     3
Row3     /     /     /     /     1
Row4     /     /     /     /     /
__global__ void in_place_transpose(double *A, int N)
{
    int t_idx = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int total_threads = blockDim.x * gridDim.x;
    int row_idx = t_idx / N;
    int col_idx = t_idx % N;

    // only access upper triangular entries
    if (row_idx < col_idx)
    {
        while (row_idx < (N - 1) && col_idx < N) // A[N-2,N-1] is the last entry we change
        {
            int to = row_idx * N + col_idx;
            int from = col_idx * N + row_idx;

            if (from != to) // diagonal is constant
            {
                // printf("from: %d, to: %d \n", from, to); // show operation pair
                // printf("th: %d, row: %d, col: %d \n", t_idx, row_idx, col_idx); // show indices
                double temp = A[to];
                A[to] = A[from];
                A[from] = temp;
            }
            // assign next matrix entry to thread (FAULTY)
            row_idx = (int)((row_idx + total_threads) % (N - row_idx - 1));
            col_idx = (int)(col_idx + (total_threads % (N - row_idx - 1)) % (N - row_idx - 1));
            
        }
    }
}
grube
  • Think blockwise. A group of threads (e.g. a warp) reads a square block (e.g. 32x32) and its transposed counterpart and exchanges them via shared memory (e.g. 32x33). In this way you can keep the global memory accesses coalesced and the shared memory free of bank conflicts. If you only work with small matrices, you do not have to consider this. – Sebastian Oct 25 '22 at 16:52
  • For your indexing you have to translate a linear index into triangle indices. The simplest case (without consideration for speed) is to assign the indices as for a square and discard the diagonal and lower triangle. You do not need the next index; just generate a large enough grid to produce enough threads. – Sebastian Oct 25 '22 at 16:55
  • For a direct calculation of the triangle row and col from a given threadIdx, calculate `row=N-1-round(sqrt(N*N-N-1-2*threadIdx))` and `col=threadIdx-row*(2*N-(1+row))/2`. The number of needed threads is `N*(N-1)/2`. (A brute-force check of this mapping is sketched after these comments.) – Sebastian Oct 25 '22 at 17:38
  • To spell Sebastian's remark out differently: even if you find a way to calculate the right indices, performance will be very bad because your accesses to global memory won't be [coalesced](https://developer.nvidia.com/blog/how-access-global-memory-efficiently-cuda-c-kernels/), which is critical for bandwidth-limited kernels like this. See [here](https://developer.nvidia.com/blog/efficient-matrix-transpose-cuda-cc/) for a good implementation of transpose. – paleonix Oct 26 '22 at 10:53
  • Thanks for the remarks! @Sebastian I should have mentioned that I want this to work for N=10^8, so creating a grid with enough threads isn't an option, unfortunately. Regarding performance, you are right of course; this idea is not very fast. However, I still want to make it work before moving on to a more optimized version. – grube Oct 26 '22 at 15:17
  • Grids can be very large! If the threads cannot run concurrently, they are executed one after the other (or several after several). Only blocks and thread block cluster (available on Hopper) are guaranteed to run at the same time. The maximum number of threads per kernel call is about 2^80, which is about 10^24. But for N=10^8 with N^2=10^16 elements I would definitely recommend an optimized version. How much memory do you have to hold the matrix? Dozens of Exabyte? All DRAM or how fast can you access it? – Sebastian Oct 26 '22 at 17:17
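
To make Sebastian's closed-form mapping above concrete, here is a small host-side brute-force check. This is only a sketch: it assumes the `col` in the formula is the offset within the row, so the absolute column is `col + row + 1`, and the test size `N` is an arbitrary choice. Any mismatch between the formula and a direct enumeration of the triangle is printed.

#include <cmath>
#include <cstdio>

int main()
{
    const int N = 100; // test size (an arbitrary assumption)
    long k = 0;        // linear index over the strict upper triangle
    for (int r = 0; r < N - 1; r++)
        for (int c = r + 1; c < N; c++, k++) {
            // the row/col formulas from the comment above
            int row = N - 1 - (int)std::llround(std::sqrt((double)N * N - N - 1 - 2.0 * k));
            int col = (int)(k - (long)row * (2 * N - (1 + row)) / 2) + row + 1;
            if (row != r || col != c)
                printf("mismatch at k=%ld: got (%d,%d), expected (%d,%d)\n", k, row, col, r, c);
        }
    printf("checked %ld linear indices (expected %d)\n", k, N * (N - 1) / 2);
    return 0;
}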

1 Answer

> I should have mentioned that I want this to work for N=10^8, so creating a grid with enough threads isn't an option unfortunately. Regarding performance, you are right of course, this idea is not very fast... However, I still want to make it work before moving on to a more optimized version

  • I doubt that scale means you can't create enough threads, but we can set that aside.
  • I concur with the comments suggesting that you may want to think carefully about your algorithmic starting point for efficient GPU usage, but you seem to have acknowledged that already.

With that preamble, I believe what you are asking for is how to convert a linear index into a 2-dimensional triangular index.

That problem can be solved without iteration, recursion, floating-point arithmetic, or advanced functions such as square root; addition, subtraction, multiplication, and division suffice.

Once we reduce the problem to one driven by a linear index, we can apply a grid-stride loop over that linear index, with conversion to 2D triangular indices "on the fly", to allow an arbitrary number of threads to work on and complete the problem.

Based on those ideas, here is one possible approach:

$ cat t2137.cu
#include <iostream>

// conversion of linear index to triangular matrix (row,col) index
__host__ __device__ void linear_to_triangular(const unsigned i, const unsigned n, unsigned &trow, unsigned &tcol)
{
    int c = i;
    int row = c/(n-1);
    int col = c - (row*(n-1)); // effectively modulo

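    // fold the rectangular index space onto the strict upper triangle:
    // indices that land strictly below the rectangle's diagonal are reflected,
    // so every (row, col) pair with row < col is produced exactly once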
    if (col < row) {
      col = (n-2)-col;
      row = (n-1)-row;
    }

    tcol = col+1;
    trow = row;
}

const size_t mdim = 547;  // matrix dimension (deliberately odd)
using mt = int;           // matrix element type
const size_t tdim = 136;  // total threads (deliberately fewer than the number of swaps)
const unsigned bdim = 64; // threads per block

template <typename T>
__global__ void t(T *m, const size_t n){

  for (int idx = blockIdx.x*blockDim.x+threadIdx.x; idx < ((n*(n-1)) >> 1); idx += gridDim.x*blockDim.x){ // 1D grid-stride loop
    unsigned i,j;
    linear_to_triangular(idx, n, i, j);
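    // (i, j) now names a strictly-upper-triangular entry (i < j);
    // swap it with its mirror (j, i) below the diagonal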
    T e1 = m[i*n+j];
    T e2 = m[j*n+i];
    m[i*n+j] = e2;
    m[j*n+i] = e1;
    }
}

int main(){

  mt *d, *h;
  h = new mt[mdim*mdim];
  cudaMalloc(&d, mdim*mdim*sizeof(mt));
  for (int i = 0; i < mdim*mdim; i++) h[i] = i;
  cudaMemcpy(d, h, mdim*mdim*sizeof(mt), cudaMemcpyHostToDevice);
  t<<<(tdim+bdim-1)/bdim, bdim>>>(d, mdim);
  cudaMemcpy(h, d, mdim*mdim*sizeof(mt), cudaMemcpyDeviceToHost);
  for (int i = 0; i < mdim*mdim; i++) {
    int row = i/mdim;
    int col = i - row*mdim;
    if (h[col*mdim+row] != i) {std::cout << "Mismatch at: " << i << " was: " << h[col*mdim+row] << " should be: " << i << std::endl; return 0;}
    }
}

$ nvcc -o t2137 t2137.cu
$ compute-sanitizer ./t2137
========= COMPUTE-SANITIZER
========= ERROR SUMMARY: 0 errors
$

I don't see much point in going deeply into performance. A naive transpose such as what you have shown can have one "coalesced" half but will then have the other half (badly) "uncoalesced". As far as that goes, this algorithm isn't any different. The linear-to-triangular conversion will still tend to create indices such that adjacent threads will often be loading (or storing, depending on how you write it) adjacent elements, but of course the staggered pattern of the triangular data patch introduces discontinuities in the adjacency pattern. And if your loads are partly coalesced, your stores won't be.
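
If/when you move to an optimized version, the comments describe the standard remedy: stage square tiles through shared memory so that both the loads and the stores stay coalesced. Here is a minimal sketch of that blockwise in-place idea (not the code from the question or the linked blog: the kernel name, tile size, and grid mapping are illustrative assumptions, and N is assumed to be a multiple of the tile size for brevity):

#define TILE 32

__global__ void tiled_inplace_transpose(double *A, int N)
{
    // +1 column of padding avoids shared-memory bank conflicts
    __shared__ double upper[TILE][TILE + 1];
    __shared__ double lower[TILE][TILE + 1];

    // enumerate only tile pairs on or above the block diagonal
    int bi = blockIdx.y;              // tile row
    int bj = blockIdx.x + blockIdx.y; // tile col, always >= tile row
    if (bj >= N / TILE) return;       // uniform per block, so safe before __syncthreads

    int x = threadIdx.x, y = threadIdx.y;
    int r0 = bi * TILE, c0 = bj * TILE;

    // coalesced loads: consecutive threadIdx.x reads consecutive addresses
    upper[y][x] = A[(r0 + y) * N + (c0 + x)];
    lower[y][x] = A[(c0 + y) * N + (r0 + x)];
    __syncthreads();

    // coalesced stores: each tile is written back, transposed, into the
    // other tile's position; a diagonal tile (bi == bj) transposes in place
    A[(r0 + y) * N + (c0 + x)] = lower[x][y];
    if (bi != bj)
        A[(c0 + y) * N + (r0 + x)] = upper[x][y];
}

A launch along the lines of tiled_inplace_transpose<<<dim3(N/TILE, N/TILE), dim3(TILE, TILE)>>>(d, mdim) covers every tile pair on or above the block diagonal; the roughly half of the blocks whose computed tile column falls past the edge of the matrix exit immediately.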

Robert Crovella