
I'm comparing a tiled and a naive matrix multiplication implementation in CUDA C++. I expected a clear performance gap between the two because the tiled version reuses data from shared memory, but the speedup was only about 2x (naive ~12 ms, tiled ~6 ms). Here are the code snippets:

#include <iostream>
#include <assert.h>
using namespace std;

#define N 1024
#define THREADS 16
// parenthesize the arguments so that expressions like IDX(blk + tx, y, n) expand correctly
#define IDX(x, y, s) ((x)*(s) + (y))

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
   if (code != cudaSuccess) 
   {
      fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
      if (abort) exit(code);
   }
}

void init_values(int *a, int *b, int sz) {
    for(int i=0; i<sz; i++) {
        a[i] = rand()%513 - 256;
        b[i] = rand()%513 - 256;
    }
}

__global__ 
void matmul(int *a, int *b, int *c, int n) {
    // naive matmul: x indexes rows of a and c, y indexes columns
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int t = 0;
    for(int i=0; i<n; i++) {
        t += (a[IDX(x, i, n)] * b[IDX(i, y, n)]);
    }
    c[IDX(x, y, n)] = t;
}

void matmul_verify(int *a, int *b, int *c, int n) {
    for(int i=0; i<n; i++) {
        for(int j=0; j<n; j++) {
            int t = 0;
            for(int k=0; k<n; k++)
                t += a[IDX(i, k, n)] * b[IDX(k, j, n)];
            // cout << i << " " << j << " " << c[IDX(i, j, n)] <<  " " << t <<  endl;
            assert(c[IDX(i, j, n)] == t);
        }
    }
}

int main()
{
    int *a, *b, *c;
    int *da, *db, *dc;
    size_t sz = N * N * sizeof(int);

    a = (int*)malloc(sz);
    b = (int*)malloc(sz);
    c = (int*)malloc(sz);
    init_values(a, b, N*N);

    gpuErrchk(cudaMalloc((void**)&da, sz));
    gpuErrchk(cudaMalloc((void**)&db, sz));
    gpuErrchk(cudaMalloc((void**)&dc, sz));

    gpuErrchk(cudaMemcpy(da, a, sz, cudaMemcpyHostToDevice));
    gpuErrchk(cudaMemcpy(db, b, sz, cudaMemcpyHostToDevice));
    // init grid size
    dim3 grids(N/THREADS, N/THREADS);
    dim3 blocks(THREADS, THREADS);
    // time it
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    matmul<<<grids, blocks>>>(da, db, dc, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float milliseconds = 0;
    cudaEventElapsedTime(&milliseconds, start, stop);
    cout << "Took " << milliseconds << " milliseconds.\n";

    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());
    gpuErrchk(cudaMemcpy(c, dc, sz, cudaMemcpyDeviceToHost));
    matmul_verify(a, b, c, N);

    cudaFree(da);
    cudaFree(db);
    cudaFree(dc);
    free(a);
    free(b);
    free(c);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    return 0;
}

and for the tiled implementation, I change the kernel to

#define SHMEM_SIZE (THREADS * THREADS) // one tile of THREADS x THREADS ints

__global__ 
void matmul(int *a, int *b, int *c, int n) {
    // perform parallel matmul
    int ty = threadIdx.y, by = blockIdx.y;
    int tx = threadIdx.x, bx = blockIdx.x;
    int x = bx * blockDim.x + tx;
    int y = by * blockDim.y + ty;
    // block IDs tell us which block to solve for
    // (bx, by) --> (bx: bx + tx, by:by + ty)
    __shared__ int A[SHMEM_SIZE];
    __shared__ int B[SHMEM_SIZE];

    const int tile_size = THREADS;

    // to get value of tile [tx, ty] in block [bx, by], we need blocks A[bx, *] and blocks B[*, by]
    int res = 0;

    for(int blk=0; blk < n; blk+=tile_size)  {
        // load one tile of a and one tile of b into shared memory
        A[IDX(tx, ty, tile_size)] = a[IDX(x, blk + ty, n)];
        B[IDX(tx, ty, tile_size)] = b[IDX(blk + tx, y, n)];
        __syncthreads();
        for(int k=0; k<tile_size; k++) {
            res += (A[IDX(tx, k, tile_size)] * B[IDX(k, ty, tile_size)]);
        }
        __syncthreads();
    }

    // for(int k=0; k<n; k++)
    //     res += a[IDX(x, k, n)] * b[IDX(k, y, n)];
    c[IDX(x, y, n)] = res;
}    

Nothing else really changes. However, in the tiled implementation, if I simply change

    int ty = threadIdx.x, by = blockIdx.x;
    int tx = threadIdx.y, bx = blockIdx.y;

for the initialization of the thread and block indices, the runtime drops to about 1 ms (a ~12x speedup over the naive version). How is this happening? I read in the book "CUDA by Example" that 2D thread and block indices are just a programmer convenience and do not make any difference in performance, but that seems to be false here. Any clarification is appreciated.

  • It seems like you pasted the 1ms version already instead of the 6ms version? Or you didn't correctly specify what you changed to get to 1ms. Am I overlooking something? – paleonix Apr 16 '22 at 14:19
  • I think I know what is happening: You are using `threadIdx.y` as the continuous dimension of your matrices. This doesn't allow [coalescing](https://developer.nvidia.com/blog/how-access-global-memory-efficiently-cuda-c-kernels/). Always use `threadIdx.x` for that. See the corresponding [CUDA sample](https://github.com/NVIDIA/cuda-samples/blob/master/Samples/0_Introduction/matrixMul/matrixMul.cu). – paleonix Apr 16 '22 at 14:25
  • yes, sorry. I edited the post. what I changed is the ordering of x and y indices – Rohit Jena Apr 16 '22 at 14:31

1 Answer


CUDA thread blocks are partitioned into warps of 32 threads. Ideally the neighboring lanes of a warp should always load neighboring elements from global memory. This is called coalescing and allows for maximum memory bandwidth. In hardware all the coalesced loads from a warp will be bundled into a minimal number of memory transactions.
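To make this concrete, here is a minimal sketch (not code from the question; the kernel names and the copy operation are made up for illustration) contrasting a coalesced and a strided read of a row-major matrix of ints:

__global__ void read_coalesced(const int *m, int *out, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x; // .x varies fastest within a warp
    // neighboring lanes touch neighboring addresses -> few transactions per warp
    out[row * n + col] = m[row * n + col];
}

__global__ void read_strided(const int *m, int *out, int n) {
    int row = blockIdx.x * blockDim.x + threadIdx.x; // .x now indexes rows
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    // neighboring lanes touch addresses n ints apart -> up to one transaction per lane
    out[row * n + col] = m[row * n + col];
}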

Other factors that can deteriorate memory bandwidth are the size of each load (one can use the built-in vector types, e.g. int2, int4, float2, etc., to get wider loads) and alignment.
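For example, a hypothetical copy kernel using int4 issues one 16-byte load and store per thread; this assumes the element count is a multiple of 4 and the pointers are 16-byte aligned (pointers returned by cudaMalloc are sufficiently aligned):

__global__ void copy_int4(const int4 *in, int4 *out, int n4) {
    // n4 is the number of int4 elements, i.e. the number of ints divided by 4
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4)
        out[i] = in[i]; // each thread moves 16 bytes with a single vectorized load/store
}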

The mapping from 3D threadIdx to warp lanes always takes the first dimension .x as the continuous dimension, i.e. a block of dimensions (32, 2, 1) will have one warp with threadIdx.y == 0 and one warp with threadIdx.y == 1 where the lanes of each warp correspond to threadIdx.x.
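In other words, the linearization that determines which warp a thread belongs to can be written out as follows (a sketch; warpSize is the built-in constant, 32 on current hardware):

__device__ int linear_thread_id() {
    // threads are numbered with .x fastest, then .y, then .z
    return threadIdx.x
         + threadIdx.y * blockDim.x
         + threadIdx.z * blockDim.x * blockDim.y;
}

__device__ int warp_id() { return linear_thread_id() / warpSize; }
__device__ int lane_id() { return linear_thread_id() % warpSize; }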

Therefore to allow for coalescing, you have to access memory as

A[ty * s + tx] // coalesced access

instead of

A[tx * s + ty] // strided access

to achieve optimal performance.
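Applied to the naive kernel from the question, that would mean deriving the column index from threadIdx.x, roughly like this (a sketch along the lines of the answer, reusing the IDX macro from the question, not tested code):

__global__ void matmul_rowcol(int *a, int *b, int *c, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x; // contiguous dimension
    int t = 0;
    for (int i = 0; i < n; i++) {
        // a[IDX(row, i, n)] is uniform across the lanes that share a row (broadcast-friendly),
        // b[IDX(i, col, n)] and the final store to c are coalesced across lanes
        t += a[IDX(row, i, n)] * b[IDX(i, col, n)];
    }
    c[IDX(row, col, n)] = t;
}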

What is probably meant in the book you mentioned is that there shouldn't be a performance difference between launching blocks of dimensions (32, 2, 1) and flat blocks of dimensions (64, 1, 1) where one manually computes ty = threadIdx.x / 32 and tx = threadIdx.x % 32. This division probably happens internally anyway when the block is not flat in the first place.
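For instance, a flat-block variant of the naive kernel along those lines might look like this (a sketch; THREADS and IDX are taken from the question, and tx is derived from the fast-varying remainder so the access pattern stays coalesced):

__global__ void matmul_flat(int *a, int *b, int *c, int n) {
    int tx = threadIdx.x % THREADS; // contiguous dimension
    int ty = threadIdx.x / THREADS;
    int col = blockIdx.x * THREADS + tx;
    int row = blockIdx.y * THREADS + ty;
    int t = 0;
    for (int i = 0; i < n; i++)
        t += a[IDX(row, i, n)] * b[IDX(i, col, n)];
    c[IDX(row, col, n)] = t;
}

// launched with flat blocks of THREADS*THREADS threads, e.g.
// dim3 grids(N / THREADS, N / THREADS);
// matmul_flat<<<grids, THREADS * THREADS>>>(da, db, dc, N);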

paleonix
  • Yes. A minor point to add would be that CUDA memory is column-major, unlike host memory (C/C++ assumes row-major for cache-based improvements). I was under the assumption that device memory is also row-major, and therefore `A[tx * s + ty]` made sense to me. – Rohit Jena Apr 16 '22 at 14:59
  • Everything in C++/CUDA is row-major if not specified differently. The memory itself is flat, it can be neither column nor row-major. I'm not sure what you mean. – paleonix Apr 16 '22 at 15:02
  • The point is that in CPU multi-threading doing an access pattern like above would be bad due to cache thrashing, as multiple cores would work on the same cache line, but in CUDA a thread is more like a SIMD lane and therefore this access pattern makes sense. – paleonix Apr 16 '22 at 15:07
  • For completeness: The memory hardware is actually 2D (see e.g. the talk "How CUDA works" by Stephen Jones from GTC'22, it should appear on [NVIDIA On-Demand](https://www.nvidia.com/en-us/on-demand/) later this year), but this is not relevant here. – paleonix Apr 16 '22 at 15:15
  • I guess you want to say that `threadIdx` isn't mapped to warps the way one might assume in C/C++, but this has nothing to do with "CUDA memory" and it is already part of the answer. – paleonix Apr 16 '22 at 15:21
  • A minor point is that for coalesced memory access the alignment should also be optimal depending on the architecture, otherwise the transaction is split. – Sebastian Apr 16 '22 at 16:21
  • @Sebastian I have included it now, but I'm not sure if and what I should mention here about detecting and solving alignment issues in practice. I guess profiling with `ncu` could be a first step. It seems to be far harder to do it wrong as long as one uses powers of two for block sizes, etc. – paleonix Apr 16 '22 at 16:32
  • 1
  • I think the addition is good. (More details, cf. https://stackoverflow.com/questions/63497910/cache-behaviour-in-compute-capability-7-5). Very important is to keep the data in few (typically 4 for int/float) 32-byte blocks per instruction. Even if each thread accesses 4 contiguous bytes, you would need 5 sectors instead of 4 if your lowest address is not aligned to 32 bytes. However, for very fast memory accesses with cache hits from the L1 cache, the alignment requirement increases to 128 bytes. The data can be distributed in any way between the threads, and several threads can read the same data. – Sebastian Apr 16 '22 at 23:52