I am new to CUDA and learning GPU programming. I want to add two n×m matrices (float* A and float* B) and store the result in float* C inside a kernel. The goal is the fastest possible implementation. I have the following questions:
I was wondering how to arrange the blocks and the grid to get the best performance (for both small and large n and m).
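For reference, this is what I do right now. It is a minimal sketch; matAdd, launchMatAdd, the row-major layout, and the 16×16 block size are just my own choices, not something I copied from a known-good example:

```cuda
#include <cuda_runtime.h>

// One thread per element: x covers columns, y covers rows.
__global__ void matAdd(const float* A, const float* B, float* C, int n, int m)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < n && col < m) {
        int idx = row * m + col;  // row-major layout
        C[idx] = A[idx] + B[idx];
    }
}

void launchMatAdd(const float* dA, const float* dB, float* dC, int n, int m)
{
    dim3 block(16, 16);                        // is 16x16 a good default?
    dim3 grid((m + block.x - 1) / block.x,     // round up so the grid
              (n + block.y - 1) / block.y);    // covers every element
    matAdd<<<grid, block>>>(dA, dB, dC, n, m);
}
```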
Assigning one thread to each element of the matrices seems natural. However, for very large n and m that is not always possible (the grid can only hold so many threads). What is the best option then?
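From what I have read, a grid-stride loop might be the answer, where each thread handles several elements. Is something like this sketch the right idea? (matAddStride and the 256×256 launch shape are my own guesses.)

```cuda
// Grid-stride loop: treat the n x m matrix as one flat array and let each
// thread step through it in strides of the total number of threads.
__global__ void matAddStride(const float* A, const float* B, float* C,
                             size_t total)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < total; i += stride)
        C[i] = A[i] + B[i];
}

// Launched with a fixed grid, independent of the matrix size:
//   matAddStride<<<256, 256>>>(dA, dB, dC, (size_t)n * m);
```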
How can matrix padding improve performance?
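My guess is that padding refers to something like cudaMallocPitch, which pads each row so that every row starts at an aligned address. Is that the mechanism meant, and does the alignment actually help for a plain element-wise add? Here is a sketch of my understanding (allocPadded is a hypothetical helper of mine):

```cuda
#include <cuda_runtime.h>

// Allocate an n x m float matrix with padded rows. The runtime picks the
// pitch (padded row width in bytes) so that each row starts aligned.
void allocPadded(int n, int m)
{
    float* dA = nullptr;
    size_t pitch = 0;  // padded row size in bytes, >= m * sizeof(float)
    cudaMallocPitch((void**)&dA, &pitch, (size_t)m * sizeof(float), (size_t)n);

    // Row r then starts at (float*)((char*)dA + r * pitch),
    // not at dA + r * m, because of the padding bytes at the end of each row.

    cudaFree(dA);
}
```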