I am new to CUDA and learning GPU programming. I want to add two n×m matrices (float* A and float* B) and store the result in float* C inside a kernel. The goal is the fastest possible implementation. I have the following questions:
I was wondering how to arrange the blocks and the grid to get the best performance (for both small and large n and m).
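For reference, this is what I do right now. It is a minimal sketch; matAdd, launchMatAdd, the row-major layout, and the 16×16 block size are just my own choices, not something I copied from a known-good example:

```cuda
#include <cuda_runtime.h>

// One thread per element: x covers columns, y covers rows.
__global__ void matAdd(const float* A, const float* B, float* C, int n, int m)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < n && col < m) {
        int idx = row * m + col;  // row-major layout
        C[idx] = A[idx] + B[idx];
    }
}

void launchMatAdd(const float* dA, const float* dB, float* dC, int n, int m)
{
    dim3 block(16, 16);                        // is 16x16 a good default?
    dim3 grid((m + block.x - 1) / block.x,     // round up so the grid
              (n + block.y - 1) / block.y);    // covers every element
    matAdd<<<grid, block>>>(dA, dB, dC, n, m);
}
```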
Assigning one thread to each element of the matrices seems natural. However, for very large n and m that is not always possible (the grid can only hold so many threads). What is the best option then?
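From what I have read, a grid-stride loop might be the answer, where each thread handles several elements. Is something like this sketch the right idea? (matAddStride and the 256×256 launch shape are my own guesses.)

```cuda
// Grid-stride loop: treat the n x m matrix as one flat array and let each
// thread step through it in strides of the total number of threads.
__global__ void matAddStride(const float* A, const float* B, float* C,
                             size_t total)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < total; i += stride)
        C[i] = A[i] + B[i];
}

// Launched with a fixed grid, independent of the matrix size:
//   matAddStride<<<256, 256>>>(dA, dB, dC, (size_t)n * m);
```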
How can matrix padding improve performance?
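My guess is that padding refers to something like cudaMallocPitch, which pads each row so that every row starts at an aligned address. Is that the mechanism meant, and does the alignment actually help for a plain element-wise add? Here is a sketch of my understanding (allocPadded is a hypothetical helper of mine):

```cuda
#include <cuda_runtime.h>

// Allocate an n x m float matrix with padded rows. The runtime picks the
// pitch (padded row width in bytes) so that each row starts aligned.
void allocPadded(int n, int m)
{
    float* dA = nullptr;
    size_t pitch = 0;  // padded row size in bytes, >= m * sizeof(float)
    cudaMallocPitch((void**)&dA, &pitch, (size_t)m * sizeof(float), (size_t)n);

    // Row r then starts at (float*)((char*)dA + r * pitch),
    // not at dA + r * m, because of the padding bytes at the end of each row.

    cudaFree(dA);
}
```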