I am trying to understand CUDA warps and thread divergence. Suppose I have a naive matrix multiplication kernel to multiply n x n matrices.
__global__ void matrix_multiply(float* a, float* b, float* c, int n)
{
    // Each thread computes one element of c.
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float tmp = 0.0f;
        for (int i = 0; i < n; ++i)
            tmp += a[row * n + i] * b[i * n + col];
        c[row * n + col] = tmp;
    }
}
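For reference, this is roughly how I launch it (a sketch; the device pointers d_a, d_b, d_c are placeholders for buffers I have already allocated and copied over):

    int n = 500;
    dim3 block(16, 16);   // 256 threads per block = 8 warps of 32 threads
    dim3 grid(32, 32);    // 512 x 512 threads in total, covering the 500 x 500 matrices
    matrix_multiply<<<grid, block>>>(d_a, d_b, d_c, n);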
With this configuration (a 32 by 32 grid of 16 by 16 blocks) and 500 by 500 matrices, how many warps contain threads that will encounter thread divergence?
My reasoning: each of the 32 thread blocks along the right edge of the matrix straddles the col < n boundary, and each block holds 256 threads, i.e. 8 warps, so shouldn't the number of warps with thread divergence be 32 * 8 = 256?
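In case my warp count is the part I have wrong, this is my understanding of how a 16 by 16 block is split into warps (a sketch, assuming the standard linearization where threadIdx.x varies fastest):

    // Threads in a block are numbered with threadIdx.x varying fastest,
    // and each consecutive group of 32 linear IDs forms a warp.
    int linear_id = threadIdx.y * blockDim.x + threadIdx.x; // 0..255 in a 16 x 16 block
    int warp_id   = linear_id / 32;                         // warps 0..7; each warp spans two rows of the block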