I have write a CUDA function same as cublasSdgmm
in CUBLAS, and I find when I increase block number, function performance may be poorer, or even failed.
Here is the code, M = 9.6e6, S = 3, the best performance block number is 320, my GPU is GTX960, and the max block size is 2147483647 in X dimension.
__global__ void DgmmKernel(float *d_y, float *d_r, int M, int S){
int row = blockIdx.x*blockDim.x + threadIdx.x;
int col = blockIdx.y*blockDim.y + threadIdx.y;
while(row < M){
d_y[row + col * M] *= d_r[row];
row += blockDim.x * gridDim.x;
}
}
void Dgmm(float *d_y, float *d_r, int M, int S){
int xthreads_per_block = 1024;
dim3 dimBlock(xthreads_per_block, 1);
dim3 dimGrid(320, S);
DgmmKernel<<<dimBlock, dimGrid>>>(d_y, d_r, M, S);
}
I guess the reason is that there may be a resource limit in GPU, is it right?
If it is right, what specific resource limits the performance, the kernel function just reads two vectors and do a multiplication operation. And is there any method to improve performance on my GPU.