
I have written a CUDA function similar to cublasSdgmm in cuBLAS, and I find that when I increase the number of blocks, performance may get worse, or the kernel may even fail.

Here is the code. M = 9.6e6, S = 3, and the best-performing block count is 320. My GPU is a GTX 960, and the maximum grid size is 2147483647 in the X dimension.

__global__ void DgmmKernel(float *d_y, float *d_r, int M, int S){
  int row = blockIdx.x*blockDim.x + threadIdx.x;
  int col = blockIdx.y*blockDim.y + threadIdx.y;
  while(row < M){
    d_y[row + col * M] *= d_r[row];
    row += blockDim.x * gridDim.x;
  }
}
void Dgmm(float *d_y, float *d_r, int M, int S){
  int xthreads_per_block = 1024;
  dim3 dimBlock(xthreads_per_block, 1);
  dim3 dimGrid(320, S);
  DgmmKernel<<<dimBlock, dimGrid>>>(d_y, d_r, M, S);
}

I guess the reason is that there may be a resource limit on the GPU. Is that right?

If so, which specific resource limits the performance? The kernel just reads two vectors and performs a multiplication. And is there any way to improve performance on my GPU?

asked by zjhthu

1 Answer


You have the block and grid dimension arguments reversed in your kernel launch, so your kernel should never be running at all. You should do something like this:

  dim3 dimBlock(xthreads_per_block, 1);
  dim3 dimGrid(320, S);
  DgmmKernel<<<dimGrid, dimBlock>>>(d_y, d_r, M, S);

If your code contained appropriate runtime error checking, you would already be aware that the kernel launch is failing with an invalid configuration error for any value of S > 3 (because the swapped launch asks for blocks of 320 × S threads, which exceeds the 1024-threads-per-block limit).
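One common pattern for that runtime error checking is a wrapper macro around CUDA API calls, plus a `cudaGetLastError()` check after each launch. This is only a sketch; the `CUDA_CHECK` name is my own, not part of the CUDA API:

```cuda
#include <cstdio>
#include <cstdlib>

// Hypothetical helper: abort with a readable message if a CUDA call fails.
#define CUDA_CHECK(call)                                            \
  do {                                                              \
    cudaError_t err__ = (call);                                     \
    if (err__ != cudaSuccess) {                                     \
      fprintf(stderr, "CUDA error \"%s\" at %s:%d\n",               \
              cudaGetErrorString(err__), __FILE__, __LINE__);       \
      exit(EXIT_FAILURE);                                           \
    }                                                               \
  } while (0)

void Dgmm(float *d_y, float *d_r, int M, int S){
  int xthreads_per_block = 1024;
  dim3 dimBlock(xthreads_per_block, 1);
  dim3 dimGrid(320, S);
  DgmmKernel<<<dimGrid, dimBlock>>>(d_y, d_r, M, S);
  CUDA_CHECK(cudaGetLastError());       // catches invalid launch configurations
  CUDA_CHECK(cudaDeviceSynchronize());  // catches errors raised during kernel execution
}
```

With the arguments reversed as in the question, `cudaGetLastError()` would have reported `invalid configuration argument` immediately instead of the kernel silently not running.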

answered by talonmies