I'm trying to use CUDA to accelerate my code, and it had been working until this kernel. The kernel should copy an m x n matrix d_A to another memory region while skipping the pth row, resulting in an (m-1) x n matrix.

__global__ void copyWOp(int m, int n, int p, double* d_tobeCopied, double* d_Copied) //copy tobeCopied to copied without pth row
{
    int thread2Dpx=blockIdx.x * blockDim.x + threadIdx.x;
    int thread2Dpy=blockIdx.y * blockDim.y + threadIdx.y;
    if (thread2Dpx>=m || thread2Dpy>=n)
        return;
    int thread1Dp=thread2Dpy*(m)+thread2Dpx;

    if (thread2Dpx<p)
      d_Copied[thread2Dpy*(m-1)+thread2Dpx]=d_tobeCopied[thread1Dp];
    else if (thread2Dpx==p)
      return;
    else
      d_Copied[thread2Dpy*(m-1)+thread2Dpx-1]=d_tobeCopied[thread1Dp];
} 

The following is how I call the kernel:

cudaMalloc(&d_newA,(m-1)*n*sizeof(double));
const dim3 blockSize1(32,32,1);
const dim3 gridSize1 ((m + blockSize1.x - 1) / blockSize1.x, (n + blockSize1.y - 1) / blockSize1.y,1);
copyWOp<<<blockSize1,gridSize1>>>(m,n,p,d_A,d_newA);
cudaFree(d_A);

d_A=d_newA;

But when I check the norm of d_newA after the kernel runs, it is exactly zero, whereas the norm of d_A is not. So the kernel is clearly not working.

I have written several similar kernels, and they all work using the exact same indexing variables.

I know the kernel is quite naive, but I want to get things working first before optimizing.

talonmies
Roger Lee

1 Answer

You have your block and grid kernel launch parameters reversed. That is certainly a problem, and it will most likely prevent your kernel from launching:

const dim3 blockSize1(32,32,1);
const dim3 gridSize1 ((m + blockSize1.x - 1) / blockSize1.x, (n + blockSize1.y - 1) / blockSize1.y,1);
copyWOp<<<blockSize1,gridSize1>>>(m,n,p,d_A,d_newA);
            ^ 
            |  
         grid parameter comes first, then block
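For comparison, here is a minimal sketch of the launch with the arguments in the intended order. The kernel body is a condensed but equivalent form of the one in the question, assuming column-major storage as the question's indexing implies. The wrapper name launchCopyWOp is hypothetical and only exists to keep the example self-contained.

```cuda
#include <cuda_runtime.h>

// Condensed form of the kernel from the question: column-major m x n input,
// row p dropped, (m-1) x n output.
__global__ void copyWOp(int m, int n, int p, double* d_tobeCopied, double* d_Copied)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    if (row >= m || col >= n || row == p)
        return;
    int dstRow = (row < p) ? row : row - 1;  // rows after p shift up by one
    d_Copied[col * (m - 1) + dstRow] = d_tobeCopied[col * m + row];
}

// Hypothetical wrapper (not part of the original post).
// The first launch argument is the grid, the second is the block.
cudaError_t launchCopyWOp(int m, int n, int p, double* d_A, double* d_newA)
{
    const dim3 blockSize1(32, 32, 1);
    const dim3 gridSize1((m + blockSize1.x - 1) / blockSize1.x,
                         (n + blockSize1.y - 1) / blockSize1.y, 1);
    copyWOp<<<gridSize1, blockSize1>>>(m, n, p, d_A, d_newA);
    return cudaGetLastError();  // reports launch configuration errors
}
```

With this ordering every block has 32 x 32 = 1024 threads regardless of m and n, and only the grid grows with the matrix size.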

Any time you're having trouble with CUDA code, it's a good idea to use proper CUDA error checking. You would likely have gotten an indication that something was wrong with the kernel launch, and the actual error could indicate that the problem is specifically with one of your kernel launch parameters.
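As a rough illustration of what such checking can look like, the sketch below launches a do-nothing kernel and inspects the result with cudaGetLastError and cudaDeviceSynchronize. The names emptyKernel and launchAndCheck are hypothetical, chosen only for this example; real code would apply the same two checks after its own kernel launch.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel, an assumption for illustration only.
__global__ void emptyKernel() {}

// Hypothetical helper: launches with the given shapes and returns the first
// error seen, printing a readable message if there is one.
cudaError_t launchAndCheck(dim3 grid, dim3 block)
{
    emptyKernel<<<grid, block>>>();

    // Errors in the launch itself (for example, too many threads per block)
    // show up here, right after the launch.
    cudaError_t err = cudaGetLastError();

    // Errors raised while the kernel runs only show up after a sync.
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();

    if (err != cudaSuccess)
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    return err;
}
```

Calling launchAndCheck(dim3(32, 32), dim3(64, 64)) asks for 4096 threads per block, which exceeds the per-block limit on current GPUs, so the call should return something other than cudaSuccess instead of failing silently.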

Community
Robert Crovella
  • You are indeed correct! But this leaves me wondering why all my previous kernels, called this way, work. Maybe it's because the dimensions were not as large, so I didn't ask for too many threads in a block even though I reversed the parameters? – Roger Lee Mar 20 '15 at 03:37
  • It's difficult to say without seeing the code. It's certainly possible to have "compatible" grid and block dimensions, such as 16x16. – Robert Crovella Mar 20 '15 at 06:35
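To make the "compatible dimensions" point concrete, here is a small host-side sketch that computes how many threads per block a swapped launch would request, assuming the same 32 x 32 block and grid formula as in the question. The helper names gridFor and threadsIfSwapped are hypothetical. The limit to compare against is cudaDeviceProp::maxThreadsPerBlock, which is 1024 on current GPUs; querying it with cudaGetDeviceProperties is safer than hard-coding it.

```cuda
#include <cuda_runtime.h>

// Grid shape as computed in the question, for 32 x 32 blocks.
dim3 gridFor(int m, int n)
{
    return dim3((m + 31) / 32, (n + 31) / 32, 1);
}

// If that grid is passed in the block slot by mistake, each block is asked
// to hold this many threads.
unsigned threadsIfSwapped(int m, int n)
{
    dim3 g = gridFor(m, n);
    return g.x * g.y * g.z;
}
```

For a 512 x 512 matrix the swapped launch asks for 16 x 16 = 256 threads per block, which fits, and the 32 x 32 grid of such blocks still covers every element, so the kernel works by accident. For a 1025 x 1024 matrix it asks for 33 x 32 = 1056 threads, which exceeds 1024, so the launch fails and the output buffer is never written.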