I use x and y to compute the cells of a matrix on the device. When lenA and lenB are larger than 32, the breakpoint in the device code (at int x = threadIdx.x;) is never hit and the output is incorrect.

In host code:

int lenA=52;
int lenB=52;

dim3 threadsPerBlock(lenA, lenB);
dim3 numBlocks(lenA / threadsPerBlock.x, lenB / threadsPerBlock.y);

kernel_matrix<<<numBlocks,threadsPerBlock>>>(dev_A, dev_B);

In device code:

int x = threadIdx.x;
int y = threadIdx.y;
...
mahdimb
  • You forgot to ask a question... – talonmies May 13 '13 at 19:24
  • Why does the breakpoint stop working and the program give a wrong answer when lenA or lenB is larger than 32, while everything is fine with smaller values? Do I need a different approach for initializing x and y? – mahdimb May 13 '13 at 19:32
  • That should be written into your question, not dropped as a comment. Remember, this question and answer exist as much for the next person that comes along with the same question as it does for your help. – talonmies May 14 '13 at 05:52

1 Answer

Your threadsPerBlock dim3 variable must stay within the threads-per-block limit of the compute capability (CC) that you are targeting.

CC 1.x devices can handle up to 512 threads per block.

CC 2.0 through 8.6 devices can handle up to 1024 threads per block.

A dim3 variable of (32,32) specifies 1024 (= 32x32) threads per block, which is exactly the limit. With lenA = lenB = 52 you are requesting 52x52 = 2704 threads per block, and when you exceed the limit the kernel launch fails.

If you did CUDA error checking on your kernel launch, you would see the error.

Since the kernel doesn't actually launch with this type of error, any breakpoints set in the kernel code also won't be hit.
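As a minimal sketch of what that error checking could look like (the names dummy_kernel and try_launch are hypothetical, introduced only for illustration, and the kernel is deliberately empty): calling cudaGetLastError() right after the launch reports configuration problems such as too many threads per block, and cudaDeviceSynchronize() reports errors that occur while the kernel runs.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical empty kernel, used only to demonstrate launch checking.
__global__ void dummy_kernel() {}

// Launches dummy_kernel in a single block of bx * by threads and
// returns the first CUDA error encountered (cudaSuccess if none).
cudaError_t try_launch(unsigned bx, unsigned by)
{
    dim3 threadsPerBlock(bx, by);
    dummy_kernel<<<1, threadsPerBlock>>>();

    // Launch-configuration errors are reported here, immediately after the launch.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "launch with %ux%u threads failed: %s\n",
                bx, by, cudaGetErrorString(err));
        return err;
    }

    // Errors raised while the kernel executes surface at the next synchronizing call.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        fprintf(stderr, "kernel execution failed: %s\n", cudaGetErrorString(err));
    return err;
}
```

Calling try_launch(52, 52) from main() should return an error (typically reported as an invalid configuration argument), whereas try_launch(32, 32) is within the 1024-thread limit on CC 2.0 and later devices.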

Additional notes:

  1. You won't get any compilation error for threads per block, regardless of what you do. It doesn't work that way. The compiler doesn't check that.

  2. If you do proper CUDA error checking you will get a runtime error report, and even if you don't do proper CUDA error checking, your kernel will not actually run with that sort of error.
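One way to cover a 52x52 matrix without exceeding the limit is to use a fixed block size and several blocks, then build x and y from blockIdx as well as threadIdx. The following is only a sketch under assumptions: the kernel fill_cells, the helpers blocks_for and launch_fill, the 16x16 block size, and the row-major int array are all hypothetical and are not taken from the question's kernel_matrix. The grid size is rounded up, because plain integer division (52 / 16 = 3) would leave the last cells uncovered, and the kernel bounds-checks for the same reason.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: one thread per cell of a lenA-wide, lenB-high
// matrix stored row-major in `cells`.
__global__ void fill_cells(int *cells, int lenA, int lenB)
{
    // Global coordinates combine the block index with the thread index.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // The grid is rounded up, so some threads fall outside the matrix.
    if (x < lenA && y < lenB)
        cells[y * lenA + x] = y * lenA + x;
}

// Number of blocks of size `block` needed to cover n elements (rounds up).
// Assumes n and block are positive.
inline unsigned blocks_for(unsigned n, unsigned block)
{
    return (n + block - 1) / block;
}

// Launches fill_cells over the whole matrix and returns the launch status.
cudaError_t launch_fill(int *dev_cells, int lenA, int lenB)
{
    dim3 threadsPerBlock(16, 16);  // 256 threads, well under the per-block limit
    dim3 numBlocks(blocks_for(lenA, threadsPerBlock.x),
                   blocks_for(lenB, threadsPerBlock.y));
    fill_cells<<<numBlocks, threadsPerBlock>>>(dev_cells, lenA, lenB);
    return cudaGetLastError();
}
```

With lenA = lenB = 52 this launches a 4x4 grid of 16x16 blocks; dev_cells is assumed to be a device allocation of at least lenA * lenB ints.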

Robert Crovella
  • Thanks. I can only use the first 32 threads; how can I access more than 32 threads with x and y? – mahdimb May 13 '13 at 19:45
  • You handle those elements in other threadblocks. Each threadblock handles up to 1024 data elements, which could be a 32x32 block, or a 64x16 block, or whatever numbers you like, so that the total does not exceed 1024. If you simply want to *access* other elements in some data array, you use ordinary indexing for that. – Robert Crovella May 13 '13 at 19:50