
Here is my code:

int threadNum = BLOCKDIM/8;
dim3 dimBlock(threadNum,threadNum);
int blocks1 = nWidth/threadNum + (nWidth%threadNum == 0 ? 0 : 1);
int blocks2 = nHeight/threadNum + (nHeight%threadNum == 0 ? 0 : 1);
dim3 dimGrid;
dimGrid.x = blocks1;
dimGrid.y = blocks2;

//  dim3 numThreads2(BLOCKDIM);
//  dim3 numBlocks2(numPixels/BLOCKDIM + (numPixels%BLOCKDIM == 0 ? 0 : 1) );
perform_scaling<<<dimGrid,dimBlock>>>(imageDevice,imageDevice_new,min,max,nWidth, nHeight);
cudaError_t err = cudaGetLastError();
cudasafe(err,"Kernel2");

This is the launch of my second kernel, and it is fully independent in terms of data usage. BLOCKDIM is 512, nWidth and nHeight are 512 too, and cudasafe simply prints the string message corresponding to the error code. This section of the code gives a configuration error immediately after the kernel call.

What might cause this error? Any ideas?

Ashwin Nanjappa
erogol

2 Answers

This type of error message frequently refers to the launch configuration parameters (grid/threadblock dimensions in this case; it could also be shared memory, etc. in other cases). When you see a message like this, it's a good idea to print out your actual configuration parameters before launching the kernel, to see if you've made any mistakes.

You said BLOCKDIM = 512. You have threadNum = BLOCKDIM/8 so threadNum = 64. Your threadblock configuration is:

dim3 dimBlock(threadNum,threadNum);

So you are asking to launch blocks of 64 x 64 threads, that is 4096 threads per block. That won't work on any generation of CUDA devices. All current CUDA devices are limited to a maximum of 1024 threads per block, which is the product of the 3 block dimensions.

Maximum dimensions are listed in table 14 of the CUDA programming guide, and also available via the deviceQuery CUDA sample code.

Robert Crovella
  • I know that my card has a configuration of 1024 threads for each block. Is it the same to have a 32*32 2D configuration and a 1D 1024-thread configuration? – erogol Apr 20 '13 at 21:47
  • 1024 threads is the limit on a per-block basis. You can have any set of 1D, 2D, or 3D dimensions that don't exceed this. So 1024x1, 512x2, 256x4, 128x8, etc. are all acceptable 2D limits. Similarly for 3D, e.g. 16x8x8, 32x8x4, 64x4x4, etc. are all acceptable 3D limits. The `deviceQuery` cuda sample will give information about total and per-dimension limits. But regardless of the per-dimension limits, the actual total product cannot exceed the total limit of 1024 or whatever is appropriate for your device. – Robert Crovella Apr 20 '13 at 21:52

Just to add to the previous answers, you can also query the maximum threads allowed from within your code, so it can run on other devices without hard-coding the number of threads you will use:

int device;
cudaGetDevice(&device);  // query the currently active device
cudaDeviceProp properties;
cudaGetDeviceProperties(&properties, device);
cout<<"using "<<properties.multiProcessorCount<<" multiprocessors"<<endl;
cout<<"max threads per block: "<<properties.maxThreadsPerBlock<<endl;
cout<<"max threads per processor: "<<properties.maxThreadsPerMultiProcessor<<endl;
Niko