
I'm trying to compute Fourier transforms using CUDA on an NVIDIA GTX TITAN graphics card. I run into a problem once the launch reaches a certain number of blocks.

Here is what my card tells me when using cudaGetDeviceProperties:

  • maxThreadsPerBlock: 1024
  • maxThreadsDim: 1024 x 1024 x 64
  • maxGridSize: 2147483647 x 65535 x 65535

Here is the code I use to call my kernel function:

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);

unsigned int threads = prop.maxThreadsPerBlock;
unsigned int max_blocks = prop.maxGridSize[0];
// Round up so every pixel is covered by at least one thread
unsigned int blocks = (pixel_size + threads - 1) / threads;

// Clamp to the hardware limit on grid dimension x
if (blocks > max_blocks)
  blocks = max_blocks;

kernel_function <<<blocks, threads>>>(pixel_size);

And the kernel code:

__global__ void kernel_function(unsigned int pixel_size)
{
  unsigned int index = blockIdx.x * blockDim.x + threadIdx.x;

  while (index < pixel_size)
  {
    // Per-pixel processing here
    index += blockDim.x * gridDim.x;
  }
}

Where pixel_size is the size in pixels of an image block I want to do transforms on.

So threads is always equal to 1024, which is what I want. Whenever blocks is less than or equal to 65535, my code works fine. But when blocks goes above 65535, the results I get are nonsense and totally random. So what is the maximum number of blocks I can have in a one-dimensional problem? I assumed in the code above that it was 2147483647. What am I doing wrong?

I feel like I am using the wrong hardware limit for my number of blocks, because when I cap it at 65535, the code works fine.

Thank you in advance for your answers.

Jeff Bencteux
  • Did you do [proper error checking](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api) to see if the kernel executed successfully? (A minimal sketch of that pattern is shown after these comments.) Can you put together a runnable example? – pQB Oct 31 '14 at 14:31
  • And are you compiling for the correct architecture? – talonmies Oct 31 '14 at 14:46
  • It was actually the problem... Thanks very much. The compile options should be compute_35, sm_35 instead of compute_20, sm_20. – Jeff Bencteux Oct 31 '14 at 15:39
  • @talonmies Your comment was simple but it was the answer. Should you add it as an answer, or should the question be removed? (No offense, Jeffrey, but you didn't show much effort in the question and probably asked before thinking it through :) – pQB Oct 31 '14 at 17:00
  • As a matter of housekeeping, I'd like to see an answer posted to the question. I do not care who does it. I hope I am not imposing on @talonmies by saying that I respectfully think he will not care who does it either. I would suggest that Jeffrey provide an answer, explaining that a change was necessary to the compile command. The question is of some value to future readers that way. If I see no answer to it after a few days, and I am reminded of it, I will probably come back and post an answer. – Robert Crovella Oct 31 '14 at 20:49
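
For reference, a minimal sketch of the canonical error-checking pattern linked in the first comment (the gpuErrchk helper name is illustrative, not part of the CUDA API):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a readable message if a CUDA runtime call fails
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line)
{
  if (code != cudaSuccess)
  {
    fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
    exit(code);
  }
}

// After the launch: cudaGetLastError catches launch-configuration errors
// (such as an invalid grid size); cudaDeviceSynchronize surfaces errors
// raised while the kernel was running
kernel_function <<<blocks, threads>>>(pixel_size);
gpuErrchk(cudaGetLastError());
gpuErrchk(cudaDeviceSynchronize());

With this in place, an out-of-range grid dimension reports an explicit launch error instead of silently producing garbage.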

1 Answer


Problem solved: I was compiling with flags for the 2.x architecture instead of 3.5, so the 2.x limit was applying (which is a maximum of 65535 blocks in the x dimension). After compiling with compute_35, sm_35, it worked.
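
For anyone hitting the same thing, a sketch of the kind of nvcc invocation involved (the file names are just examples):

# Target compute capability 3.5 (GTX TITAN) so the 3.x grid-size limits apply
nvcc -gencode arch=compute_35,code=sm_35 -o fft fft.cu

Code built for compute capability 2.x is limited to 65535 blocks in each grid dimension; the 2147483647 limit on gridDim.x only applies to code built for 3.x or newer.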

Thanks @talonmies.

Jeff Bencteux