
I ran some CUDA code that updated an array of floats. I have a wrapper function like the one discussed in this question: How can I compile CUDA code then link it to a C++ project?

Inside my CUDA function I create a for loop like this...

int tid = threadIdx.x;
for(int i=0;i<X;i++)
{
     //code here
}

Now the issue is that if X equals 100, everything works just fine, but if X equals 1000000, my vector does not get updated (almost as if the code inside the for loop never executes).

Now inside the wrapper function, if I call the CUDA function in a for loop, it still works just fine (though, for some reason, it is significantly slower than simply doing the same work on the CPU), like this...

for(int i=0;i<1000000;i++)
{
      update<<<NumObjects,1>>>(dev_a, NumObjects);
}

Does anyone know why I can loop a million times in the wrapper function, but not simply call the CUDA "update" function once and run a million-iteration for loop inside it?

Matthew
    possible duplicate of [CUDA limit seems to be reached, but what limit is that?](http://stackoverflow.com/questions/6913206/cuda-limit-seems-to-be-reached-but-what-limit-is-that) – talonmies Mar 01 '12 at 14:23
  • When you use the larger value of X, does your kernel execute at all? Are you doing any error checking? You should. Is X a compile-time constant or #define? If so, are you checking the shared and constant memory requirements, and number of registers, using appropriate compiler flags? Are you then exploring the consequences using the NVIDIA CUDA Occupancy Calculator? Lots of things could be going on. – Patrick87 Mar 01 '12 at 16:35
  • Thanks Patrick... X is simply a variable for the purpose of this post. I normally replace "x" with a hardcoded value like "1000000" Talonmies has a good post and I believe that is the reason why... – Matthew Mar 01 '12 at 21:35

1 Answer


You should be calling cudaThreadSynchronize (now cudaDeviceSynchronize) and cudaGetLastError after running this to see whether there was an error. My guess is that the first version timed out: if a kernel takes too long to complete on a GPU that is also driving a display, the watchdog timer kills it, and the card simply gives up on the kernel.
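A minimal sketch of that error-checking pattern, assuming the `update` kernel, `dev_a`, and `NumObjects` from the question (the kernel body here is a placeholder, not the asker's actual code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void update(float *a, int n)
{
    int tid = threadIdx.x;
    if (tid < n)
        a[tid] += 1.0f;   // placeholder per-object work
}

int main()
{
    const int NumObjects = 256;
    float *dev_a = nullptr;
    cudaMalloc(&dev_a, NumObjects * sizeof(float));

    update<<<NumObjects, 1>>>(dev_a, NumObjects);

    // Launch-configuration errors show up immediately:
    cudaError_t err = cudaGetLastError();
    // Runtime errors (including a watchdog timeout that killed the
    // kernel) only surface once we synchronize:
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));

    cudaFree(dev_a);
    return 0;
}
```

If the million-iteration kernel was killed by the watchdog, the synchronize call here is what reports it; without these checks the launch fails silently and the vector is simply left unmodified.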

The second thing, the reason the looped version takes so much longer, is that there is a fixed overhead for each kernel launch. When the loop was inside the kernel, you paid this overhead once and then ran the loop. Now you are paying it a million times. The overhead is fairly small per launch, but large enough in aggregate that as much of the loop as possible should be moved inside the kernel.

If X is particularly large, you might look into running as much of the loop inside the kernel as will complete in a safe amount of time, and then looping over those kernel launches on the host.
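A sketch of that chunking idea; the names `update_chunk`, `TOTAL`, and `CHUNK` are illustrative, not from the question. Each launch runs only `CHUNK` iterations, so no single launch runs long enough to trip the watchdog, while the launch overhead is paid TOTAL/CHUNK times instead of TOTAL times:

```cuda
#include <cuda_runtime.h>

__global__ void update_chunk(float *a, int n, long start, long count)
{
    int tid = threadIdx.x;
    if (tid >= n) return;
    for (long i = start; i < start + count; i++)
    {
        // per-iteration work here (placeholder)
        a[tid] += 1.0f;
    }
}

void run_all(float *dev_a, int NumObjects)
{
    const long TOTAL = 1000000, CHUNK = 10000;
    for (long s = 0; s < TOTAL; s += CHUNK)
        update_chunk<<<NumObjects, 1>>>(dev_a, NumObjects, s, CHUNK);
    cudaDeviceSynchronize();   // surface any error from the launches
}
```

Tuning CHUNK is a trade-off: larger chunks amortize launch overhead better but get closer to the watchdog limit.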

P O'Conbhui