I ran some CUDA code that updates an array of floats. I have a wrapper function like the one discussed in this question: How can I compile CUDA code then link it to a C++ project?
Inside my CUDA function I have a for loop like this:
int tid = threadIdx.x;
for(int i=0;i<X;i++)
{
//code here
}
The issue is that if X is 100, everything works fine, but if X is 1000000, my array does not get updated (almost as if the code inside the for loop never executes).
If I instead call the CUDA function from a for loop inside the wrapper function, as shown here, it still works fine, although it is significantly slower than doing the same work entirely on the CPU:
for(int i=0;i<1000000;i++)
{
update<<<NumObjects,1>>>(dev_a, NumObjects);
}
Does anyone know why I can loop a million times in the wrapper function, but cannot call the CUDA "update" function once and run a for loop of a million iterations inside it?
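To make the comparison concrete, here is a minimal sketch of the two variants. The loop body (adding 1.0f per iteration), the extra iterations parameter, the run_update helper, the use of blockIdx.x as the element index, and the array size of 256 are all assumptions for illustration, not my actual code. The error checks are only there to show whether CUDA reports a failure in either variant.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: the real loop body is not shown above, so this one
// just adds 1.0f to one element on every iteration.
__global__ void update(float *a, int n, int iterations)
{
    int idx = blockIdx.x;              // launched as <<<n, 1>>>: one thread per block
    if (idx >= n) return;
    float v = a[idx];
    for (int i = 0; i < iterations; i++)
    {
        v += 1.0f;
    }
    a[idx] = v;
}

// Hypothetical wrapper: performs `launches` kernel launches, each running
// `iterations` loop steps, and returns the first CUDA error it encounters.
cudaError_t run_update(float *host, int n, int launches, int iterations)
{
    float *dev_a = nullptr;
    size_t bytes = static_cast<size_t>(n) * sizeof(float);

    cudaError_t err = cudaMalloc(&dev_a, bytes);
    if (err != cudaSuccess) return err;

    err = cudaMemcpy(dev_a, host, bytes, cudaMemcpyHostToDevice);
    for (int i = 0; i < launches && err == cudaSuccess; i++)
    {
        update<<<n, 1>>>(dev_a, n, iterations);
        err = cudaGetLastError();      // errors detected at launch time
    }
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize(); // errors raised while the kernels run
    if (err == cudaSuccess)
        err = cudaMemcpy(host, dev_a, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dev_a);
    return err;
}

int main()
{
    const int NumObjects = 256;        // assumed size
    const int X = 1000000;

    // Many launches first: an error from a failed kernel can persist for
    // later CUDA calls in the same process.
    std::vector<float> many(NumObjects, 0.0f);
    cudaError_t err = run_update(many.data(), NumObjects, X, 1);
    printf("%d launches, 1 iteration each: %s, a[0] = %f\n",
           X, cudaGetErrorString(err), many[0]);

    // One launch with the whole loop inside the kernel.
    std::vector<float> single(NumObjects, 0.0f);
    err = run_update(single.data(), NumObjects, 1, X);
    printf("1 launch, %d iterations: %s, a[0] = %f\n",
           X, cudaGetErrorString(err), single[0]);
    return 0;
}
```

If the second line prints anything other than "no error", that would at least tell me the single long-running launch is failing rather than silently skipping the loop.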