When I profiled my program, I saw a delay of up to 100 ms at one point. I checked every operation, but no single operation took that long. Then I noticed that wherever I place a cudaThreadSynchronize call, the first call takes about 100 ms. So I wrote the example below. When cudaThreadSynchronize is called on the first line, the elapsed time reported at the end is less than 1 ms. If it is not called, the elapsed time is about 110 ms on average.
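#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cutil.h>  // CUDA SDK utility header: cut*Timer, CUDA_SAFE_CALL

int main(int argc, char **argv)
{
    cudaThreadSynchronize(); // Comment this out and the elapsed time rises to about 110 ms.

    unsigned int timer;
    cutCreateTimer(&timer);
    cutStartTimer(timer);

    float *data;
    CUDA_SAFE_CALL(cudaMalloc((void **)&data, sizeof(float) * 1024));

    cutStopTimer(timer);
    printf("CUT Elapsed: %.3f\n", cutGetTimerValue(timer));
    cutDeleteTimer(timer);

    CUDA_SAFE_CALL(cudaFree(data)); // release the test allocation
    return EXIT_SUCCESS;
}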
I think the cudaThreadSynchronize() call at the start triggers the initialization of the CUDA library. Is this the correct way to fully initialize CUDA so that the initialization does not distort the timing of other operations? Is calling cudaThreadSynchronize at the start sufficient and correct, or is there a better way?
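For comparison, here is a minimal sketch of the warm-up pattern I am asking about, written without the SDK timer utilities. It assumes that the CUDA runtime initializes lazily on the first call that needs a context, and it uses cudaFree(0) as the warm-up call, which is a commonly used idiom rather than something I have confirmed is the recommended one. The helper names warmUpCuda and timeCudaMallocMs are my own, and std::chrono stands in for the cut*Timer functions. Note also that cudaThreadSynchronize is deprecated in newer toolkits in favor of cudaDeviceSynchronize.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: force the lazy CUDA runtime/context initialization
// to happen now. cudaFree(0) frees nothing, but it is a runtime call that
// needs a context, so the one-time setup cost is paid here.
static cudaError_t warmUpCuda()
{
    return cudaFree(0);
}

// Hypothetical helper: time one cudaMalloc of `bytes` bytes with a host
// wall-clock timer. Returns elapsed milliseconds, or -1.0 on failure.
static double timeCudaMallocMs(size_t bytes)
{
    void *ptr = nullptr;
    auto t0 = std::chrono::steady_clock::now();
    cudaError_t err = cudaMalloc(&ptr, bytes);
    auto t1 = std::chrono::steady_clock::now();
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return -1.0;
    }
    cudaFree(ptr);
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main()
{
    // Pay the initialization cost before any timed region.
    cudaError_t err = warmUpCuda();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA warm-up failed: %s\n", cudaGetErrorString(err));
        return EXIT_FAILURE;
    }

    // Same allocation size as in the example above.
    double ms = timeCudaMallocMs(sizeof(float) * 1024);
    if (ms < 0.0) {
        return EXIT_FAILURE;
    }
    printf("cudaMalloc elapsed: %.3f ms\n", ms);
    return EXIT_SUCCESS;
}
```

If the assumption about lazy initialization holds, removing the warmUpCuda() call should move the one-time setup cost into the timed cudaMalloc, which would match the 110 ms I measured.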