Timing CUDA code is a bit different from timing CPU code. First of all, make sure you do not include CUDA's one-time initialization cost in your measurement: call any CUDA function at the start of your application so the context is created before you start timing; otherwise initialization may happen inside your timed region.
int main(int argc, char **argv) {
    cudaFree(0); // harmless call that forces CUDA context creation
    ...          // CUDA is now initialized
}
Then use a CUTIL timer like this:
unsigned int timer = 0;
cutCreateTimer(&timer);
cutStartTimer(timer);
// your code, to assess elapsed time...
cutStopTimer(timer);
printf("Elapsed: %.3f ms\n", cutGetTimerValue(timer)); // value is in milliseconds
cutDeleteTimer(timer);
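Note that the CUTIL timer ships with the SDK samples, not with the CUDA runtime itself. If you prefer not to depend on CUTIL, the runtime's event API can time GPU work directly; a minimal sketch (the timed region is a placeholder for your own code):

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);      // enqueue start marker on stream 0
// your code, to assess elapsed time...
cudaEventRecord(stop, 0);       // enqueue stop marker
cudaEventSynchronize(stop);     // block until the stop event is reached

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // result in milliseconds
printf("Elapsed: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Because the events are recorded on the GPU's own stream, this measures GPU execution without needing an explicit cudaThreadSynchronize() before stopping a CPU-side timer.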
Now, after these preliminary steps, let's look at the problem. When a kernel is launched, the CPU stalls only until the launch has been delivered to the GPU; the kernel then runs asynchronously while the CPU continues executing. If you call cudaThreadSynchronize(), the CPU blocks until the GPU has finished all previously issued work. A cudaMemcpy from device memory also waits for the GPU to finish, because the values it copies are the ones the kernel is supposed to fill in.
kernel<<<numBlocks, threadPerBlock>>>(...);
cudaError_t err = cudaThreadSynchronize(); // block until the kernel finishes
if (cudaSuccess != err) {
    fprintf(stderr, "cudaCheckError() failed at %s:%i : %s.\n",
            __FILE__, __LINE__, cudaGetErrorString(err));
    exit(1);
}
// now the kernel is complete..
cutStopTimer(timer);
So place a synchronization before calling the stop timer function. If you instead put a memory copy right after the kernel launch, the copy's measured time will include part of the kernel execution, so perform the memcpy outside the timed region.
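Putting the pieces together, the ordering would look like this (a sketch; the kernel, buffer names, and sizes are placeholders):

```cuda
cutStartTimer(timer);
kernel<<<numBlocks, threadPerBlock>>>(...);
cudaThreadSynchronize();   // make sure the kernel has actually finished
cutStopTimer(timer);       // the timer now covers only the kernel

// copy results back outside the timed region
cudaMemcpy(h_result, d_result, size, cudaMemcpyDeviceToHost);
```

If you want to time the transfer as well, measure it in a separate timed region rather than letting it absorb leftover kernel time.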
There are also profiler counters that can be used to assess specific sections of your kernels.