Timing CUDA code is a bit different from timing CPU code. First of all, make sure you do not include CUDA's one-time initialization cost in your measurement: call any CUDA function at the start of your application so the context is created before you start timing; otherwise initialization may happen inside your timed region.
int main(int argc, char **argv) {
    cudaFree(0); // harmless call that forces CUDA context creation
    ...          // CUDA is now initialized
}
Then use a CUTIL timer like this:
unsigned int timer = 0;
cutCreateTimer(&timer);
cutStartTimer(timer);
// your code, to assess elapsed time...
cutStopTimer(timer);
printf("Elapsed: %.3f ms\n", cutGetTimerValue(timer)); // value is in milliseconds
cutDeleteTimer(timer);
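Note that the CUTIL timer ships with the SDK samples, not with the CUDA runtime itself. If you prefer not to depend on CUTIL, the runtime's event API can time GPU work directly; a minimal sketch (the timed region is a placeholder for your own code):

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);      // enqueue start marker on stream 0
// your code, to assess elapsed time...
cudaEventRecord(stop, 0);       // enqueue stop marker
cudaEventSynchronize(stop);     // block until the stop event is reached

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // result in milliseconds
printf("Elapsed: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Because the events are recorded on the GPU's own stream, this measures GPU execution without needing an explicit cudaThreadSynchronize() before stopping a CPU-side timer.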
Now, after these preliminary steps, let's look at the problem. When a kernel is launched, the CPU stalls only until the launch has been delivered to the GPU; the kernel then runs asynchronously while the CPU continues executing. If you call cudaThreadSynchronize(), the CPU blocks until the GPU has finished all previously issued work. A cudaMemcpy from device memory also waits for the GPU to finish, because the values it copies are the ones the kernel is supposed to fill in.
kernel<<<numBlocks, threadPerBlock>>>(...);
cudaError_t err = cudaThreadSynchronize(); // block until the kernel finishes
if (cudaSuccess != err) {
    fprintf(stderr, "cudaCheckError() failed at %s:%i : %s.\n",
            __FILE__, __LINE__, cudaGetErrorString(err));
    exit(1);
}
// now the kernel is complete..
cutStopTimer(timer);
So place a synchronization before calling the stop timer function. If you instead put a memory copy right after the kernel launch, the copy's measured time will include part of the kernel execution, so perform the memcpy outside the timed region.
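Putting the pieces together, the ordering would look like this (a sketch; the kernel, buffer names, and sizes are placeholders):

```cuda
cutStartTimer(timer);
kernel<<<numBlocks, threadPerBlock>>>(...);
cudaThreadSynchronize();   // make sure the kernel has actually finished
cutStopTimer(timer);       // the timer now covers only the kernel

// copy results back outside the timed region
cudaMemcpy(h_result, d_result, size, cudaMemcpyDeviceToHost);
```

If you want to time the transfer as well, measure it in a separate timed region rather than letting it absorb leftover kernel time.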
There are also profiler counters that can be used to assess specific sections of your kernels.