
I have an application that uses CUDA to process data. The basic flow is:

  1. Transfer data H2D (this is around 1.5k integers)
  2. Invoke several kernels that transform and reduce the data to a single int value
  3. Copy result D2H

Profiling with Nsight shows that the H2D and D2H transfers average around 13 µs and 70 µs respectively. This is strange to me, since the D2H copy moves a tiny amount of data compared to the H2D copy.

Both input and output memory locations are pinned.

Is this difference in transfer duration expected, or am I doing something wrong?

//allocating the memory locations for IO
cudaMallocHost((void**)&gpu_permutation_data, size_t(rowsPerThread) * size_t(permutation_size) * sizeof(keyEntry));
cudaMallocHost((void**)&gpu_constant_maxima, sizeof(keyEntry));

//H2D
cudaMemcpy(gpu_permutation_data, input.data(), size_t(permutation_size) * size_t(rowsPerThread) * sizeof(keyEntry), cudaMemcpyHostToDevice);

// kernels go here

//D2H
cudaMemcpy(&result, gpu_constant_maxima, sizeof(keyEntry), cudaMemcpyDeviceToHost);
– Treeman
  • What matters most is that the host memory is pinned, not the memory accessed by the GPU. Here `gpu_permutation_data` and `gpu_constant_maxima` are supposed to be accessed by the GPU, while `input.data()` and `&result` should be located on the CPU. The thing is, they are certainly not pinned. In fact, I am not even sure the code is correct, since `cudaMallocHost` is for host allocations and not device allocations (the memory would still be accessible by the GPU, but then what is the point of performing `cudaMemcpy` calls?)... – Jérôme Richard Jan 17 '22 at 23:31 (a sketch of this buffer layout follows the comments)
  • https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/ – Treeman Jan 18 '22 at 00:45
  • You're looking at the duration of the API call, not the actual transfer time. `cudaMemcpy` is a blocking call. The CPU calls it, then it waits there for the GPU kernel to complete. – Robert Crovella Jan 18 '22 at 01:08
  • That makes sense now. I wasn't interpreting the Nsight output correctly, I think: it shows the kernels starting and finishing before the second memcpy. Are those just the kernel launches rather than the kernels themselves? Similarly, are the kernel durations shown for the launches or for the execution? – Treeman Jan 18 '22 at 04:41
  • A CUDA API call (such as `cudaMemcpy`) can be "measured" in 2 ways: 1. how much time the activity itself actually takes; 2. how much time the API call requires (from the start of the call until control is returned to the CPU thread for the next line of code). See [here](https://stackoverflow.com/questions/55574096/what-is-the-difference-between-gpu-activities-and-api-calls-in-the-results-o/55574423#55574423). A kernel call/launch is the same. Depending on which line in the profiler you are looking at, you may be looking at GPU activity (which will be the actual time the kernel is running) or the API call. – Robert Crovella Jan 20 '22 at 20:52
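
To illustrate Jérôme Richard's point above, here is a minimal, self-contained sketch (not part of the original thread) of the usual layout: device buffers allocated with cudaMalloc, pinned host staging buffers allocated with cudaMallocHost. The type keyEntry and the names permutation_size / rowsPerThread are taken from the question; the sizes, variable names, and everything else are illustrative assumptions.

// Sketch: device buffers via cudaMalloc, pinned host buffers via cudaMallocHost
#include <cuda_runtime.h>
#include <cstdio>

using keyEntry = unsigned int;              // assumed type, for illustration only

int main() {
    const size_t permutation_size = 1500;   // illustrative sizes
    const size_t rowsPerThread = 1;
    const size_t count = permutation_size * rowsPerThread;

    keyEntry *h_input = nullptr, *h_result = nullptr;              // pinned host memory
    keyEntry *d_permutation_data = nullptr, *d_maxima = nullptr;   // device memory

    cudaMallocHost((void**)&h_input, count * sizeof(keyEntry));
    cudaMallocHost((void**)&h_result, sizeof(keyEntry));
    cudaMalloc((void**)&d_permutation_data, count * sizeof(keyEntry));
    cudaMalloc((void**)&d_maxima, sizeof(keyEntry));
    cudaMemset(d_maxima, 0, sizeof(keyEntry));

    // H2D: pinned host buffer -> device buffer
    cudaMemcpy(d_permutation_data, h_input, count * sizeof(keyEntry), cudaMemcpyHostToDevice);

    // kernels would run here and write the reduced value into d_maxima

    // D2H: device buffer -> pinned host buffer
    cudaMemcpy(h_result, d_maxima, sizeof(keyEntry), cudaMemcpyDeviceToHost);
    printf("result = %u\n", *h_result);

    cudaFreeHost(h_input);
    cudaFreeHost(h_result);
    cudaFree(d_permutation_data);
    cudaFree(d_maxima);
    return 0;
}

With this split, the host endpoints of both copies are pinned, which is what lets the driver use fast DMA transfers for them.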

1 Answer


As Robert pointed out, Nsight displays the time from the start of the API call until it returns, so the time between when the copy API is called and when the transfer actually starts (after the preceding kernels finish) is included in the reported duration.
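
As an illustration (not from the original answer), here is a minimal, self-contained sketch of the two views Robert describes: the same tiny D2H copy is timed from the CPU side once immediately after launching a long-running kernel (the blocking call absorbs the kernel time) and once after an explicit cudaDeviceSynchronize() (only the copy itself remains). The slow_kernel and the timing code are assumptions for demonstration, not taken from the question.

#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

__global__ void slow_kernel(int *out) {
    // Spin long enough for the effect to be visible.
    long long v = 1;
    for (long long i = 0; i < 200000000LL; ++i) v = v * 3 + i;
    *out = (int)v;
}

int main() {
    int *d_out = nullptr;
    int result = 0;
    cudaMalloc(&d_out, sizeof(int));

    // Case 1: time the D2H cudaMemcpy right after the kernel launch. The call
    // blocks until the kernel finishes, so its measured duration includes the
    // kernel time -- this is what the long "D2H" entry in the profile reflects.
    slow_kernel<<<1, 1>>>(d_out);
    auto t0 = std::chrono::steady_clock::now();
    cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    auto t1 = std::chrono::steady_clock::now();
    printf("memcpy API call right after launch:  %.3f ms\n",
           std::chrono::duration<double, std::milli>(t1 - t0).count());

    // Case 2: synchronize first, then time the copy alone -- it is tiny.
    slow_kernel<<<1, 1>>>(d_out);
    cudaDeviceSynchronize();
    t0 = std::chrono::steady_clock::now();
    cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    t1 = std::chrono::steady_clock::now();
    printf("memcpy API call after explicit sync: %.3f ms\n",
           std::chrono::duration<double, std::milli>(t1 - t0).count());

    cudaFree(d_out);
    return 0;
}

In the profiler the distinction shows up as separate rows: the GPU-activity row reports the short copy itself, while the API row reports the full blocking call.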

– Treeman