I have an application that uses CUDA to process data. The basic flow is:
- Transfer data H2D (this is around 1.5k integers)
- Invoke several kernels that transform and reduce the data to a single int value
- Copy result D2H
Profiling with Nsight shows that the H2D and D2H transfers average around 13 µs and 70 µs respectively. This is surprising to me, as the D2H transfer moves a tiny amount of data compared to the H2D one.
Both input and output memory locations are pinned.
Is this difference in transfer duration expected, or am I doing something wrong?
// Allocating the memory locations for I/O
cudaMallocHost((void**)&gpu_permutation_data, size_t(rowsPerThread) * size_t(permutation_size) * sizeof(keyEntry));
cudaMallocHost((void**)&gpu_constant_maxima, sizeof(keyEntry));
// H2D
cudaMemcpy(gpu_permutation_data, input.data(), size_t(permutation_size) * size_t(rowsPerThread) * sizeof(keyEntry), cudaMemcpyHostToDevice);
// kernels go here
// D2H
cudaMemcpy(&result, gpu_constant_maxima, sizeof(keyEntry), cudaMemcpyDeviceToHost);
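To put the reported numbers in perspective, here is a rough back-of-the-envelope sketch of the payload sizes and the throughput they imply. It assumes keyEntry is 4 bytes wide (its definition is not shown above) and takes "1.5k integers" as 1500 elements; the helper names are made up for illustration.

```cpp
#include <cstddef>
#include <cstdio>

// Bytes moved by a transfer of `count` elements of `elem_size` bytes each.
constexpr std::size_t payload_bytes(std::size_t count, std::size_t elem_size) {
    return count * elem_size;
}

// Effective throughput in MB/s (1 MB = 1e6 bytes) for `bytes` moved in
// `micros` microseconds. Bytes per microsecond equals MB/s numerically.
constexpr double effective_mb_per_s(std::size_t bytes, double micros) {
    return static_cast<double>(bytes) / micros;
}

// Prints the comparison for the timings reported in the question.
// ASSUMPTION: keyEntry is 4 bytes wide; the real type may differ.
inline void print_transfer_summary() {
    constexpr std::size_t elem_size = 4;                     // assumed sizeof(keyEntry)
    constexpr std::size_t h2d = payload_bytes(1500, elem_size); // ~1.5k integers
    constexpr std::size_t d2h = payload_bytes(1, elem_size);    // single result value
    std::printf("H2D: %zu bytes in 13 us -> %.1f MB/s\n",
                h2d, effective_mb_per_s(h2d, 13.0));
    std::printf("D2H: %zu bytes in 70 us -> %.3f MB/s\n",
                d2h, effective_mb_per_s(d2h, 70.0));
}
```

Calling print_transfer_summary() from main shows that, under these assumptions, the H2D copy moves about 6 KB and the D2H copy only 4 bytes, yet the smaller one takes over five times as long. Both implied rates are far below what a PCIe link can sustain, so payload size alone does not seem to explain either duration.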