
It is said that zero-copy should be used in situations where the "read and/or write exactly once" constraint is met. That's fine.

I understand this, but my question is: why is zero-copy fast in the first place? After all, whether we use an explicit transfer via cudaMemcpy or zero-copy, in both cases the data has to travel through the PCI Express bus. Or does some other path exist (i.e., does the copy happen directly into GPU registers, bypassing device RAM)?

gpuguy
  • Lower Latency... – talonmies Jan 08 '21 at 01:57
  • Lower latency because it avoids the overheads of the cudaMemcpy function? – gpuguy Jan 08 '21 at 02:03
  • In terms of data transfer rate, it's not faster. So the premise of the question is suspect. And there is no way to avoid PCIE (for a PCIE-connected GPU) if there is a transfer between host and device. There is no other path. – Robert Crovella Jan 08 '21 at 02:25
  • If it is not faster, then why use it? – gpuguy Jan 08 '21 at 02:27
  • There might be a number of situations where zero-copy might give improved overall application performance compared to using `cudaMemcpy`. I bet you can find examples if you look for them. If you like, I can write an answer which describes maybe 2 examples. – Robert Crovella Jan 08 '21 at 02:34
  • @Robert, yes, kindly write an answer. I really want to understand how it actually works. Thanks in advance. – gpuguy Jan 08 '21 at 10:11

2 Answers


Considered purely from a data-transfer-rate perspective, I know of no reason why the data transfer rate for moving data between host and device via PCIE should be any different when comparing moving that data using a zero-copy method vs. moving it using cudaMemcpy.

However, both operations have overheads associated with them. The primary overhead I can think of for zero-copy comes from pinning the host memory. This has a noticeable time cost (e.g. compared to allocating the same amount of data with malloc or new). The primary overhead that comes to mind with cudaMemcpy is a per-transfer cost of at least a few microseconds, associated with the setup of the underlying DMA engine that performs the transfer.
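
As a rough way to see both overheads on a given machine, here is a minimal timing sketch. It is my illustration rather than code from this discussion; the 256MB size and the 1000-iteration loop are arbitrary choices:

    #include <cstdio>
    #include <cstdlib>
    #include <chrono>
    #include <cuda_runtime.h>

    // Microseconds elapsed between two time points.
    static long long us(std::chrono::high_resolution_clock::time_point a,
                        std::chrono::high_resolution_clock::time_point b) {
        return std::chrono::duration_cast<std::chrono::microseconds>(b - a).count();
    }

    int main() {
        const size_t N = 256u << 20;                      // 256 MB, arbitrary test size

        auto t0 = std::chrono::high_resolution_clock::now();
        char *pageable = (char *)malloc(N);               // ordinary pageable memory
        auto t1 = std::chrono::high_resolution_clock::now();
        char *pinned;
        cudaHostAlloc(&pinned, N, cudaHostAllocDefault);  // driver must pin physical pages
        auto t2 = std::chrono::high_resolution_clock::now();

        char *d;
        cudaMalloc(&d, 1);
        cudaMemcpy(d, pinned, 1, cudaMemcpyHostToDevice); // warm-up call
        auto t3 = std::chrono::high_resolution_clock::now();
        for (int i = 0; i < 1000; i++)                    // 1-byte copies: the payload is
            cudaMemcpy(d, pinned, 1, cudaMemcpyHostToDevice); // negligible, the setup cost isn't
        auto t4 = std::chrono::high_resolution_clock::now();

        printf("malloc of 256MB:        %lld us\n", us(t0, t1));
        printf("cudaHostAlloc of 256MB: %lld us\n", us(t1, t2));
        printf("1-byte cudaMemcpy:      %lld us/call\n", us(t3, t4) / 1000);

        free(pageable);
        cudaFreeHost(pinned);
        cudaFree(d);
        return 0;
    }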

Another difference is in accessibility of the data. Pinned/zero-copy data is simultaneously accessible from both host and device, and this can be useful for some kinds of communication patterns that would otherwise be more complicated with, for example, cudaMemcpyAsync.

Here are two fairly simple design patterns where it may make sense to use zero-copy rather than cudaMemcpy.

  1. When you have a large amount of data and you're not sure what will be needed. Suppose we have a large table of data, say 1GB, and the GPU kernel will need access to it. Suppose, also, that the kernel design is such that only one or a few locations in the table are needed for each kernel call, and we don't know a priori which locations those will be. We could use cudaMemcpy to transfer the entire 1GB to the GPU. This would certainly work, but it would take a possibly non-trivial amount of time (e.g. ~0.1s). Suppose also that we don't know which location was updated, and after the kernel call we need access to the modified data on the host. Another transfer would be needed. Using pinned/zero-copy methods here would mostly eliminate the costs associated with moving the data, and since our kernel is only accessing a few locations, the cost for the kernel to do so using zero-copy is far less than 0.1s. (A sketch of this pattern appears after the code listings below.)

  2. When you need to check the status of a search or convergence algorithm. Suppose we have an algorithm that consists of a loop that calls a kernel in each iteration. The kernel is doing some kind of search or convergence-type work, so we need a "stopping condition" test. This might be as simple as a boolean value that we communicate back to the host from the kernel activity, to indicate whether we have reached the stopping point or not. If the stopping point is reached, the loop terminates; otherwise the loop continues with the next kernel launch. There may even be "two-way" communication here. For example, the host code might set the boolean value to false. The kernel might set it to true if iteration needs to continue, but the kernel never sets the flag to false. Therefore, if continuation is needed, the host code sets the flag to false and calls the kernel again. We could realize this with cudaMemcpy:

     bool *d_continue;
     cudaMalloc(&d_continue, sizeof(bool));  // flag lives in device memory
     bool h_continue = true;
     while (h_continue){
       h_continue = false;
       // push "false" to the device; the kernel flips it to true if more
       // iterations are needed
       cudaMemcpy(d_continue, &h_continue, sizeof(bool), cudaMemcpyHostToDevice);
       my_search_kernel<<<...>>>(..., d_continue);
       // blocking copy back; also synchronizes with the kernel
       cudaMemcpy(&h_continue, d_continue, sizeof(bool), cudaMemcpyDeviceToHost);
     }
    

    The above pattern should be workable, but even though we are only transferring a small amount of data (1 byte), the cudaMemcpy operations will each take ~5 microseconds. If this were a performance concern, we could almost certainly reduce the time cost with:

     bool *z_continue;
     // pinned, device-mapped allocation (flags elided here)
     cudaHostAlloc(&z_continue, sizeof(bool), ...);
     *z_continue = true;
     while (*z_continue){
       *z_continue = false;                  // host writes the flag directly
       my_search_kernel<<<...>>>(..., z_continue);
       cudaDeviceSynchronize();              // wait for the kernel's (possible) write
     }
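
As referenced in pattern 1 above, here is a minimal sketch of the "large table, sparse access" idea. It is my illustration rather than code from this discussion, and it assumes a 64-bit system with unified virtual addressing, so the pointer returned by cudaHostAlloc with the cudaHostAllocMapped flag can be passed directly to the kernel:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Each thread reads one entry of the host-resident table over PCIE.
    __global__ void lookup(const float *table, const int *idx, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = table[idx[i]];   // only these reads cross the bus
    }

    int main() {
        const size_t N = 1ULL << 28;         // 2^28 floats = 1GB table
        float *h_table;                      // host pointer, also device-accessible
        cudaHostAlloc(&h_table, N * sizeof(float), cudaHostAllocMapped);
        for (size_t i = 0; i < N; i++) h_table[i] = (float)i;

        const int M = 4;                     // the kernel touches only 4 entries
        int h_idx[M] = {5, 1000, 123456, 99999999};
        int *d_idx; float *d_out;
        cudaMalloc(&d_idx, M * sizeof(int));
        cudaMalloc(&d_out, M * sizeof(float));
        cudaMemcpy(d_idx, h_idx, M * sizeof(int), cudaMemcpyHostToDevice);

        lookup<<<1, M>>>(h_table, d_idx, d_out, M);   // no 1GB cudaMemcpy needed
        float h_out[M];
        cudaMemcpy(h_out, d_out, M * sizeof(float), cudaMemcpyDeviceToHost);
        for (int i = 0; i < M; i++) printf("%f\n", h_out[i]);

        cudaFreeHost(h_table); cudaFree(d_idx); cudaFree(d_out);
        return 0;
    }

The same table could also be updated in place by the kernel, in which case the host sees the modified entries after a cudaDeviceSynchronize() without a second bulk transfer.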
    
Robert Crovella

For example, assume that you wrote a CUDA-accelerated editor algorithm to fix spelling errors in books. If a 2MB text has only 5 bytes of errors, only those 5 bytes need to be edited, so there is no need to copy the whole array between GPU VRAM and system RAM. With zero-copy, the kernel would access only the pages that own the 5 erroneous bytes. Without zero-copy, the whole 2MB text would need to be copied. Copying 2MB takes more time than copying 5 bytes (or just the pages that own those bytes), so it would reduce books/second throughput.
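
A minimal sketch of this idea (my own illustration, with made-up error positions; it assumes a 64-bit UVA system so the mapped host pointer can be passed straight to the kernel):

    #include <cstdio>
    #include <cstring>
    #include <cuda_runtime.h>

    // Each thread patches one byte of the host-resident text over PCIE.
    __global__ void fix_errors(char *text, const int *pos, const char *repl, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) text[pos[i]] = repl[i];   // each write crosses the bus individually
    }

    int main() {
        const size_t LEN = 2u << 20;         // 2MB "book"
        char *h_text;                        // mapped: host-resident, device-visible
        cudaHostAlloc(&h_text, LEN, cudaHostAllocMapped);
        memset(h_text, 'a', LEN);

        // Pretend a previous pass found 5 misspelled bytes at these offsets.
        const int N = 5;
        int h_pos[N] = {10, 5000, 90000, 1000000, 2000000};
        char h_repl[N] = {'b', 'c', 'd', 'e', 'f'};
        int *d_pos; char *d_repl;
        cudaMalloc(&d_pos, N * sizeof(int));
        cudaMalloc(&d_repl, N);
        cudaMemcpy(d_pos, h_pos, N * sizeof(int), cudaMemcpyHostToDevice);
        cudaMemcpy(d_repl, h_repl, N, cudaMemcpyHostToDevice);

        fix_errors<<<1, N>>>(h_text, d_pos, d_repl, N);  // no 2MB copy back
        cudaDeviceSynchronize();
        printf("%c %c\n", h_text[10], h_text[5000]);     // prints: b c

        cudaFreeHost(h_text); cudaFree(d_pos); cudaFree(d_repl);
        return 0;
    }

Only the written bytes (and their cache lines) cross the bus, rather than the whole 2MB buffer.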

Another example: there could be a sparse path-tracing algorithm that adds shiny surfaces to a few small objects in a game scene. The result may need to update just 10-100 pixels instead of 1920x1080 pixels. Zero-copy would work better here again.

Maybe sparse matrix multiplication would work better with zero-copy too. If 8192x8192 matrices are multiplied but only 3-5 elements are non-zero, then zero-copy could still make a difference when writing the results.

huseyin tugrul buyukisik