Does the host MemCpyAsync block until the previous copy is complete?
No, generally not, assuming by "block" you mean block the CPU thread. Asynchronous work items issued to the GPU go into a queue, and control is immediately returned to the CPU thread, before the work has begun. The queue can hold "many" items. Issuance of work can proceed until the queue is full without any other hindrances or dependencies.
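For illustration, here is a minimal sketch of that non-blocking behavior (d_buf, h_buf, and bytes are placeholder names, and h_buf is assumed to point to pinned host memory, since cudaMemcpyAsync issued on pageable memory may not behave asynchronously):
cudaStream_t s;
cudaStreamCreate(&s);
cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, s); // returns immediately; the copy is merely queued
// ... the CPU thread is free to do unrelated work here, while the copy is pending or executing ...
cudaStreamSynchronize(s); // this call *does* block the CPU thread, until all work issued to s has completed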
It's important to keep in mind one of the two rules of stream semantics:
- Items issued into the same stream execute in issue order. Item B, issued after item A, will not begin until A has finished.
So let's say we had a case like this (and assume h_ibuff and h_obuff point to pinned host memory):
cudaStream_t stream;
cudaStreamCreate(&stream);
for (int i = 0; i < frames; i++){
  cudaMemcpyAsync(d_ibuff, h_ibuff, chunksize, cudaMemcpyHostToDevice, stream);
  kernel<<<...,stream>>>(...);
  cudaMemcpyAsync(h_obuff, d_obuff, chunksize, cudaMemcpyDeviceToHost, stream);
}
on the second pass of the loop, the cudaMemcpyAsync operations will be inserted into a queue, but will not begin to execute (or do anything, really) until stream semantics say they can begin. This is true for each and every op issued by this loop.
A reasonable question might be: "what if, on each pass of the loop, I wanted different contents in h_ibuff?" (quite sensible). Then you would need to address that specifically. Simply inserting a plain memcpy operation, by itself, to "reload" h_ibuff isn't going to work. You'd need some sort of synchronization. For example, you might decide that you want to "refill" h_ibuff while the kernel and the subsequent D->H cudaMemcpyAsync operation are executing. You could do something like this:
cudaStream_t stream;
cudaEvent_t event;
cudaEventCreate(&event);
cudaStreamCreate(&stream);
for (int i = 0; i < frames; i++){
  cudaMemcpyAsync(d_ibuff, h_ibuff, chunksize, cudaMemcpyHostToDevice, stream);
  cudaEventRecord(event, stream);
  kernel<<<...,stream>>>(...);
  cudaMemcpyAsync(h_obuff, d_obuff, chunksize, cudaMemcpyDeviceToHost, stream);
  cudaEventSynchronize(event);
  memcpy(h_ibuff, databuff+i*chunksize, chunksize); // "refill"
}
This refactoring allows the asynchronous work to be issued so as to keep the GPU busy, while "overlapping" the host-side copy that "refills" h_ibuff. It also prevents the "refill" operation from beginning until the previous buffer contents have been safely transferred to the device, and prevents the next H->D buffer copy from beginning until the new contents have been "reloaded".
This isn't the only way to do this; it's one possible approach.
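For instance, another common pattern (a sketch only, assuming doubled pinned host buffers h_ibuff[2]/h_obuff[2] and doubled device buffers d_ibuff[2]/d_obuff[2], none of which appear in the code above) is to double-buffer across two streams, so that the CPU refill and the copies for one frame can overlap the GPU work of the previous frame:
cudaStream_t s[2];
cudaStreamCreate(&s[0]); cudaStreamCreate(&s[1]);
for (int i = 0; i < frames; i++){
  int j = i & 1;                                            // ping-pong index
  cudaStreamSynchronize(s[j]);                              // wait for frame i-2's work on buffer set j
  // (frame i-2's results are now valid in h_obuff[j] and could be consumed here)
  memcpy(h_ibuff[j], databuff + i*chunksize, chunksize);    // refill pinned input buffer j
  cudaMemcpyAsync(d_ibuff[j], h_ibuff[j], chunksize, cudaMemcpyHostToDevice, s[j]);
  kernel<<<...,s[j]>>>(...);
  cudaMemcpyAsync(h_obuff[j], d_obuff[j], chunksize, cudaMemcpyDeviceToHost, s[j]);
}
cudaStreamSynchronize(s[0]); cudaStreamSynchronize(s[1]);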
For this last question asked and answered above, you might pose a similar question: "how about handling the output buffer side?" The mechanism could be very similar; it is left as an exercise for the reader.
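One possible shape for that mechanism, sketched under the same assumptions as before (resultbuff is a hypothetical destination for the accumulated outputs, ievent plays the role of event above, and a second event tracks the D->H copy):
cudaEvent_t ievent, oevent;
cudaEventCreate(&ievent);
cudaEventCreate(&oevent);
for (int i = 0; i < frames; i++){
  cudaMemcpyAsync(d_ibuff, h_ibuff, chunksize, cudaMemcpyHostToDevice, stream);
  cudaEventRecord(ievent, stream);
  kernel<<<...,stream>>>(...);
  if (i > 0){
    cudaEventSynchronize(oevent);                              // previous frame's D->H copy has finished
    memcpy(resultbuff + (i-1)*chunksize, h_obuff, chunksize);  // "drain" the previous output before h_obuff is overwritten
  }
  cudaMemcpyAsync(h_obuff, d_obuff, chunksize, cudaMemcpyDeviceToHost, stream);
  cudaEventRecord(oevent, stream);
  cudaEventSynchronize(ievent);
  memcpy(h_ibuff, databuff + i*chunksize, chunksize);          // "refill" the input, as before
}
cudaEventSynchronize(oevent);                                  // drain the final frame's output
memcpy(resultbuff + (frames-1)*chunksize, h_obuff, chunksize);
Note that the "drain" happens before issuing the current frame's D->H copy, so h_obuff cannot be overwritten while the CPU is still reading it.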
For structured learning on this topic, you might wish to study the CUDA concurrency section of this lecture series.