Does the host MemCpyAsync block until the previous copy is complete?
No, generally not, assuming by "block" you mean block the CPU thread. Asynchronous work items issued to the GPU go into a queue, and control is immediately returned to the CPU thread, before the work has begun. The queue can hold "many" items. Issuance of work can proceed until the queue is full without any other hindrances or dependencies.
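For illustration, here is a minimal sketch of that non-blocking behavior (d_buf, h_buf, and bytes are placeholder names, and h_buf is assumed to point to pinned host memory, since cudaMemcpyAsync issued on pageable memory may not behave asynchronously):
cudaStream_t s;
cudaStreamCreate(&s);
cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, s); // returns immediately; the copy is merely queued
// ... the CPU thread is free to do unrelated work here, while the copy is pending or executing ...
cudaStreamSynchronize(s); // this call *does* block the CPU thread, until all work issued to s has completed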
It's important to keep in mind one of the two rules of stream semantics:
- Items issued into the same stream execute in issue order. Item B, issued after item A, will not begin until A has finished.
So let's say we had a case like this (and assume h_ibuff and h_obuff point to pinned host memory):
cudaStream_t stream;
cudaStreamCreate(&stream);
for (int i = 0; i < frames; i++){
  cudaMemcpyAsync(d_ibuff, h_ibuff, chunksize, cudaMemcpyHostToDevice, stream);
  kernel<<<...,stream>>>(...);
  cudaMemcpyAsync(h_obuff, d_obuff, chunksize, cudaMemcpyDeviceToHost, stream);
}
on the second pass of the loop, the cudaMemcpyAsync operations will be inserted into a queue, but will not begin to execute (or do anything, really) until stream semantics say they can begin. This is true for each and every op issued by this loop.
A reasonable question might be: "what if, on each pass of the loop, I wanted different contents in h_ibuff?" (quite sensible). Then you would need to address that specifically. Simply inserting a plain memcpy operation, by itself, to "reload" h_ibuff isn't going to work. You'd need some sort of synchronization. For example, you might decide that you want to "refill" h_ibuff while the kernel and the subsequent D->H cudaMemcpyAsync operation are executing. You could do something like this:
cudaStream_t stream;
cudaEvent_t event;
cudaEventCreate(&event);
cudaStreamCreate(&stream);
for (int i = 0; i < frames; i++){
  cudaMemcpyAsync(d_ibuff, h_ibuff, chunksize, cudaMemcpyHostToDevice, stream);
  cudaEventRecord(event, stream);
  kernel<<<...,stream>>>(...);
  cudaMemcpyAsync(h_obuff, d_obuff, chunksize, cudaMemcpyDeviceToHost, stream);
  cudaEventSynchronize(event);
  memcpy(h_ibuff, databuff+i*chunksize, chunksize); // "refill"
}
This refactoring allows the asynchronous work to be issued so as to keep the GPU busy, while "overlapping" the host-side copy that "refills" h_ibuff. It also prevents the "refill" operation from beginning until the previous buffer contents have been safely transferred to the device, and prevents the next H->D buffer copy from beginning until the new contents have been "reloaded".
This isn't the only way to do this; it's one possible approach.
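For instance, another common pattern (a sketch only, assuming doubled pinned host buffers h_ibuff[2]/h_obuff[2] and doubled device buffers d_ibuff[2]/d_obuff[2], none of which appear in the code above) is to double-buffer across two streams, so that the CPU refill and the copies for one frame can overlap the GPU work of the previous frame:
cudaStream_t s[2];
cudaStreamCreate(&s[0]); cudaStreamCreate(&s[1]);
for (int i = 0; i < frames; i++){
  int j = i & 1;                                            // ping-pong index
  cudaStreamSynchronize(s[j]);                              // wait for frame i-2's work on buffer set j
  // (frame i-2's results are now valid in h_obuff[j] and could be consumed here)
  memcpy(h_ibuff[j], databuff + i*chunksize, chunksize);    // refill pinned input buffer j
  cudaMemcpyAsync(d_ibuff[j], h_ibuff[j], chunksize, cudaMemcpyHostToDevice, s[j]);
  kernel<<<...,s[j]>>>(...);
  cudaMemcpyAsync(h_obuff[j], d_obuff[j], chunksize, cudaMemcpyDeviceToHost, s[j]);
}
cudaStreamSynchronize(s[0]); cudaStreamSynchronize(s[1]);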
For this last question asked and answered above, you might pose a similar question: "how about handling the output buffer side?" The mechanism could be very similar; it is left as an exercise for the reader.
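One possible shape for that mechanism, sketched under the same assumptions as before (resultbuff is a hypothetical destination for the accumulated outputs, ievent plays the role of event above, and a second event tracks the D->H copy):
cudaEvent_t ievent, oevent;
cudaEventCreate(&ievent);
cudaEventCreate(&oevent);
for (int i = 0; i < frames; i++){
  cudaMemcpyAsync(d_ibuff, h_ibuff, chunksize, cudaMemcpyHostToDevice, stream);
  cudaEventRecord(ievent, stream);
  kernel<<<...,stream>>>(...);
  if (i > 0){
    cudaEventSynchronize(oevent);                              // previous frame's D->H copy has finished
    memcpy(resultbuff + (i-1)*chunksize, h_obuff, chunksize);  // "drain" the previous output before h_obuff is overwritten
  }
  cudaMemcpyAsync(h_obuff, d_obuff, chunksize, cudaMemcpyDeviceToHost, stream);
  cudaEventRecord(oevent, stream);
  cudaEventSynchronize(ievent);
  memcpy(h_ibuff, databuff + i*chunksize, chunksize);          // "refill" the input, as before
}
cudaEventSynchronize(oevent);                                  // drain the final frame's output
memcpy(resultbuff + (frames-1)*chunksize, h_obuff, chunksize);
Note that the "drain" happens before issuing the current frame's D->H copy, so h_obuff cannot be overwritten while the CPU is still reading it.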
For structured learning on this topic, you might wish to study the CUDA concurrency section of this lecture series.