Non-blocking synchronization of streams in CUDA?

Question

Is it possible to synchronize two CUDA streams without blocking the host? I know there's cudaStreamWaitEvent, which is non-blocking. But what about creating and destroying the events with cudaEventCreate and cudaEventDestroy?

The documentation for cudaEventDestroy says:

In case event has been recorded but has not yet been completed when cudaEventDestroy() is called, the function will return immediately and the resources associated with event will be released automatically once the device has completed event.

What I don't understand here is what the difference is between a recorded event and a completed event. Also this seems to imply that the call is blocking if the event has not yet been recorded.

Anyone who can shed some light on this?

An event is created when you call `cudaEventCreate()` on it. An event is recorded when you call `cudaEventRecord()` on it. An event is completed when processing of the stream it was recorded into reaches that event. For example, if I record an event into a stream immediately after a kernel call, the event is recorded but incomplete until the kernel has finished. Once the kernel finishes, the event recorded after it is marked complete (and stream processing continues). – Robert Crovella, Aug 08 '16 at 10:51
A `cudaEventDestroy` call is **not** blocking if the event has not yet been **recorded**. – Robert Crovella, Aug 08 '16 at 10:53
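
The distinction between recorded and completed can be observed from the host with `cudaEventQuery`, which does not block. The following is only a minimal sketch: `busyKernel` and its cycle count are hypothetical, chosen just to keep the stream busy, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: spins for roughly `cycles` clock ticks so the stream stays busy.
__global__ void busyKernel(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main() {
    cudaStream_t stream;
    cudaEvent_t event;
    cudaStreamCreate(&stream);
    cudaEventCreateWithFlags(&event, cudaEventDisableTiming);

    // Created but never recorded: the query reports cudaSuccess.
    printf("before record: %s\n", cudaGetErrorName(cudaEventQuery(event)));

    busyKernel<<<1, 1, 0, stream>>>(500000000LL);
    cudaEventRecord(event, stream);  // recorded behind the kernel

    // Recorded but most likely not yet completed: expect cudaErrorNotReady
    // while the kernel is still running.
    printf("after record:  %s\n", cudaGetErrorName(cudaEventQuery(event)));

    cudaStreamSynchronize(stream);   // blocks the host; used here only for the demo

    // The stream has passed the event, so it is now completed.
    printf("after sync:    %s\n", cudaGetErrorName(cudaEventQuery(event)));

    cudaEventDestroy(event);
    cudaStreamDestroy(stream);
    return 0;
}
```

The middle query is timing-dependent: on a very fast device or with a shorter spin it could already report `cudaSuccess`.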

score 5 · Accepted Answer · answered Aug 08 '16 at 15:41

You're on the right track with cudaStreamWaitEvent. Creating events does carry some cost, but you can create them once at application start-up so that the creation time is not paid during your GPU routines.

An event is recorded when you put it into a stream. It is completed once all activity that was put into the stream before the event has completed. Recording the event essentially places a marker in the stream, and that marker is what lets cudaStreamWaitEvent hold back the waiting stream until the event has completed.
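
As a rough illustration of the pattern described above, the sketch below makes one stream wait on work in another without the host blocking at the synchronization point. The kernels `producer` and `consumer`, the `CHECK` macro, and the buffer size are hypothetical names introduced for this example. It also assumes, based on the `cudaEventDestroy` documentation quoted in the question, that destroying the event right after `cudaStreamWaitEvent` is safe because the resources are released only once the device has completed the event.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a runtime call fails.
#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t e = (call);                                        \
        if (e != cudaSuccess) {                                        \
            fprintf(stderr, "%s failed: %s\n", #call,                  \
                    cudaGetErrorString(e));                            \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Hypothetical kernels: the consumer must not start before the producer is done.
__global__ void producer(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = i;
}

__global__ void consumer(const int *in, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2;
}

int main() {
    const int n = 1 << 20;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    int *a = nullptr, *b = nullptr;
    CHECK(cudaMalloc(&a, n * sizeof(int)));
    CHECK(cudaMalloc(&b, n * sizeof(int)));

    cudaStream_t s1, s2;
    CHECK(cudaStreamCreate(&s1));
    CHECK(cudaStreamCreate(&s2));

    // Timing is not needed for pure synchronization, so disable it.
    cudaEvent_t done;
    CHECK(cudaEventCreateWithFlags(&done, cudaEventDisableTiming));

    producer<<<blocks, threads, 0, s1>>>(a, n);
    CHECK(cudaEventRecord(done, s1));         // marker after the producer in s1
    CHECK(cudaStreamWaitEvent(s2, done, 0));  // s2 waits on the device; host continues
    CHECK(cudaEventDestroy(done));            // assumption: deferred release, see lead-in
    consumer<<<blocks, threads, 0, s2>>>(a, b, n);
    CHECK(cudaGetLastError());

    // The host is free to enqueue other work here.

    // Block only at the very end, to read back a result for the demo.
    int last = 0;
    CHECK(cudaMemcpyAsync(&last, b + n - 1, sizeof(int),
                          cudaMemcpyDeviceToHost, s2));
    CHECK(cudaStreamSynchronize(s2));
    printf("last element: %d\n", last);

    CHECK(cudaStreamDestroy(s1));
    CHECK(cudaStreamDestroy(s2));
    CHECK(cudaFree(a));
    CHECK(cudaFree(b));
    return 0;
}
```

If the number of synchronization points is not known in advance, creating and destroying an event per synchronization as shown here is one option. Reusing events from a small pool is another, at the cost of tracking which events are still in flight.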

jefflarkin

I cannot create the events at start-up because I don't know how often I will have to synchronize. Also I want to put work on other execution streams so I need asynchronous behaviour on the host. But I got the difference between recorded and completed events so thanks for that. – spfrnd Aug 09 '16 at 08:03

FWIW I just timed creating and destroying 1000 events. On average creation was about 115us and destruction didn't even register on the timer. NVPROF reports times around 500ns normally with some outliers. It looks like roughly 1 in every 10-15 creations takes longer than the others, pulling my average up. Hopefully this won't cause too much synchronization for your needs. – jefflarkin Aug 09 '16 at 14:02
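
A measurement along these lines could be sketched as follows. This is not the code used for the numbers above: the use of `std::chrono`, the default event flags, and the warm-up call are all assumptions, and results will vary by driver and GPU.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int kCount = 1000;  // same event count as in the comment above
    static cudaEvent_t events[kCount];

    // Force context creation first so it is not charged to the first event.
    cudaFree(0);

    using clock = std::chrono::steady_clock;

    auto t0 = clock::now();
    for (int i = 0; i < kCount; ++i) {
        cudaEventCreate(&events[i]);  // default flags (assumption)
    }
    auto t1 = clock::now();
    for (int i = 0; i < kCount; ++i) {
        cudaEventDestroy(events[i]);
    }
    auto t2 = clock::now();

    double createUs =
        std::chrono::duration<double, std::micro>(t1 - t0).count() / kCount;
    double destroyUs =
        std::chrono::duration<double, std::micro>(t2 - t1).count() / kCount;

    printf("average create:  %.3f us\n", createUs);
    printf("average destroy: %.3f us\n", destroyUs);
    return 0;
}
```

Timing each call individually instead of averaging over the loop would make the occasional slow creations visible.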
