I am trying to implement the following kind of pipeline on the GPU with CUDA:

[Figure: Pipeline]

I have four streams, each with a Host2Device copy, a kernel call, and a Device2Host copy. However, the kernel call in each stream has to wait for the Host2Device copy of the next stream to finish.

I intended to use cudaStreamWaitEvent for synchronization. However, according to the documentation, this only has an effect if cudaEventRecord has been called on the corresponding event beforehand, and that is not the case in this scenario.

The streams are managed by separate CPU threads which basically look as follows:

Do some work ...
cudaMemcpyAsync H2D
cudaEventRecord (event_copy_complete[current_stream])
cudaStreamWaitEvent (event_copy_complete[next_stream])
call kernel on current stream
cudaMemcpyAsync D2H
Do some work ...

The CPU threads are coordinated so that they start the streams in the correct order. As a result, cudaStreamWaitEvent for the copy-complete event of stream 1 is called (in stream 0) before cudaEventRecord for that very event is called (in stream 1). This makes the wait a functional no-op.

I have the feeling that events can't be used this way. Is there another way to achieve the desired synchronization?

By the way, I can't simply reverse the stream order, because there are further dependencies.

API call order

As requested, here is the order in which CUDA calls are issued:

//all on stream 0
cpy H2D
cudaEventRecord (event_copy_complete[0])
cudaStreamWaitEvent (event_copy_complete[1])
K<<< >>>    
cpy D2H

//all on stream 1
cpy H2D
cudaEventRecord (event_copy_complete[1])
cudaStreamWaitEvent (event_copy_complete[2])
K<<< >>>    
cpy D2H

//all on stream 2
cpy H2D
cudaEventRecord (event_copy_complete[2])
cudaStreamWaitEvent (event_copy_complete[3])
K<<< >>>    
cpy D2H
...

As can be seen, the call to cudaStreamWaitEvent is always earlier than the call to cudaEventRecord.

Nico Schertler
  • How about using thread synchronization primitives? – user703016 Nov 26 '14 at 11:09
  • @ParkYoung-Bae What primitives do you have in mind? The synchronization of CPU threads is no problem, but rather the synchronization of the GPU pipeline. – Nico Schertler Nov 26 '14 at 12:50
  • Well, if your problem comes from the fact that the CPU threads don't issue the calls to the CUDA API in order, then it's the CPU threads you have to synchronize. When those calls are issued in order, you will have solved your GPU pipeline synchronization problem. For a particular primitive that could solve your problem (if I'm understanding it correctly), condition variables come to mind. – user703016 Nov 26 '14 at 13:55
  • The GPU calls are issued in order. The only problem is that the kernel execution must wait for the copy operation from another stream to finish. – Nico Schertler Nov 26 '14 at 14:02
  • Your last comment seems to be contradicting the following: `cudaStreamWaitEvent for the copy complete event of stream 1 is called (in stream 0) before cudaEventRecord of that very event (in stream 1)`, which, to me, means that you aren't issuing the API calls in a correct, total order. Could you please edit your question to clarify what you mean? Or provide a timeline of API calls. – user703016 Nov 26 '14 at 14:14
  • What CPU threading model are you using? OpenMP? pthreads? std::thread? – Robert Crovella Nov 26 '14 at 15:30
  • @RobertCrovella OpenMP. But as mentioned before, CPU threading is not the problem. – Nico Schertler Nov 26 '14 at 18:12
  • I didn't say CPU threading is the problem. The *only* question I see is this one: "Is there another way to achieve the desired synchronization?" Have you considered using CPU thread synchronization to try to accomplish this (as the very first comment suggested?) Unfortunately, OMP is probably a less useful threading model in this respect. – Robert Crovella Nov 26 '14 at 18:36
  • @RobertCrovella I guess CPU synchronization methods cannot be used for GPU synchronization because the calls are issued asynchronously. – Nico Schertler Nov 26 '14 at 18:41
  • 1
    I don't think that is the right way to look at it. You could create a global condition (e.g. semaphore), initially set to zero. Once the OMP thread in stream 1 issues the H2D and the event record operation (where you would then like to record the event), set the condition to 1. The OMP thread responsible for stream 0, after issuing the H2D, waits on that condition while it is zero. Once it changes to 1, it issues the cudaWaitEvent call (the event is now valid), on the newly recorded event (and then issues the kernel call, etc.). I can construct something like that using pthreads. – Robert Crovella Nov 26 '14 at 19:05
  • @RobertCrovella Yeah, that could work. I already have some similar synchronization. I'll try that out. – Nico Schertler Nov 26 '14 at 19:31
  • I had in mind exactly what Robert Crovella suggested. If you want to make that an answer, I would provide an upvote. – user703016 Nov 26 '14 at 19:32

1 Answer


If at all possible, you should dispatch all of this GPU work from a single CPU thread. That way (at the risk of stating the obvious), the order in which the API calls are performed can be inferred from the order in which they appear in your code. Because the cudaEventRecord() and cudaStreamWaitEvent() calls both operate on progress values associated with the CUDA context, the exact order of the API calls matters: cudaEventRecord() records the current progress value and then increments it, while cudaStreamWaitEvent() emits a command for the GPU to wait on the event's current progress value. That is why issuing cudaStreamWaitEvent() before the corresponding cudaEventRecord() turns the wait into an effective no-op.
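
To make this concrete, here is a minimal sketch of such a single-threaded dispatch loop for the pipeline above. The event array event_copy_complete is taken from the question; numStreams, stream, the buffers, and the kernel are placeholder names, not from the original code:

// First pass: issue every H2D copy and record a copy-complete event in its stream.
for (int i = 0; i < numStreams; ++i) {
    cudaMemcpyAsync(d_in[i], h_in[i], bytes, cudaMemcpyHostToDevice, stream[i]);
    cudaEventRecord(event_copy_complete[i], stream[i]);
}

// Second pass: every event has already been recorded, so the cross-stream
// waits now refer to valid progress values instead of being no-ops.
for (int i = 0; i < numStreams; ++i) {
    if (i + 1 < numStreams)  // kernel in stream i waits for the H2D copy of stream i+1
        cudaStreamWaitEvent(stream[i], event_copy_complete[i + 1], 0);
    kernel<<<grid, block, 0, stream[i]>>>(d_in[i], d_out[i]);
    cudaMemcpyAsync(h_out[i], d_out[i], bytes, cudaMemcpyDeviceToHost, stream[i]);
}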

If the API calls are being made from different threads, you will have to do a lot of thread synchronization to get the desired result, which also hurts performance. In fact, if you need multiple CPU threads for performance reasons, you may want to restructure your code so that all CUDA calls are delegated to a single CPU thread, which enforces the ordering.
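
For completeness, here is a minimal sketch of the kind of cross-thread signalling described in the comments, in case the per-stream CPU threads have to stay. It assumes C++11 threading primitives (std::mutex / std::condition_variable) rather than OpenMP, and all names apart from event_copy_complete are placeholders:

#include <condition_variable>
#include <mutex>
#include <vector>

std::mutex m;
std::condition_variable cv;
std::vector<bool> event_recorded(numStreams, false);   // guarded by m

// Body of the CPU thread that owns stream i; 'next' is the index of the
// stream whose H2D copy this stream's kernel has to wait for (-1 for none).
void streamWorker(int i, int next)
{
    // ... do some work ...
    cudaMemcpyAsync(d_in[i], h_in[i], bytes, cudaMemcpyHostToDevice, stream[i]);
    cudaEventRecord(event_copy_complete[i], stream[i]);

    // Tell the other threads that this stream's event is now valid.
    { std::lock_guard<std::mutex> lock(m); event_recorded[i] = true; }
    cv.notify_all();

    if (next >= 0) {
        // Block until the next stream's event has actually been recorded,
        // then ask the GPU to wait on it.
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [&] { return event_recorded[next]; });
        lock.unlock();
        cudaStreamWaitEvent(stream[i], event_copy_complete[next], 0);
    }

    kernel<<<grid, block, 0, stream[i]>>>(d_in[i], d_out[i]);
    cudaMemcpyAsync(h_out[i], d_out[i], bytes, cudaMemcpyDeviceToHost, stream[i]);
    // ... do some work ...
}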

ArchaeaSoftware