cudaStreamWaitEvent does not seem to wait

Question

I am attempting to write a small demo program in which two CUDA streams make progress and, governed by events, wait for each other. So far the program looks like this:

// event.cu
#include <iostream>
#include <cstdio>
#include <cuda_runtime.h>
#include <cuda.h>

using namespace std;

__global__ void k_A1() { printf("\tHi! I am Kernel A1.\n"); }
__global__ void k_B1() { printf("\tHi! I am Kernel B1.\n"); }
__global__ void k_A2() { printf("\tHi! I am Kernel A2.\n"); }
__global__ void k_B2() { printf("\tHi! I am Kernel B2.\n"); }

int main()
{
  cudaStream_t streamA, streamB;
  cudaEvent_t halfA, halfB;
  cudaStreamCreate(&streamA);
  cudaStreamCreate(&streamB);
  cudaEventCreate(&halfA);
  cudaEventCreate(&halfB);

  cout << "Here is the plan:" << endl <<
    "Stream A: A1, launch 'HalfA', wait for 'HalfB', A2." << endl <<
    "Stream B: Wait for 'HalfA', B1, launch 'HalfB', B2." << endl <<
    "I would expect: A1,B1, (A2 and B2 running concurrently)." << endl;

  k_A1<<<1,1,0,streamA>>>(); // A1!
  cudaEventRecord(halfA,streamA); // StreamA triggers halfA!
  cudaStreamWaitEvent(streamA,halfB,0); // StreamA waits for halfB.
  k_A2<<<1,1,0,streamA>>>(); // A2!

  cudaStreamWaitEvent(streamB,halfA,0); // StreamB waits, for halfA.
  k_B1<<<1,1,0,streamB>>>(); // B1!
  cudaEventRecord(halfB,streamB); // StreamB triggers halfB!
  k_B2<<<1,1,0,streamB>>>(); // B2!

  cudaEventDestroy(halfB);
  cudaEventDestroy(halfA);
  cudaStreamDestroy(streamB);
  cudaStreamDestroy(streamA);

  cout << "All has been started. Synchronize!" << endl;
  cudaDeviceSynchronize();
  return 0;
}

My grasp of CUDA streams is the following: a stream is a kind of list to which I can add tasks, and these tasks are processed in series. So in my program I can rest assured that streamA would, in order:

Call kernel k_A1
Trigger halfA
Wait for someone to trigger halfB
Call kernel k_A2

and streamB would

Wait for someone to trigger halfA
Call kernel k_B1
Trigger halfB
Call kernel k_B2

Normally both streams might run asynchronously to each other. However, I would like to block streamB until A1 is done and then block streamA until B1 is done.

This appears not to be that simple. On my Ubuntu machine with a Tesla M2090 (CC 2.0), the output of

nvcc -arch=sm_20 event.cu && ./a.out

is

Here is the plan:
Stream A: A1, launch 'HalfA', wait for 'HalfB', A2.
Stream B: Wait for 'HalfA', B1, launch 'HalfB', B2.
I would expect: A1,B1, (A2 and B2 running concurrently).
All has been started. Synchronize!
        Hi! I am Kernel A1.
        Hi! I am Kernel A2.
        Hi! I am Kernel B1.
        Hi! I am Kernel B2.

I would have expected B1 to complete before cudaEventRecord(halfB,streamB) takes effect, and A2 to start only after that. Nevertheless, stream A obviously does not wait for the completion of B1, and hence not for the recording of halfB.

What's more: if I delete the cudaEventRecord calls altogether, I would expect the program to lock up on the cudaStreamWaitEvent calls. But it does not, and it produces the same output. What am I overlooking here?

In the meantime I also checked for the possibility that maybe everything is okay and only the printf, being actually performed by the CPU (I guess... true?), is done in wrong order thus merely giving the impression of a problem. However, when my Kernels do some non-commutative math the results confirm the wrong order: A1,A2,B1,B2. — Markus-Hermann, Mar 21 '13 at 14:06

score 8 · Answer 1 · edited Jan 23 '18 at 02:22

I think this is because cudaStreamWaitEvent(streamA,halfB,0) was called on the host before halfB was recorded with cudaEventRecord(halfB,streamB). The cudaStreamWaitEvent call most likely looked for a record of halfB that had already been issued; since there was none, it quietly moved on. See the following documentation:

The stream stream will wait only for the completion of the most recent host call to cudaEventRecord() on event. Once this call has returned, any functions (including cudaEventRecord() and cudaEventDestroy()) may be called on event again, and the subsequent calls will not have any effect on stream.

I could not find a solution if you have to issue the calls depth-first (all of stream A's work first, then all of stream B's); however, issuing them in the following order may give you what you want:

  k_A1<<<1,1,0,streamA>>>(d); // A1!
  cudaEventRecord(halfA,streamA); // StreamA triggers halfA!
  cudaStreamWaitEvent(streamB,halfA,0); // StreamB waits, for halfA.
  k_B1<<<1,1,0,streamB>>>(d); // B1!
  cudaEventRecord(halfB,streamB); // StreamB triggers halfB!
  cudaStreamWaitEvent(streamA,halfB,0); // StreamA waits for halfB.
  k_A2<<<1,1,0,streamA>>>(d); // A2!
  k_B2<<<1,1,0,streamB>>>(d); // B2!

which is confirmed by profiling:

[Image: profiler output]

Note that I changed the kernel interfaces: each kernel now takes an argument d.
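To check the ordering without relying on printf or a profiler, the reordered calls can be wrapped in a small self-test. This is only a sketch under assumptions: the kernel bodies, the three-int buffer layout and the helper name reorderedScheduleKeepsOrder are made up for illustration and are not the kernels used above. Each kernel applies a different, non-commutative operation, so the final values reveal the order in which the kernels ran.

```cuda
#include <cuda_runtime.h>

// Illustrative kernels (assumed, not from the answer above).
// d[0] is the shared value; d[1] and d[2] receive the results of A2 and B2.
__global__ void k_A1(int *d) { d[0] += 1; }         // must run first
__global__ void k_B1(int *d) { d[0] *= 2; }         // must run after A1
__global__ void k_A2(int *d) { d[1] = d[0] + 10; }  // only reads d[0]
__global__ void k_B2(int *d) { d[2] = d[0] * 10; }  // only reads d[0]

// Hypothetical helper: returns true if the results are consistent with
// the order A1, B1, then A2 and B2 (in either order or concurrently).
bool reorderedScheduleKeepsOrder()
{
  int h[3] = {0, 0, 0};
  int *d = nullptr;
  if (cudaMalloc(&d, sizeof(h)) != cudaSuccess) return false;
  cudaMemset(d, 0, sizeof(h));
  cudaDeviceSynchronize();  // zeroing is finished before the streams start

  cudaStream_t streamA, streamB;
  cudaEvent_t halfA, halfB;
  cudaStreamCreate(&streamA);
  cudaStreamCreate(&streamB);
  cudaEventCreate(&halfA);
  cudaEventCreate(&halfB);

  // Same call order as in the answer: every record is issued on the host
  // before the wait that refers to it.
  k_A1<<<1,1,0,streamA>>>(d);
  cudaEventRecord(halfA, streamA);
  cudaStreamWaitEvent(streamB, halfA, 0);
  k_B1<<<1,1,0,streamB>>>(d);
  cudaEventRecord(halfB, streamB);
  cudaStreamWaitEvent(streamA, halfB, 0);
  k_A2<<<1,1,0,streamA>>>(d);
  k_B2<<<1,1,0,streamB>>>(d);

  cudaError_t status = cudaDeviceSynchronize();
  cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);

  cudaEventDestroy(halfB);
  cudaEventDestroy(halfA);
  cudaStreamDestroy(streamB);
  cudaStreamDestroy(streamA);
  cudaFree(d);

  // A1 then B1 gives (0 + 1) * 2 = 2; A2 and B2 both read that value.
  return status == cudaSuccess && h[0] == 2 && h[1] == 12 && h[2] == 20;
}
```

Calling reorderedScheduleKeepsOrder() from main and printing the result is enough to use it. If A2 were allowed to run before B1, as in the question's original call order, d[1] would be computed from the value before the doubling, and the check would be expected to fail.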

Cuda Profiler... neat. Will look into this. Anyway: Thx for your trouble which (along with the fact that so far no one [myself included] found a real solution) points in the "Will not work" direction! What I am _really_ looking for here is a blocking mechanism that will _not_ quietly move forward but halt the entire stream until either the whole context is ended or the event (or some timeout) is fired. Is there really no such thing in CUDA? — Markus-Hermann, Mar 25 '13 at 11:58
@Markus-Hermann, did you ever figure out a blocking mechanism that did what you wanted? I also find it very hard to believe CUDA does not have a wait-for-event that acts like a barrier — dag, Oct 07 '19 at 12:10
This line in the documentation seem to suggest to a barrier like construct: "If stream is NULL, any future work submitted in any stream will wait for event to complete before beginning execution. This effectively creates a barrier for all future work submitted to the device on this thread." — Jonathan, Nov 14 '21 at 21:13

score 1 · Answer 2 · answered Sep 21 '22 at 06:29

From the docs:

If cudaEventRecord() has not been called on event, this call acts as if the record has already completed, and so is a functional no-op.

https://www.cs.cmu.edu/afs/cs/academic/class/15668-s11/www/cuda-doc/html/group__CUDART__STREAM_gfe68d207dc965685d92d3f03d77b0876.html#gfe68d207dc965685d92d3f03d77b0876
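The no-op behaviour described in that quote can be observed directly with a sketch like the one below. The helper name waitOnUnrecordedEventIsNoOp is hypothetical; the runtime calls are the same ones used in the question. The idea: make a stream wait on an event that was never recorded, then synchronize the stream. If the wait were a real barrier, the synchronize would never return.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper (not part of the CUDA API): returns true if a stream
// that waits on a never-recorded event still drains without blocking.
bool waitOnUnrecordedEventIsNoOp()
{
  cudaStream_t stream;
  cudaEvent_t neverRecorded;
  if (cudaStreamCreate(&stream) != cudaSuccess) return false;
  if (cudaEventCreate(&neverRecorded) != cudaSuccess) {
    cudaStreamDestroy(stream);
    return false;
  }

  // No cudaEventRecord() has been issued for this event.
  cudaError_t waitStatus = cudaStreamWaitEvent(stream, neverRecorded, 0);

  // Returns only once everything queued on the stream is done. With a
  // blocking wait on an event nobody records, this would hang.
  cudaError_t syncStatus = cudaStreamSynchronize(stream);

  cudaEventDestroy(neverRecorded);
  cudaStreamDestroy(stream);
  return waitStatus == cudaSuccess && syncStatus == cudaSuccess;
}
```

This matches what the question reports after deleting the cudaEventRecord calls: the program does not lock up, because each wait has nothing to wait for.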

So we need to reorder these lines so that the record appears in the program before the wait. That is, for the waiting stream to be forced to run after the record, the cudaEventRecord call must come earlier in the host code than the cudaStreamWaitEvent call!

Here's the original code:

  k_A1<<<1,1,0,streamA>>>(); // A1!
  cudaEventRecord(halfA,streamA); // StreamA triggers halfA!
  cudaStreamWaitEvent(streamA,halfB,0); // StreamA waits for halfB.
  k_A2<<<1,1,0,streamA>>>(); // A2!

  cudaStreamWaitEvent(streamB,halfA,0); // StreamB waits, for halfA.
  k_B1<<<1,1,0,streamB>>>(); // B1!
  cudaEventRecord(halfB,streamB); // StreamB triggers halfB!
  k_B2<<<1,1,0,streamB>>>(); // B2!

We see that the record of halfB is issued on the second-to-last line, but the wait on halfB is issued above it, on the third line. No good. So we reorder. The first thing on streamB is the wait on halfA, and the only requirement is that it be issued after the record of halfA. So that line can move up to be the third line.

Likewise, k_B1 can follow it directly. And then the cudaEventRecord for halfB can be moved up before the cudaStreamWaitEvent on halfB. Hmm, does this prevent deadlock, I wonder?

  k_A1<<<1,1,0,streamA>>>(); // A1!
  cudaEventRecord(halfA,streamA); // StreamA triggers halfA!
  cudaStreamWaitEvent(streamB,halfA,0); // StreamB waits, for halfA.
  k_B1<<<1,1,0,streamB>>>(); // B1!
  cudaEventRecord(halfB,streamB); // StreamB triggers halfB!
  cudaStreamWaitEvent(streamA,halfB,0); // StreamA waits for halfB.
  k_A2<<<1,1,0,streamA>>>(); // A2!
  k_B2<<<1,1,0,streamB>>>(); // B2!
