
I am using the CUDA graph stream capture API to implement a small demo with multiple streams. Following the CUDA Programming Guide here, I wrote the complete code. To my understanding, kernelB should execute on stream1, but with nsys I found that kernelB executes on a completely new stream. It is not under my control. The scheduling graph is shown below:

(nsys timeline screenshot)

Here is my code:

#include <iostream>

__global__ void kernelA() {}
__global__ void kernelB() {}
__global__ void kernelC() {}

int main() {
  cudaStream_t stream1, stream2;
  cudaStreamCreate(&stream1);
  cudaStreamCreate(&stream2);

  cudaGraphExec_t graphExec = NULL;
  cudaEvent_t event1, event2;
  cudaEventCreate(&event1);
  cudaEventCreate(&event2);

  for (int i = 0; i < 10; i++) {
    cudaGraph_t graph;
    cudaGraphExecUpdateResult updateResult;
    cudaGraphNode_t errorNode;

    // Capture work submitted to stream1 and any stream it forks to.
    cudaStreamBeginCapture(stream1, cudaStreamCaptureModeGlobal);
    kernelA<<<512, 512, 0, stream1>>>();
    // Fork: stream2 joins the capture by waiting on an event from stream1.
    cudaEventRecord(event1, stream1);
    cudaStreamWaitEvent(stream2, event1, 0);
    kernelB<<<256, 512, 0, stream1>>>();
    kernelC<<<16, 512, 0, stream2>>>();
    // Join: stream1 waits for stream2 before capture ends.
    cudaEventRecord(event2, stream2);
    cudaStreamWaitEvent(stream1, event2, 0);
    cudaStreamEndCapture(stream1, &graph);

    // Try to update the existing executable graph; re-instantiate on failure.
    if (graphExec != NULL) {
      cudaGraphExecUpdate(graphExec, graph, &errorNode, &updateResult);
    }
    if (graphExec == NULL || updateResult != cudaGraphExecUpdateSuccess) {
      if (graphExec != NULL) {
        cudaGraphExecDestroy(graphExec);
      }
      cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);
    }
    cudaGraphDestroy(graph);
    cudaGraphLaunch(graphExec, stream1);
    cudaStreamSynchronize(stream1);
  }

  cudaGraphExecDestroy(graphExec);
  cudaEventDestroy(event1);
  cudaEventDestroy(event2);
  cudaStreamDestroy(stream1);
  cudaStreamDestroy(stream2);
  return 0;
}
The graph stores dependencies between the kernels and executes them accordingly. The dependency "B after A, C after A" seems to be respected, so it's OK. The streams used by the captured graph when it is executed may also differ from the streams the graph was captured on. – Abator Abetor May 17 '22 at 05:15
  • Oh, thanks, so I cannot control this behavior to make it run on the stream I wanted, right? I am wondering if there is some API I don't know about to control this. – poohRui May 17 '22 at 06:29
  • Why do you need a specific stream? Is something running on this stream on execution time of the graph? – Sebastian May 17 '22 at 07:37
  • @Sebastian, I want to do some experiments on combining kernels from different streams to accelerate overall performance, especially in deep learning. So I have to control the scheduling of kernels in each stream. CUDA Graph is useful for reducing CPU overhead, but the scheduling is not under my control, which confused me a lot. – poohRui May 17 '22 at 08:16
  • Yes, but why specific streams? Or can the streams be renamed/renumbered, as long as the distribution over streams stays the same? In the original run you had kernels A and B on one stream and kernel C on the other stream; now B and C are exchanged, but so late that it makes no difference? – Sebastian May 17 '22 at 08:19
  • The streams can be renamed, but the nsys graph shown above suggests the distribution is not the same. Stream1 should have kernelA and kernelB, but now has kernelA and kernelC. It does make a difference, since the overheads of kernelB and kernelC are different. – poohRui May 17 '22 at 08:20
  • Would it be faster with manual control? I think that when replaying graphs it is like a pool of streams, as the graph takes control of sequencing. – Sebastian May 17 '22 at 08:24
  • I am just exploring whether manual control will be faster. You mean CUDA graphs will automatically do the runtime cost analysis for me? – poohRui May 17 '22 at 08:26

2 Answers


"An operation may be scheduled at any time once the nodes on which it depends are complete. Scheduling is left up to the CUDA system." Here.
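In other words, stream capture records only nodes and dependency edges; the stream identities from capture time are not part of the graph, so the CUDA runtime is free to place nodes on whatever internal streams it chooses at launch. A minimal sketch of checking this (based on the question's code, using `cudaGraphGetNodes` and `cudaGraphGetEdges` to inspect the captured structure; kernel and stream names match the question, the printed counts are hardware/driver dependent):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernelA() {}
__global__ void kernelB() {}
__global__ void kernelC() {}

int main() {
  cudaStream_t s1, s2;
  cudaStreamCreate(&s1);
  cudaStreamCreate(&s2);
  cudaEvent_t e1, e2;
  cudaEventCreate(&e1);
  cudaEventCreate(&e2);

  // Same fork/join capture pattern as in the question.
  cudaGraph_t graph;
  cudaStreamBeginCapture(s1, cudaStreamCaptureModeGlobal);
  kernelA<<<512, 512, 0, s1>>>();
  cudaEventRecord(e1, s1);
  cudaStreamWaitEvent(s2, e1, 0);
  kernelB<<<256, 512, 0, s1>>>();
  kernelC<<<16, 512, 0, s2>>>();
  cudaEventRecord(e2, s2);
  cudaStreamWaitEvent(s1, e2, 0);
  cudaStreamEndCapture(s1, &graph);

  // The captured graph stores only nodes and dependency edges;
  // no stream assignment survives capture.
  size_t numNodes = 0, numEdges = 0;
  cudaGraphGetNodes(graph, NULL, &numNodes);
  cudaGraphGetEdges(graph, NULL, NULL, &numEdges);
  printf("captured graph: %zu nodes, %zu edges\n", numNodes, numEdges);

  cudaGraphDestroy(graph);
  cudaEventDestroy(e1);
  cudaEventDestroy(e2);
  cudaStreamDestroy(s1);
  cudaStreamDestroy(s2);
  return 0;
}
```

Since the graph carries no stream information, the only guarantees at launch time are the dependency edges ("B after A, C after A, end after B and C"); which internal stream each kernel lands on is the runtime's choice.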

bob

I also asked on the NVIDIA forums, and Robert answered this question, which helped me a lot. Anyone interested in the scheduling of CUDA graphs can also refer to that answer here.

poohRui