
I am using the CUDA graph stream capture API to implement a small demo with multiple streams. Following the CUDA Programming Guide here, I wrote the complete code. To my understanding, kernelB should execute on stream1, but with nsys I found that kernelB executes on a completely new stream. It is not under my control. The scheduling graph is shown below:

(nsys timeline screenshot)

Here is my code:

#include <iostream>

__global__ void kernelA() {}
__global__ void kernelB() {}
__global__ void kernelC() {}

int main() {
  cudaStream_t stream1, stream2;
  cudaStreamCreate(&stream1);
  cudaStreamCreate(&stream2);

  cudaGraphExec_t graphExec = NULL;
  cudaEvent_t event1, event2;
  cudaEventCreate(&event1);
  cudaEventCreate(&event2);

  for (int i = 0; i < 10; i++) {
    cudaGraph_t graph;
    cudaGraphExecUpdateResult updateResult;
    cudaGraphNode_t errorNode;

    // Capture work submitted to stream1 and any stream it forks to.
    cudaStreamBeginCapture(stream1, cudaStreamCaptureModeGlobal);
    kernelA<<<512, 512, 0, stream1>>>();
    // Fork: stream2 joins the capture by waiting on an event from stream1.
    cudaEventRecord(event1, stream1);
    cudaStreamWaitEvent(stream2, event1, 0);
    kernelB<<<256, 512, 0, stream1>>>();
    kernelC<<<16, 512, 0, stream2>>>();
    // Join: stream1 waits for stream2 before capture ends.
    cudaEventRecord(event2, stream2);
    cudaStreamWaitEvent(stream1, event2, 0);
    cudaStreamEndCapture(stream1, &graph);

    // Try to update the existing executable graph; re-instantiate on failure.
    if (graphExec != NULL) {
      cudaGraphExecUpdate(graphExec, graph, &errorNode, &updateResult);
    }
    if (graphExec == NULL || updateResult != cudaGraphExecUpdateSuccess) {
      if (graphExec != NULL) {
        cudaGraphExecDestroy(graphExec);
      }
      cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);
    }
    cudaGraphDestroy(graph);
    cudaGraphLaunch(graphExec, stream1);
    cudaStreamSynchronize(stream1);
  }

  cudaGraphExecDestroy(graphExec);
  cudaEventDestroy(event1);
  cudaEventDestroy(event2);
  cudaStreamDestroy(stream1);
  cudaStreamDestroy(stream2);
  return 0;
}
The graph stores dependencies between the kernels and executes them accordingly. The dependency "B after A, C after A" seems to be respected, so it's OK. The streams used by the captured graph when it is executed may also differ from the streams the graph was captured on. – Abator Abetor May 17 '22 at 05:15
  • Oh, thanks, so I cannot control this behavior to make it run on the stream I wanted, right? I am wondering if there is some API I don't know about to control this. – poohRui May 17 '22 at 06:29
  • Why do you need a specific stream? Is something running on this stream on execution time of the graph? – Sebastian May 17 '22 at 07:37
  • @Sebastian, I want to do some experiments on combining kernels from different streams to accelerate overall performance, especially in deep learning. So I have to control the scheduling of kernels in each stream. CUDA Graph is useful for reducing CPU overhead, but the scheduling is not under my control, which confused me a lot. – poohRui May 17 '22 at 08:16
  • Yes, but why specific streams? Or can the streams be renamed/renumbered, as long as the distribution over streams stays the same? In the original run you had kernels A and B on one stream and kernel C on the other stream; now B and C are exchanged, but so late that it makes no difference? – Sebastian May 17 '22 at 08:19
  • The streams can be renamed, but the nsys graph shown above suggests the distribution is not the same. Stream1 should have kernelA and kernelB, but now has kernelA and kernelC. It does make a difference, since the overheads of kernelB and kernelC are different. – poohRui May 17 '22 at 08:20
  • Would it be faster with manual control? I think that when replaying graphs it is like a pool of streams, as the graph takes control of sequencing. – Sebastian May 17 '22 at 08:24
  • I am just exploring whether manual control will be faster. You mean CUDA graphs will automatically do the runtime cost analysis for me? – poohRui May 17 '22 at 08:26

2 Answers


"An operation may be scheduled at any time once the nodes on which it depends are complete. Scheduling is left up to the CUDA system." Here.
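In other words, stream capture records only nodes and dependency edges; the stream identities from capture time are not part of the graph, so the CUDA runtime is free to place nodes on whatever internal streams it chooses at launch. A minimal sketch of checking this (based on the question's code, using `cudaGraphGetNodes` and `cudaGraphGetEdges` to inspect the captured structure; kernel and stream names match the question, the printed counts are hardware/driver dependent):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernelA() {}
__global__ void kernelB() {}
__global__ void kernelC() {}

int main() {
  cudaStream_t s1, s2;
  cudaStreamCreate(&s1);
  cudaStreamCreate(&s2);
  cudaEvent_t e1, e2;
  cudaEventCreate(&e1);
  cudaEventCreate(&e2);

  // Same fork/join capture pattern as in the question.
  cudaGraph_t graph;
  cudaStreamBeginCapture(s1, cudaStreamCaptureModeGlobal);
  kernelA<<<512, 512, 0, s1>>>();
  cudaEventRecord(e1, s1);
  cudaStreamWaitEvent(s2, e1, 0);
  kernelB<<<256, 512, 0, s1>>>();
  kernelC<<<16, 512, 0, s2>>>();
  cudaEventRecord(e2, s2);
  cudaStreamWaitEvent(s1, e2, 0);
  cudaStreamEndCapture(s1, &graph);

  // The captured graph stores only nodes and dependency edges;
  // no stream assignment survives capture.
  size_t numNodes = 0, numEdges = 0;
  cudaGraphGetNodes(graph, NULL, &numNodes);
  cudaGraphGetEdges(graph, NULL, NULL, &numEdges);
  printf("captured graph: %zu nodes, %zu edges\n", numNodes, numEdges);

  cudaGraphDestroy(graph);
  cudaEventDestroy(e1);
  cudaEventDestroy(e2);
  cudaStreamDestroy(s1);
  cudaStreamDestroy(s2);
  return 0;
}
```

Since the graph carries no stream information, the only guarantees at launch time are the dependency edges ("B after A, C after A, end after B and C"); which internal stream each kernel lands on is the runtime's choice.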

bob

I also asked on the NVIDIA forums, and Robert answered this question, which helped me a lot. Anyone interested in the scheduling of CUDA graphs can also refer to that answer here.

poohRui