
CUDA 10 added runtime API calls for putting streams (= queues) into "capture mode", so that work enqueued on them is recorded into a "graph" instead of being executed. These graphs can then be instantiated and made to actually execute, or they can be cloned.
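For concreteness, here is a minimal sketch of the capture flow as I understand it (CUDA 10.1-era signatures, including the two-argument cudaStreamBeginCapture and the five-argument cudaGraphInstantiate; the kernels are placeholders and error checking is omitted):

```cpp
#include <cuda_runtime.h>

__global__ void kernelA(float* x) { x[threadIdx.x] += 1.0f; }
__global__ void kernelB(float* x) { x[threadIdx.x] *= 2.0f; }

int main() {
    float* d_data;
    cudaMalloc(&d_data, 256 * sizeof(float));
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Begin capture: work enqueued on `stream` is recorded, not executed.
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    kernelA<<<1, 256, 0, stream>>>(d_data);   // recorded into the graph
    kernelB<<<1, 256, 0, stream>>>(d_data);   // recorded into the graph
    cudaStreamEndCapture(stream, &graph);

    // To actually run the captured work, instantiate and launch the graph.
    cudaGraphExec_t graphExec;
    cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);
    cudaGraphLaunch(graphExec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(d_data);
}
```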

But what is the rationale behind this feature? Isn't it unlikely that you would execute the same "graph" twice? After all, even if you do run the "same code", at least the data will be different, i.e. the parameters the kernels take will likely change. Or am I missing something?

PS - I skimmed this slide deck, but still didn't get it.

einpoklum

5 Answers


Task graphs are quite mutable.

There are API calls for getting and setting the parameters of task graph nodes of various kinds. This lets you use a task graph as a template: instead of enqueueing all the individual operations before every execution, you change the parameters of whichever nodes need it (perhaps not all of them do) and launch the graph again.

For example, see the documentation for cudaGraphHostNodeGetParams and cudaGraphHostNodeSetParams.
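As a hedged illustration of the template pattern with a kernel node (the scale kernel and launch sizes are made up; cudaGraphExecKernelNodeSetParams, available since CUDA 10.1, patches the already-instantiated graph so no re-instantiation is needed; error checking omitted):

```cpp
#include <cuda_runtime.h>

__global__ void scale(float* data, float factor) {
    data[threadIdx.x] *= factor;
}

void runTemplated(float* d_data, cudaStream_t stream) {
    cudaGraph_t graph;
    cudaGraphCreate(&graph, 0);

    // Describe one kernel node; kernelParams points at the argument values.
    float factor = 1.0f;
    void* args[] = { &d_data, &factor };
    cudaKernelNodeParams p = {};
    p.func         = (void*)scale;
    p.gridDim      = dim3(1);
    p.blockDim     = dim3(256);
    p.kernelParams = args;

    cudaGraphNode_t node;
    cudaGraphAddKernelNode(&node, graph, nullptr, 0, &p);

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);

    for (int i = 0; i < 10; ++i) {
        factor = 1.0f + i;  // new argument value for this launch
        // Patch the instantiated graph in place -- no re-instantiation.
        cudaGraphExecKernelNodeSetParams(exec, node, &p);
        cudaGraphLaunch(exec, stream);
        cudaStreamSynchronize(stream);
    }

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
}
```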

einpoklum

  • My experience with graphs is that they are indeed not very mutable. You can change the parameters with 'cudaGraphHostNodeSetParams', but for the change to take effect, I had to rebuild the executable graph with 'cudaGraphInstantiate'. That call takes so long that any gain from using graphs is lost (in my case). Setting the parameters only worked for me when I built the graph manually. When obtaining the graph through stream capture, I was not able to set the parameters of the nodes, as you do not have the node pointers. You would think calling 'cudaGraphGetNodes' on a stream-captured graph would return the nodes, but the node pointer returned was NULL for me even though the 'numNodes' variable had the correct count. The documentation explicitly mentions this as a possibility but fails to explain why.


Another useful feature is concurrent kernel execution. When building a graph manually, one can add nodes with dependencies between them, and the runtime will exploit the available concurrency automatically, using multiple streams. The capability itself is not new, but making it automatic is useful for certain applications.
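A rough sketch of that manual pattern (paramsA/paramsB/paramsC stand for cudaKernelNodeParams structs filled in as usual; the names are illustrative and error checking is omitted):

```cpp
cudaGraph_t graph;
cudaGraphCreate(&graph, 0);

// Two kernel nodes with no edge between them...
cudaGraphNode_t a, b, join;
cudaGraphAddKernelNode(&a, graph, nullptr, 0, &paramsA);  // no dependencies
cudaGraphAddKernelNode(&b, graph, nullptr, 0, &paramsB);  // no dependencies

// ...joined by a third node that depends on both.
cudaGraphNode_t deps[] = { a, b };
cudaGraphAddKernelNode(&join, graph, deps, 2, &paramsC);

// Since `a` and `b` share no dependency, the runtime is free to execute
// them concurrently (on internal streams), even though the instantiated
// graph is ultimately launched into a single stream via cudaGraphLaunch.
```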

Booo
  • Looking at the API reference, it [seems](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__GRAPH.html#group__CUDART__GRAPH_1g1accfe1da0c605a577c22d9751a09597) a graph is launched on a single stream. – einpoklum Oct 18 '19 at 15:09
  • @einpoklum, yes, one stream is needed for the graph launch, but the execution of the kernels is done in multiple streams (if any concurrency exists); you can see that in nvvp. Really useful feature, have tried it a few times already ;) – Booo Oct 18 '19 at 16:26
  • Maybe the comments in this [post](https://devblogs.nvidia.com/cuda-graphs/) are interesting. – Booo Oct 18 '19 at 16:33
  • So why do we even need to specify a stream then? :-( – einpoklum Oct 18 '19 at 16:44
  • I guess the reason is to allow launching multiple graphs with concurrency. There is an explanation in [these slides](http://on-demand.gputechconf.com/gtc-kr/2018/pdf/HPC_Minseok_Lee_NVIDIA.pdf). Quote: "Branches in graph still execute concurrently even though graph is launched into a stream" – Booo Oct 18 '19 at 19:37

When training a deep learning model, one often re-runs the same set of kernels in the same order, just with updated data. I would also expect CUDA to be able to optimize by knowing statically which kernels come next: one can imagine it prefetching more instructions or adapting its scheduling strategy when it knows the whole graph.

Art
  • I wouldn't bet on the optimizations you're hoping for based on knowing which kernels are planned. Just saying. – einpoklum Dec 05 '18 at 08:41

CUDA Graphs tries to solve the problem that, in the presence of many small kernel invocations, you see quite some time spent on the CPU dispatching work for the GPU (overhead).

It allows you to trade resources (time, memory, etc.) to construct a graph of kernels that you can then launch with a single invocation from the CPU, instead of making many separate invocations. If you don't have enough invocations, or if your algorithm is different each time, then building a graph won't be worth it.

This works really well for anything iterative that uses the same computation underneath (e.g., algorithms that need to converge to something) and it's pretty prominent in a lot of applications that are great for GPUs (e.g., think of the Jacobi method).
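As a sketch of that iterative pattern (assuming the per-iteration kernels were already captured and instantiated into exec as shown in other answers, and that residualBelowTolerance() is a hypothetical host-side convergence check):

```cpp
// Replay the instantiated graph once per iteration: each replay costs a
// single CPU-side launch, no matter how many small kernels the graph holds.
for (int iter = 0; iter < maxIters; ++iter) {
    cudaGraphLaunch(exec, stream);   // one launch instead of N kernel launches
    cudaStreamSynchronize(stream);
    if (residualBelowTolerance())    // hypothetical convergence check
        break;
}
```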

You are not going to see great results if you have an algorithm that you invoke only once, or if your kernels are big; in those cases, the CPU invocation overhead is not your bottleneck. A succinct explanation of when you need it can be found in Getting Started with CUDA Graphs.

Where task-graph-based paradigms really shine, though, is when you define your program as tasks with dependencies between them. That gives the driver / scheduler / hardware a lot of flexibility to do the scheduling itself, without much fine-tuning on the developer's part. There's a reason we have spent years exploring the ideas of dataflow programming in HPC.

ipapadop
  • I don't see how CUDA graphs solve the problem of "too many small kernels" any differently than asynchronously dispatching work to the GPU. After all, every asynchronous kernel launch on a queue corresponds to adding a node to the graph, and vice versa; and the same goes for event dependencies and graph edges. – einpoklum Feb 04 '21 at 17:50
  • If you have 2 kernels, without CUDA graphs you pay the cost of an async invocation 2 times. With CUDA graphs, you'll pay it X < 2 times (there is one async invocation for both kernels). What X is depends on how many shortcuts the NVIDIA software stack can take. – ipapadop Feb 05 '21 at 18:20