
CUDA graphs are a new way to compose a complex workload from multiple operations. With "stream capture", it appears that you can run a mix of operations, including cuBLAS and similar library calls, and capture them as a single "meta-kernel".

What's unclear to me is how the data flow works for these graphs. In the capture phase, I allocate memory A for the input, memory B for the temporary values, and memory C for the output. But when I capture this in a graph, I don't capture the memory allocations. So when I then instantiate multiple copies of this graph, they cannot share the input memory A, the temporary workspace B, or the output memory C.

How then does this work? For instance, when I call cudaGraphLaunch, I don't see a way to provide input parameters. My captured graph basically starts with a cudaMemcpyHostToDevice; how does the graph know which host memory to copy, and where to put it?
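
For concreteness, here is a stripped-down sketch of the kind of capture I mean (the `scale` kernel is just a stand-in for my real kernels, and the host buffers are assumed to be pinned):

```
#include <cuda_runtime.h>

__global__ void scale(float *out, const float *in, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * factor;
}

void capture_and_run(const float *hostIn, float *hostOut, int n)
{
    float *A, *B, *C;                                   // input, temporary, output
    cudaMalloc(&A, n * sizeof(float));
    cudaMalloc(&B, n * sizeof(float));
    cudaMalloc(&C, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaGraph_t graph;
    cudaStreamBeginCapture(stream);   // CUDA 10.0 signature; later versions also take a capture mode
    cudaMemcpyAsync(A, hostIn, n * sizeof(float), cudaMemcpyHostToDevice, stream);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(B, A, 2.0f, n);   // A -> B
    scale<<<(n + 255) / 256, 256, 0, stream>>>(C, B, 3.0f, n);   // B -> C
    cudaMemcpyAsync(hostOut, C, n * sizeof(float), cudaMemcpyDeviceToHost, stream);
    cudaStreamEndCapture(stream, &graph);

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);

    // The launch takes no data arguments, so the pointers captured above
    // (A, B, C, hostIn, hostOut) appear to be baked into the executable graph.
    cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);
}
```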

Background: I found that CUDA is heavily bottlenecked on kernel launches; my AVX2 code became 13× slower when ported to CUDA. The kernels themselves seem fine (according to Nsight); it's just the overhead of scheduling several hundred thousand kernel launches.

MSalters

1 Answer


A memory allocation would typically be done outside of a graph definition/instantiation or "capture".

However, graphs provide for "memory copy" nodes, where you would typically perform cudaMemcpy-type operations.

At the time of graph definition, you pass a set of arguments for each graph node (which will depend on the node type, e.g. arguments for the cudaMemcpy operation, if it is a memory copy node, or kernel arguments if it is a kernel node). These arguments determine the actual memory allocations that will be used when that graph is executed.
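
For illustration, a rough sketch of explicit node creation (the `scaleInPlace` kernel and the buffer names here are placeholders, not taken from any particular sample). Note that the copy parameters and the kernel argument array, including the device pointer, are supplied when the nodes are added, and that the buffer itself is allocated outside the graph:

```
#include <cuda_runtime.h>

__global__ void scaleInPlace(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// d_buf is assumed to have been allocated with cudaMalloc outside of graph construction.
void buildAndLaunch(float *hostIn, float *d_buf, int n, cudaStream_t stream)
{
    cudaGraph_t graph;
    cudaGraphCreate(&graph, 0);

    // Memory copy node: the host and device pointers are part of the node's parameters.
    cudaMemcpy3DParms copyParams = {};
    copyParams.srcPtr = make_cudaPitchedPtr(hostIn, n * sizeof(float), n, 1);
    copyParams.dstPtr = make_cudaPitchedPtr(d_buf,  n * sizeof(float), n, 1);
    copyParams.extent = make_cudaExtent(n * sizeof(float), 1, 1);
    copyParams.kind   = cudaMemcpyHostToDevice;

    cudaGraphNode_t copyNode;
    cudaGraphAddMemcpyNode(&copyNode, graph, nullptr, 0, &copyParams);

    // Kernel node: the kernel arguments (including d_buf) are likewise fixed here.
    float factor = 2.0f;
    void *kernelArgs[] = { &d_buf, &factor, &n };
    cudaKernelNodeParams kParams = {};
    kParams.func           = (void *)scaleInPlace;
    kParams.gridDim        = dim3((n + 255) / 256);
    kParams.blockDim       = dim3(256);
    kParams.sharedMemBytes = 0;
    kParams.kernelParams   = kernelArgs;
    kParams.extra          = nullptr;

    // The kernel node depends on the copy node (an execution dependency).
    cudaGraphNode_t kernelNode;
    cudaGraphAddKernelNode(&kernelNode, graph, &copyNode, 1, &kParams);

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);
    cudaGraphLaunch(exec, stream);
}
```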

If you wanted to use a different set of allocations, one method would be to instantiate another graph with different arguments for the nodes where there are changes. This could be done by repeating the entire process, or by starting with an existing graph, making changes to node arguments, and then instantiating a graph with those changes.
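
A sketch of that second approach, assuming a graph and kernel node built as in the previous snippet (the helper name and extra parameters are only illustrative; cudaGraphMemcpyNodeSetParams works analogously for copy nodes):

```
#include <cuda_runtime.h>

// Re-point an existing kernel node at a different buffer, then instantiate a
// second executable graph. The originally instantiated cudaGraphExec_t is unaffected.
void instantiateWithOtherBuffer(cudaGraph_t graph, cudaGraphNode_t kernelNode,
                                float *d_otherBuf, float factor, int n,
                                cudaGraphExec_t *execOut)
{
    // Start from the node's current parameters, then swap in a new argument list.
    cudaKernelNodeParams kParams;
    cudaGraphKernelNodeGetParams(kernelNode, &kParams);

    void *newArgs[] = { &d_otherBuf, &factor, &n };   // must match the kernel's signature
    kParams.kernelParams = newArgs;
    cudaGraphKernelNodeSetParams(kernelNode, &kParams);

    // Instantiating again yields an executable graph that uses the new allocation.
    cudaGraphInstantiate(execOut, graph, nullptr, nullptr, 0);
}
```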

Currently, in CUDA graphs, it is not possible to perform runtime binding (i.e. at the point of graph "launch") of node arguments to a particular graph/node. It's possible that new features may be introduced in future releases, of course.

Note that there is a CUDA sample code called simpleCudaGraphs, available in CUDA 10, which demonstrates the use of both memory copy nodes and kernel nodes, as well as how to create dependencies (effectively execution dependencies) between nodes.

Robert Crovella
  • Slightly disappointing - I can't see why you would ever bother with multiple graphs trying to overwrite each other's results. And "making changes to the node arguments" doesn't look practically feasible when one of the nodes is a cublas `gemm` node; the generic graph interface doesn't know about the arguments of specific kernels. – MSalters Oct 12 '18 at 08:48
  • If capturing the arguments for all nodes is hard, an easier way might be to allow for a base pointer and capture the offsets from it. This way, you can allocate multiple arenas and launch a graph on each of them. – Tapan Chugh Oct 15 '18 at 07:56
  • In profiling I now see that I need dozens of `cudaGraph_t` objects (one per host CPU core), tens of thousands of `cudaGraphExec_t` (one per host CPU thread), and a complicated cache to avoid discarding graphs. The graph cache contention alone is a performance bottleneck. Also bad: you need to capture multiple `cudaGraph_t` objects, one per set of memory buffers, but `cudaStreamBeginCapture` is not thread/stream-safe. You get weird errors if you run two captures at the same time, even when they're on different threads and streams. – MSalters Oct 29 '18 at 12:25
  • Perhaps this answer should be updated by this point. – einpoklum Jul 23 '22 at 16:33