Questions tagged [cuda-graphs]

Directed graphs whose nodes are CUDA operations (kernel launches, memory copies, etc.) and whose edges are the dependencies between them. They are used to schedule multiple interdependent operations for execution on GPUs with a single API call.
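As a rough illustration of the description above, here is a minimal sketch that records two dependent kernel launches into a graph via stream capture and then replays them with one `cudaGraphLaunch` call per iteration. The kernel `addOne`, the helper `runGraphDemo`, and the `CHECK` macro are hypothetical names made up for this example, and the sketch assumes a toolkit recent enough to provide `cudaGraphInstantiateWithFlags` (CUDA 11.4 or later).

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails (illustrative helper).
#define CHECK(call)                                                \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            std::fprintf(stderr, "%s failed: %s\n", #call,         \
                         cudaGetErrorString(err_));                \
            std::exit(EXIT_FAILURE);                               \
        }                                                          \
    } while (0)

// Hypothetical kernel: adds 1 to every element of data.
__global__ void addOne(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Captures two dependent launches of addOne into a graph, replays the
// graph `launches` times, and returns the final value of element 0.
int runGraphDemo(int launches) {
    const int n = 256;
    int* d = nullptr;
    CHECK(cudaMalloc(&d, n * sizeof(int)));
    CHECK(cudaMemset(d, 0, n * sizeof(int)));

    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    // Work issued to the stream between Begin/EndCapture is recorded as
    // graph nodes instead of being executed. Stream order becomes the
    // dependency edge between the two kernel nodes.
    cudaGraph_t graph;
    CHECK(cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal));
    addOne<<<1, n, 0, stream>>>(d, n);
    addOne<<<1, n, 0, stream>>>(d, n);
    CHECK(cudaStreamEndCapture(stream, &graph));

    // Instantiate once, then launch the whole graph with a single call.
    cudaGraphExec_t exec;
    CHECK(cudaGraphInstantiateWithFlags(&exec, graph, 0));
    for (int i = 0; i < launches; ++i) {
        CHECK(cudaGraphLaunch(exec, stream));
    }
    CHECK(cudaStreamSynchronize(stream));

    int first = -1;
    CHECK(cudaMemcpy(&first, d, sizeof(int), cudaMemcpyDeviceToHost));

    CHECK(cudaGraphExecDestroy(exec));
    CHECK(cudaGraphDestroy(graph));
    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(d));
    return first;
}
```

Each graph launch runs both kernels, so calling `runGraphDemo(3)` from a `main` function should return 6 on a working device. The same graph could also be built explicitly with the node-based API (for example `cudaGraphAddKernelNode`); stream capture is used here only because it is shorter.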

18 questions
3
votes
1 answer

Catching an exception thrown from a callback in cudaLaunchHostFunc

I want to check for an error flag living in managed memory that might have been written by a kernel running on a certain stream. Depending on the error flag I need to throw an exception. I would simply sync this stream and check the flag from the…
Raul
  • 41
  • 4
3
votes
1 answer

Call graphs for CUDA

I am trying to generate call graphs for some CUDA code with egypt, but the usual way doesn't seem to work (since nvcc doesn't have any flag that can do the same thing as -fdump-rtl-expand). More details: I have a really large code (of…
Kostis
  • 447
  • 4
  • 13
2
votes
2 answers

Using multi streams in cuda graph, the execution order is uncontrolled

I am using the CUDA graph stream capture API to implement a small demo with multiple streams. Following the CUDA Programming Guide, I wrote the complete code. As far as I know, kernelB should execute on stream1, but with nsys I found kernelB is…
poohRui
  • 613
  • 5
  • 9
2
votes
1 answer

How do the nodes in a CUDA graph connect?

CUDA graphs are a new way to synthesize complex operations from multiple operations. With "stream capture", it appears that you can run a mix of operations, including cuBLAS and similar library operations and capture them as a singe…
MSalters
  • 173,980
  • 10
  • 155
  • 350
1
vote
1 answer

Using a loop in a CUDA graph

I have kernels A, B, and C, which need to be executed sequentially: A->B->C. They are executed in a while loop until some condition is met: while(predicate) { A->B->C }. The while loop may be executed from 3 to 2000 times - information about a…
1
vote
2 answers

cudaGraph: Multi-threaded stream capturing causes errors only when run in cuda-memcheck

I have a program where multiple host threads try to capture a cuda graph and execute it. It produces the correct results, but it cannot be run with cuda-memcheck. When run with cuda-memcheck, the following error appears. Program hit…
Abator Abetor
  • 2,345
  • 1
  • 10
  • 12
1
vote
5 answers

What is the use of task graphs in CUDA 10?

CUDA 10 added runtime API calls for putting streams (= queues) in "capture mode", so that instead of executing, they are returned in a "graph". These graphs can then be made to actually execute, or they can be cloned. But what is the rationale…
einpoklum
  • 118,144
  • 57
  • 340
  • 684
0
votes
1 answer

Behavior of cudaGraphInstantiateFlagUseNodePriority

My understanding of cudaGraphInstantiateFlagUseNodePriority is that it prioritizes the kernel calls. For example, say we have three independent kernels in a cudaGraph (first, second and third), and each kernel waits for 1 s and prints its name. If we set kernel graph…
0
votes
1 answer

Is it possible to execute more than one CUDA graph's host execution node in different streams concurrently?

Investigating possible solutions for this problem, I thought about using CUDA graphs' host execution nodes (cudaGraphAddHostNode). I was hoping to have the option to block and unblock streams on the host side instead of the device side with the wait…
surabax
  • 15
  • 5
0
votes
1 answer

What should I set the flags field of CUDA_BATCH_MEM_OP_NODE_PARAMS?

The CUDA graph API exposes a function call for adding a "batch memory operations" node to a graph: CUresult cuGraphAddBatchMemOpNode ( CUgraphNode* phGraphNode, CUgraph hGraph, const CUgraphNode* dependencies, size_t numDependencies,…
einpoklum
  • 118,144
  • 57
  • 340
  • 684
0
votes
1 answer

What type should be pointed to for the result of cuDeviceGetGraphMemAttribute()?

cuDeviceGetGraphMemAttribute() takes a void pointer to a result variable. But - what type does it expect the pointed-to value to be? The documentation (for CUDA v12.0) doesn't say. I'm guessing it's an unsigned 64-bit type, but I want to make sure.
einpoklum
  • 118,144
  • 57
  • 340
  • 684
0
votes
1 answer

How can I tell whether a copy-node search failed, or whether my node or graph are invalid?

Consider the CUDA graphs API function cuFindNodeInClone(). The documentation says that it: Returns: CUDA_SUCCESS, CUDA_ERROR_INVALID_VALUE. This seems problematic to me. How can I tell whether the search failed (e.g. because there is no copy of…
einpoklum
  • 118,144
  • 57
  • 340
  • 684
0
votes
1 answer

Simple CUDA graph example doesn't produce expected result

I am testing out cuda graphs. My graph is as follows. the code for this is as follows #include #include #include #include #include #define NumThreads 20 #define NumBlocks 1 template
0
votes
1 answer

Error with a captured CUDA graph and asynchronous memory allocations in a loop

I am trying to implement a CUDA graph experiment. There are three kernels: kernel_0, kernel_1, and kernel_2. They are executed sequentially and have dependencies. Right now I am only going to capture kernel_1. This is my code: #include…
kingwales
  • 129
  • 8
0
votes
1 answer

CUDA Graph Problem: Results not computed for the first iteration

I am trying to utilize CUDA Graphs for the computation of Fast Fourier Transform (FFT) using CUDA's cuFFT APIs. I modified the sample FFT code present on Github into the following FFT code using CUDA Graphs: #include #include…
skm
  • 5,015
  • 8
  • 43
  • 104