
I have kernel A, B, and C which need to be executed sequentially.

A->B->C

They are executed in a while loop until some condition will be met.

while(predicate) {
    A->B->C
}

The while loop may execute anywhere from 3 to 2000 times; the information that the loop should stop is produced by kernel C.
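To make the setup concrete, here is a minimal sketch of the baseline (non-graph) version of this loop. The kernel names `A`, `B`, `C` and the `d_stop` flag are placeholders for the kernels described above, not real code from the question:

```cuda
#include <cuda_runtime.h>

__global__ void A() { /* ... */ }
__global__ void B() { /* ... */ }
__global__ void C(int *stop) { /* ... set *stop when the condition is met ... */ }

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    int *d_stop, h_stop = 0;
    cudaMalloc(&d_stop, sizeof(int));
    cudaMemset(d_stop, 0, sizeof(int));

    do {
        A<<<1, 256, 0, stream>>>();
        B<<<1, 256, 0, stream>>>();
        C<<<1, 256, 0, stream>>>(d_stop);
        // Read the stop flag back each iteration. This round trip to the
        // host is the per-iteration overhead a graph would help amortize.
        cudaMemcpyAsync(&h_stop, d_stop, sizeof(int),
                        cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);
    } while (!h_stop);

    cudaFree(d_stop);
    cudaStreamDestroy(stream);
    return 0;
}
```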

As the execution consists of many invocations of relatively small kernels, a CUDA Graph sounds like a good idea. However, the CUDA graph examples I have seen are all linear or tree-like, without loops.

Generally, if a loop is not possible, a long chain of 2000 kernels with the possibility of an early stop triggered by kernel C would also be OK. However, is it possible to stop graph execution at some point by a call from inside a kernel?


1 Answer


CUDA graphs have no conditionals. A vertex of the graph is visited/executed when its predecessors are complete, and that's that. So, fundamentally, you cannot do this with a CUDA graph.

What can you do?

  1. Have a smaller graph for the loop iteration, and repeatedly schedule it.
  2. Have A, B and C start their execution by checking the loop predicate - and skip all their work once it indicates the loop is done. With that in place, you can schedule many instances of A->B->C->A->B->C etc. - which, from some point on, will do nothing.
  3. Don't rely on the CUDA graphs API. It's not a general-purpose parallel execution mechanism. :-(
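Option 1 can be sketched roughly as follows: capture one A->B->C iteration into a graph via stream capture, instantiate it once, then relaunch the executable graph until the predicate (read back from kernel C) says to stop. Kernel names, launch dimensions, and `d_stop` are placeholder assumptions:

```cuda
cudaStream_t stream;
cudaStreamCreate(&stream);

// Capture one loop iteration (A -> B -> C) into a graph.
cudaGraph_t graph;
cudaGraphExec_t graphExec;
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
A<<<1, 256, 0, stream>>>();
B<<<1, 256, 0, stream>>>();
C<<<1, 256, 0, stream>>>(d_stop);
cudaStreamEndCapture(stream, &graph);

// Instantiate once; launching the executable graph is cheap afterwards.
cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);

int h_stop = 0;
do {
    cudaGraphLaunch(graphExec, stream);
    cudaMemcpyAsync(&h_stop, d_stop, sizeof(int),
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);
} while (!h_stop);

cudaGraphExecDestroy(graphExec);
cudaGraphDestroy(graph);
```

The loop itself stays on the host, but each iteration is a single graph launch instead of three separate kernel launches.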
einpoklum
  • Thanks! Would it be possible to use events for idea 2? I am afraid that if the kernels run continuously but do real work only for a small fraction of the time, it will reduce the occupancy of the working kernels - I am thinking about these: https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EVENT.html – Jakub Mitura Jan 17 '22 at 16:37
  • 1
    @JakubMitura: Within a kernel, you can't work with events. What you can do is have a flag in global device memory (or in device-accessible host memory), and read that. – einpoklum Jan 17 '22 at 19:01
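The flag approach from the comment might look roughly like this: each kernel first reads a "done" flag in global device memory and returns immediately once it is set, so extra pre-scheduled A->B->C instances become cheap no-ops. The flag name and kernel bodies are illustrative assumptions, not code from the thread:

```cuda
// Flag in global device memory, visible to all kernels in the chain.
__device__ int d_done = 0;

__global__ void A() {
    if (d_done) return;   // predicate already met: skip all work
    /* ... real work of A ... */
}

__global__ void C() {
    if (d_done) return;
    /* ... real work of C, which evaluates the stopping condition ... */
    bool finished = /* stopping condition, computed by C */ false;
    if (finished && threadIdx.x == 0 && blockIdx.x == 0)
        d_done = 1;       // later kernels in the chain will early-out
}
```

A kernel that exits after a single global load is very cheap, so the "wasted" tail of the chain mostly costs launch overhead rather than occupancy.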