Investigating possible solutions for this problem, I thought about using CUDA graphs' host execution nodes (cudaGraphAddHostNode). I was hoping this would let me block and unblock streams from the host side, instead of on the device side with a wait kernel, while still using graphs.
I made a test program with two graphs. The first graph performs a host-to-device copy, then runs a host function (added as a host execution node) that spins in a loop until an "event" variable stops being 0 (i.e. it waits for the "event"), then performs a device-to-host copy. The second graph performs a memset on the device memory, then runs a host function that sets the "event" variable to 1 (i.e. it signals the "event"). I launch the first graph on one stream and the second on another, then synchronize on the first stream. A minimal sketch of the test follows.
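Roughly, the test looks like this (a sketch rather than my exact program: the name g_event, the buffer size, the use of std::atomic, and cudaGraphInstantiateWithFlags, which needs CUDA 11.4+, are choices I made here; error checking is omitted for brevity):

```cpp
#include <cuda_runtime.h>
#include <atomic>
#include <cstdio>

// The "event" variable the host nodes communicate through.
static std::atomic<int> g_event{0};

// Host node callback for graph 1: spin until the "event" is signaled.
static void CUDART_CB waitFn(void*)   { while (g_event.load() == 0) { } }
// Host node callback for graph 2: signal the "event".
static void CUDART_CB signalFn(void*) { g_event.store(1); }

int main() {
    const size_t N = 256;
    char* hostBuf;
    char* devBuf;
    cudaMallocHost(&hostBuf, N);  // pinned host memory
    cudaMalloc(&devBuf, N);

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Graph 1: H2D copy -> host "wait" node -> D2H copy.
    cudaGraph_t g1;
    cudaGraphCreate(&g1, 0);
    cudaGraphNode_t h2d, waitNode, d2h;
    cudaGraphAddMemcpyNode1D(&h2d, g1, nullptr, 0,
                             devBuf, hostBuf, N, cudaMemcpyHostToDevice);
    cudaHostNodeParams waitParams = {waitFn, nullptr};
    cudaGraphAddHostNode(&waitNode, g1, &h2d, 1, &waitParams);
    cudaGraphAddMemcpyNode1D(&d2h, g1, &waitNode, 1,
                             hostBuf, devBuf, N, cudaMemcpyDeviceToHost);

    // Graph 2: device memset -> host "signal" node.
    cudaGraph_t g2;
    cudaGraphCreate(&g2, 0);
    cudaGraphNode_t msetNode, sigNode;
    cudaMemsetParams mp = {};
    mp.dst = devBuf;
    mp.value = 0xFF;
    mp.elementSize = 1;
    mp.width = N;
    mp.height = 1;
    cudaGraphAddMemsetNode(&msetNode, g2, nullptr, 0, &mp);
    cudaHostNodeParams sigParams = {signalFn, nullptr};
    cudaGraphAddHostNode(&sigNode, g2, &msetNode, 1, &sigParams);

    cudaGraphExec_t e1, e2;
    cudaGraphInstantiateWithFlags(&e1, g1, 0);
    cudaGraphInstantiateWithFlags(&e2, g2, 0);

    cudaGraphLaunch(e1, s1);    // blocks in waitFn after the H2D copy
    cudaGraphLaunch(e2, s2);    // memset runs, but signalFn never does
    cudaStreamSynchronize(s1);  // hangs here
    printf("done\n");           // never reached
    return 0;
}
```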
Both graphs launched as expected, the "wait" host function ran, and the first stream blocked as intended. However, even though the second graph launched, the "signal" host function never executed.
My conclusion is that CUDA's implementation likely serializes all host execution nodes in the context, so the "signal" node waits forever for the "wait" node to finish before it can execute. The documentation even says that "host functions without a mandated order (such as in independent streams) execute in undefined order and may be serialized".
I also tried launching the graphs from separate host threads, but the result was the same hang.
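The threaded variant looked roughly like this (continuing the sketch above, so e1/e2 and s1/s2 are already set up):

```cpp
#include <thread>

// Launch each graph, and its synchronization, from its own host thread.
std::thread t1([&] {
    cudaGraphLaunch(e1, s1);
    cudaStreamSynchronize(s1);  // still hangs in waitFn
});
std::thread t2([&] {
    cudaGraphLaunch(e2, s2);    // signalFn still never runs
});
t2.join();
t1.join();
```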
Is there some way I'm missing to make host execution nodes on different streams run concurrently?