In CUDA programming, I am trying to reduce the synchronization overhead between off-chip memory and on-chip memory when there is a data dependency between two kernels. What are the differences between kernel fusion and persistent threads/kernels in this situation?
1 Answer
The idea behind kernel fusion is to take two (or more) discrete operations, that could be realized (and might already be realized) in separate kernels, and combine them so the operations all happen in a single kernel.
The benefits of this may or may not seem obvious, so I refer you to this writeup.
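As a rough illustration (the kernel names and the add/scale operations are invented for this sketch, not taken from the writeup), two element-wise kernels and their fused equivalent might look like this:

```cuda
// Unfused: the intermediate result t must round-trip through global memory.
__global__ void add_kernel(const float *a, const float *b, float *t, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) t[i] = a[i] + b[i];
}

__global__ void scale_kernel(const float *t, float *out, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = t[i] * s;
}

// Fused: the intermediate value stays in a register, saving one global
// store and one global load per element, plus one kernel launch.
__global__ void add_scale_fused(const float *a, const float *b, float *out,
                                float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float t = a[i] + b[i];   // intermediate kept in a register
        out[i] = t * s;
    }
}
```

In the unfused version the intermediate array t must be written to global memory by the first kernel and read back by the second; in the fused version that value never leaves a register, and one kernel launch disappears as well.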
Persistent threads/Persistent kernel is a kernel design strategy that allows the kernel to continue execution indefinitely. Typical "ordinary" kernel design focuses on solving a particular task, and when that task is done, the kernel exits (at the closing curly-brace of your kernel code).
A persistent kernel, however, has a governing loop in it that only ends when signaled - otherwise it runs indefinitely. People often connect this with the producer-consumer model of application design: something (host code) produces data, and your persistent kernel consumes that data and produces results. This producer-consumer model can run indefinitely. When there is no data to consume, the consumer (your persistent kernel) simply waits in a loop for new data to be presented.
Persistent kernel design has a number of important considerations, which I won't try to list here but instead refer you to this longer writeup/example.
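To make the governing-loop idea concrete, here is a minimal single-block sketch (my own illustration, not production code). It assumes the flag lives in mapped (zero-copy) or managed memory that the host also reads and writes, with a made-up protocol of 0 = idle, 1 = work ready, -1 = quit, and the "work" is just doubling a buffer:

```cuda
// Hypothetical single-block persistent consumer kernel; launch with <<<1, THREADS>>>.
// Flag protocol (my own convention): 0 = idle, 1 = work ready, -1 = quit.
__global__ void persistent_consumer(volatile int *flag, float *data, int n)
{
    __shared__ int cmd;

    while (true) {                               // governing loop
        if (threadIdx.x == 0) {
            while (*flag == 0) { }               // spin until the host posts work
            cmd = *flag;                         // broadcast the command to the block
        }
        __syncthreads();
        if (cmd == -1) break;                    // host signaled "quit"

        for (int i = threadIdx.x; i < n; i += blockDim.x)
            data[i] *= 2.0f;                     // "consume" the posted batch

        __syncthreads();                         // whole block finished the batch
        __threadfence_system();                  // make results visible to the host
        if (threadIdx.x == 0) *flag = 0;         // tell the host we're ready for more
    }
}
```

The host launches this kernel once in its own stream and then drives it by flipping the flag, with no further kernel launches. A real design needs much more care (memory ordering, occupancy, forward progress), which is the sort of thing the writeup mentioned above goes into.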
Benefits:
Kernel fusion may combine work into a single kernel so as to increase performance by eliminating unnecessary loads and stores: the data being operated on can be preserved in place in device registers or shared memory instead of making a round trip through off-chip (global) memory.
Persistent kernels may have a variety of benefits. They may reduce the latency associated with processing data, because the CUDA kernel launch overhead is no longer incurred for each new batch of work. Another possible performance factor is the ability to retain state (similar to kernel fusion) in device registers or shared memory.
Kernel fusion doesn't necessarily imply a persistent kernel. You may simply be combining a set of tasks into a single kernel. A persistent kernel doesn't necessarily imply fusion of separate computation tasks - there may be only 1 "task" that you are performing in a governing "consumer" loop.
But there is obviously considerable conceptual overlap between the two ideas.

- If the state dependencies are only from layer to layer, and don't "persist" from one pass through the network to the next, then kernel fusion makes sense. NVIDIA TensorRT uses [this methodology](https://docs.nvidia.com/deeplearning/sdk/tensorrt-best-practices/index.html#layer-fusion) for inferencing, for example. But if the state dependencies persist from one pass through the network to the next, e.g. for training, then a persistent kernel design to preserve e.g. "weights", as described by [Baidu](https://pdfs.semanticscholar.org/5617/2b6fe2613c37d9790bde8ab6ccda14b35678.pdf), may be sensible. – Robert Crovella Oct 15 '19 at 22:27
- Hi Robert, thanks for the clarification, it really did help me. Here is one case: for a DNN or RNN, if there are dependencies between kernels, let's say that some layers need the output from previous layers, which technique is better for avoiding the synchronization overhead? – hash join Oct 15 '19 at 22:31
- Hi, thanks. I also notice that these techniques are mostly applied to RNNs and rarely to DNNs. Is there anything specific behind that? – hash join Oct 15 '19 at 22:35
- For kernel fusion: if we fuse multiple kernels, we obtain a larger one. What if the GPU memory is not sufficient to host the larger one? – hash join Oct 15 '19 at 22:37
- I don't think there is anything specific to RNN or DNN in all of this. The comments about TensorRT certainly apply to DNN. If the GPU memory is not sufficient to accomplish what you are trying to do, then you obviously can't do that, and will need to come up with some other approach. Modern GPUs support oversubscription of memory, for example (a sketch of this follows after these comments). – Robert Crovella Oct 15 '19 at 22:57
- [paper](http://proceedings.mlr.press/v48/diamos16.pdf) and [code](https://github.com/baidu-research/persistent-rnn) – Robert Crovella Oct 15 '19 at 23:04
- Hi Robert, if the GPU memory is not sufficient to host the fused kernel in RNN training, would a persistent-thread design be the better way to avoid this synchronization overhead, i.e. as a back-up solution? – hash join Oct 16 '19 at 15:58
- Hi Robert, can you please have a look at my case? My GPU memory cannot host a large fused kernel. – hash join Oct 16 '19 at 19:10
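For reference, a minimal sketch of the memory oversubscription mentioned in the comments above. It assumes a GPU and driver that support Unified Memory oversubscription (Pascal or newer on a supported OS); the 32 GiB size is just an illustrative value larger than typical device memory.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    // Request more managed memory than the GPU physically has; on devices
    // that support Unified Memory oversubscription, pages are migrated
    // between host and device on demand as kernels touch them.
    size_t bytes = 32ull * 1024 * 1024 * 1024;   // illustrative 32 GiB
    float *p = nullptr;
    cudaError_t err = cudaMallocManaged(&p, bytes);
    if (err != cudaSuccess) {
        printf("allocation failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // ... launch kernels on p as usual; the driver pages data in and out ...
    cudaFree(p);
    return 0;
}
```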