
I've been researching how to stream datasets larger than available GPU memory to the device for basic computations. One of the main limitations is that the PCIe bus is generally limited to around 8 GB/s. Kernel fusion can help because it allows reusable data to stay on the device and exploits shared memory and locality within the GPU. Most research papers I have found are very difficult to understand, and most of them implement fusion in complex applications, such as https://ieeexplore.ieee.org/document/6270615 . I've read many papers, and they all fail to explain the simple steps needed to fuse two kernels together.

My question is: how does fusion actually work? What are the steps one would go through to change normal kernels into a fused kernel? Also, is it necessary to have more than one kernel in order to fuse, or is fusion just a fancy term for eliminating some memory-bound issues and exploiting locality and shared memory?

I need to understand how kernel fusion is used in a basic CUDA program, such as matrix multiplication or addition and subtraction kernels. A really simple example (simplified and without error checking, but it should give the idea) like:

int *device_A;
int *device_B;
int *device_C;

cudaMalloc(&device_A, N*sizeof(int));
cudaMalloc(&device_B, N*sizeof(int));
cudaMalloc(&device_C, N*sizeof(int));

cudaMemcpyAsync(device_A, host_A, N*sizeof(int), cudaMemcpyHostToDevice, stream);
cudaMemcpyAsync(device_B, host_B, N*sizeof(int), cudaMemcpyHostToDevice, stream);

KernelAdd<<<blocks, threads, 0, stream>>>(device_A, device_B, device_C); // put result in C
KernelSubtract<<<blocks, threads, 0, stream>>>(device_C);                // modify C in place

cudaMemcpyAsync(host_C, device_C, N*sizeof(int), cudaMemcpyDeviceToHost, stream); // send final result over PCIe to the CPU
Gamma
  • Nitpick: PCIe gen 3 x16 links are typically capable of moving data to/from a GPU at 12 GB/sec. PCIe links are full duplex, so for GPUs with two DMA engines up to 25 GB/sec can be transferred in some applications. If there are "many papers" on the benefits of kernel fusion, please provide citations for at least two. I am not aware that kernel fusion is a *general* method of improving the performance of memory-bound kernels, but I'd be happy to be proven wrong in that assessment. – njuffa Nov 14 '18 at 17:40
  • Could you recommend some examples of improving memory-bound programs? That would be very useful. Here are a couple of links: http://delivery.acm.org/10.1145/2690000/2683615/p191-wahib.pdf?ip=94.8.244.247&id=2683615&acc=ACTIVE%20SERVICE&key=C2D842D97AC95F7A%2E8C2422C056BE0E73%2E1A88D821C8341A94%2E4D4702B0C3E38B35&__acm__=1542214780_5f7e6350dfe97d619c7d386b1dda540b and https://link.springer.com/article/10.1007/s11227-015-1483-z and "Automated GPU Kernel Transformations in Large-Scale Production Stencil Applications", DOI: http://dx.doi.org/10.1145/2749246.2749255 – Gamma Nov 14 '18 at 18:57
  • Additional information should be added to the question for clarification. Proper citations with underlying links are preferred to naked links (links can rot away). I *assume* that what these papers suggest is: instead of (kernel 1) take array A, apply processing step 1, produce array B; (kernel 2) take array B, apply processing step 2, produce array C; use (kernel 1&2) take array A, apply processing steps 1 and 2, produce array C. The fusion of kernels 1 and 2 therefore reduces memory bandwidth requirements. – njuffa Nov 14 '18 at 19:31

1 Answer


The basic idea behind kernel fusion is that two or more kernels are converted into a single kernel, so that their operations are combined. Initially it may not be obvious what the benefit is, but it can provide two related kinds of benefits:

  1. reusing data that a kernel has already placed in registers or shared memory (see the sketch after this list)
  2. reducing (i.e. eliminating) "redundant" loads and stores
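
For benefit 1, here is a minimal sketch (my own illustration, not code from the original answer) of fusing an element-wise "square" step with a block-level sum reduction: the squared intermediate is never written to global memory, staying first in a register and then in shared memory:

// sketch only: fuses "square each element" with a block sum reduction,
// so the squared intermediate is never written to / re-read from global memory
// (assumes blockDim.x == 256, a power of two)
__global__ void fused_square_block_sum(const float *in, float *block_sums, int n)
{
    __shared__ float s[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    float v = (i < n) ? in[i] : 0.0f;   // one global load
    v = v * v;                          // the "square" kernel's work; result stays in a register
    s[threadIdx.x] = v;                 // intermediate lives in shared memory
    __syncthreads();

    // standard tree reduction within the block
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        block_sums[blockIdx.x] = s[0];  // one global store per block
}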

Let's use an example like yours, where we have an add kernel and a multiply kernel, and assume each kernel works on a vector, and each thread does the following:

  1. Load my element of vector A from global memory
  2. Add a constant to, or multiply by a constant, my vector element
  3. Store my element back out to vector A (in global memory)

Such an operation requires one global read per thread and one global write per thread. If we ran the two kernels back-to-back, the sequence of operations would look like this:

Add kernel:

  1. Load my element of vector A from global memory
  2. Add a value to my vector element
  3. Store my element back out to vector A (in global memory)

Multiply kernel:

  1. Load my element of vector A from global memory
  2. Multiply my vector element by a value
  3. Store my element back out to vector A (in global memory)

We can see that step 3 in the first kernel and step 1 in the second kernel are doing things that aren't really necessary to achieve the final result, but they are necessary due to the design of these (independent) kernels. There is no way for one kernel to pass results to another kernel except via global memory.

But if we combine the two kernels together, we could write a kernel like this:

  1. Load my element of vector A from global memory
  2. Add a value to my vector element
  3. Multiply my vector element by a value
  4. Store my element back out to vector A (in global memory)

This fused kernel does both operations and produces the same result, but instead of 2 global memory loads and 2 global memory stores, it requires only 1 of each.

This saving can be very significant for memory-bound operations (like these) on the GPU. By reducing the number of loads and stores required, overall performance improves, usually in rough proportion to the reduction in load/store operations: here, 4 global memory transactions per element become 2, so the fused kernel can approach twice the throughput of the back-to-back pair.

Here is a trivial code example.
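
A minimal sketch of what such an example could look like, with simple element-wise add and multiply kernels (the kernel names, sizes, and launch configuration below are illustrative assumptions, not the answer's original code):

const int N = 1048576;
const int nTPB = 256;   // threads per block

// unfused: each kernel performs its own global load and store
__global__ void add_kernel(float *d, float val, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += val;                 // load d[i], add, store d[i]
}

__global__ void mul_kernel(float *d, float val, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= val;                 // load d[i] a second time, store a second time
}

// fused: both operations happen between a single load and a single store
__global__ void add_mul_kernel(float *d, float aval, float mval, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = (d[i] + aval) * mval; // one load, one store
}

int main()
{
    float *d_A;
    cudaMalloc(&d_A, N * sizeof(float));
    cudaMemset(d_A, 0, N * sizeof(float));
    int nBlocks = (N + nTPB - 1) / nTPB;

    // unfused sequence: 2 loads + 2 stores per element
    add_kernel<<<nBlocks, nTPB>>>(d_A, 1.0f, N);
    mul_kernel<<<nBlocks, nTPB>>>(d_A, 2.0f, N);

    // fused version of the same add-then-multiply: 1 load + 1 store per element
    add_mul_kernel<<<nBlocks, nTPB>>>(d_A, 1.0f, 2.0f, N);

    cudaDeviceSynchronize();
    return 0;
}

Profiling both versions should show the fused kernel moving roughly half the global memory traffic of the two separate kernels combined.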

Robert Crovella
  • Why don't we always use global memory and fuse all operations together? This is a silly question, but I want to know how that's wrong and what the limits are. For example, when multiple branches merge, or with downsampling, will that mess things up? – moon Apr 26 '22 at 01:12