
Intel's Integrated Performance Primitives (IPP) library has a feature called Deferred Mode Image Processing (DMIP). It lets you specify a sequence of functions; the library composes them and applies the composed function to an array via cache-friendly tiled processing. This gives better performance than naively iterating through the whole array for each function.

It seems like this technique would benefit code running on a GPU as well. There are many GPU libraries available, such as NVIDIA Performance Primitives (NPP), but none seem to have a feature like DMIP. Am I missing something? Or is there a reason that GPU libraries would not benefit from automated function composition?

shoelzer

1 Answer


GPU programming has concepts similar to DMIP-style function composition on the CPU. Although automating it on the GPU is not easy (some third-party libraries may be able to do it), doing it manually is easier than on the CPU (see the Thrust example below).

Two main features of DMIP:

  1. processing the image in fragments so the data fits into cache;
  2. processing different fragments in parallel, or executing independent branches of the graph in parallel.

When a sequence of basic operations is applied to a large image, feature 1 eliminates RAM reads/writes between the basic operations: all intermediate reads/writes stay in cache. Feature 2 lets the work exploit a multi-core CPU.
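To make feature 1 concrete, here is a minimal CPU-side sketch (my own illustration, not the actual IPP/DMIP API) contrasting a naive two-pass implementation with a fused, tiled pass; the tile size and the two operations are made up:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Naive: each operation sweeps the whole image, so intermediate
// results travel through RAM between the two passes.
void naive(std::vector<float>& img) {
    for (float& p : img) p = p * 2.0f;  // pass 1: full sweep
    for (float& p : img) p = p + 1.0f;  // pass 2: full sweep again
}

// Fused + tiled: both operations are applied to each cache-sized
// fragment while it is hot, which is the effect DMIP achieves.
void fused(std::vector<float>& img) {
    const std::size_t tile = 4096;  // fragment small enough for cache
    for (std::size_t i = 0; i < img.size(); i += tile) {
        const std::size_t end = std::min(i + tile, img.size());
        for (std::size_t j = i; j < end; ++j) {
            float p = img[j];
            p = p * 2.0f;  // op 1 while the value is in cache/register
            p = p + 1.0f;  // op 2 reuses it immediately
            img[j] = p;
        }
    }
}
```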

The GPGPU analogue of DMIP feature 1 is kernel fusion. Instead of applying multiple basic-operation kernels to the image data one after another, one can combine the basic operations into a single kernel to avoid repeated GPU global memory reads/writes.

A manual kernel fusion example using Thrust can be found on page 26 of these slides.
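As a minimal sketch of the same idea (the functor name and values are my own, not taken from the slides): two point operations are fused into one Thrust functor, so the composed operation reads and writes global memory once instead of once per operation.

```cpp
#include <thrust/device_vector.h>
#include <thrust/transform.h>

// One functor applies both basic operations while the value sits
// in a register -- this is the fused kernel.
struct scale_then_offset {
    float a, b;
    __host__ __device__ float operator()(float x) const {
        return x * a + b;
    }
};

int main() {
    thrust::device_vector<float> img(1 << 20, 1.0f);
    // Unfused would be two thrust::transform calls (two kernels, two
    // global-memory round trips); fused is a single kernel launch:
    thrust::transform(img.begin(), img.end(), img.begin(),
                      scale_then_offset{2.0f, 1.0f});
    return 0;
}
```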

The ArrayFire library appears to have made notable efforts toward automatic kernel fusion.

The GPGPU analogue of DMIP feature 2 is concurrent kernel execution. This feature increases the bandwidth requirement, and since most GPGPU programs are already bandwidth bound, concurrent kernel execution is unlikely to be used very often.
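For illustration, a minimal sketch of concurrent kernel execution using CUDA streams (the kernel names and sizes are made up): two independent operations on different buffers are launched into separate streams so the hardware may overlap them.

```cpp
#include <cuda_runtime.h>

__global__ void scale(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

__global__ void offset(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Independent branches of the graph go into different streams
    // and may execute concurrently if resources allow.
    scale<<<(n + 255) / 256, 256, 0, s1>>>(a, n);
    offset<<<(n + 255) / 256, 256, 0, s2>>>(b, n);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```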

CPU cache vs. GPGPU shared mem/cache

The CPU cache is what eliminates the RAM reads/writes in DMIP; in a fused GPGPU kernel, registers play the same role. A CPU thread in DMIP processes a small image fragment, whereas a GPGPU thread often processes only one pixel, so a few registers are enough to buffer a GPU thread's data.

In image processing, GPGPU shared memory/cache is used when a result pixel depends on its surrounding pixels. Image smoothing/filtering is a typical example that requires GPGPU shared memory/cache.
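As a rough sketch of that pattern (a 1D 3-point box filter, my own example): each block stages its tile plus a one-element halo in shared memory, so every input value is read from global memory once and then reused by neighboring threads.

```cpp
#include <cuda_runtime.h>

__global__ void smooth3(const float* in, float* out, int n) {
    extern __shared__ float tile[];  // blockDim.x + 2 halo elements
    int g = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    int l = threadIdx.x + 1;                        // local index (halo offset)

    if (g < n) {
        tile[l] = in[g];
        if (threadIdx.x == 0)                 // left halo, clamped at edge
            tile[0] = (g > 0) ? in[g - 1] : in[g];
        if (threadIdx.x == blockDim.x - 1 || g == n - 1)  // right halo
            tile[l + 1] = (g < n - 1) ? in[g + 1] : in[g];
    }
    __syncthreads();

    if (g < n)  // each output reuses its neighbors from shared memory
        out[g] = (tile[l - 1] + tile[l] + tile[l + 1]) / 3.0f;
}

// Launch with dynamic shared memory for the tile plus halo, e.g.:
//   smooth3<<<blocks, threads, (threads + 2) * sizeof(float)>>>(in, out, n);
```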

kangshiyin
  • I'm not sure what you mean by "processing pipeline". I think of DMIP as taking a small portion of the array that will fit in cache, applying all functions, then moving to the next portion of the array. I know that GPUs have memory that is sort of like a cache but is managed manually. It seems like there would still be a benefit to DMIP-style behavior. Wouldn't it reduce bandwidth requirements, as data doesn't have to go back and forth from global GPU memory? – shoelzer Jan 10 '13 at 14:13
  • @shoelzer I may have misused the term pipeline; I've rewritten the answer. I think kernel fusion in GPGPU achieves the DMIP-style behavior you mentioned, but shared mem/cache in GPGPU is a separate thing. – kangshiyin Jan 10 '13 at 16:09
  • Thanks @Eric. Your expanded answer helps. I guess what I was really asking about is what you call automatic kernel fusion. Do you have a link showing what ArrayFire can do? – shoelzer Jan 10 '13 at 19:13
  • @shoelzer The benchmark code in one answer to my [question](http://stackoverflow.com/questions/14211093/how-to-normalize-matrix-columns-in-cuda-with-max-performance) demos ArrayFire. When profiling that code, I believe the kernel `main_kernel` performs multiple operations specified by the C++ code, but I'm not quite sure about that. I also think the benchmark result is biased, so please ignore it. – kangshiyin Jan 10 '13 at 19:28