
In my use case, the global GPU memory holds many chunks of data. Preferably, the number of these could change, but assuming the number and sizes of the chunks to be constant is fine as well. Now, there is a set of functions that take some of the chunks as input and modify some of them. Some of these functions should only start processing once others have completed. In other words, these functions could be drawn in graph form, with the functions being the nodes and the edges being the dependencies between them. The ordering of these tasks is quite weak though.

My question is now the following: What is (on a conceptual level) a good way to implement this in CUDA?

An idea that I had, which could serve as a starting point, is the following: A single kernel is launched. That single kernel creates a grid of blocks, with the blocks corresponding to the functions mentioned above. Inter-block synchronization ensures that blocks only start processing data once their predecessors have completed execution.
I looked up how this could be implemented, but I failed to figure out how inter-block synchronization can be done (if this is possible at all).

Felix Crazzolara
  • What's wrong with using CUDA streams and events? Record an event when an operation is done (`cudaEventRecord`). Make dependent operations wait on all input dependencies (`cudaStreamWaitEvent`). Use independent streams to distribute your work. – Homer512 Jan 07 '22 at 11:30
  • Starting blocks and then waiting is not a good strategy with CUDA. Go with Homer512's comment (and potentially even record and replay the launches with CUDA as a graph), or use Dynamic Parallelism, or join several functions (which process chunks sequentially) in one kernel, or even work with resident blocks. It all depends on how much parallelizable work (how many threads) running one function on one chunk would give. I assume not all chunks are processed by the same functions, or are they? How many functions are typically started after one completes? – Sebastian Jan 07 '22 at 14:14
  • Is this graph fixed and could it be worked into your program, or is it user-defined and could be dense or sparse? Perhaps you can show an example graph and/or give some numbers for typical graphs (complexity of kernels, number of chunks, parallelizability of chunks, and whether the parallelizable threads per chunk function have to cooperate/sync). – Sebastian Jan 07 '22 at 14:15
  • Thank you for your inputs! Here are some numbers to give you a more concrete idea of the problem. The graph has somewhere between 50-500 nodes. A node almost always has fewer than 10 parents/children, and mostly only 1-3. This graph is not fixed. It would readily be possible to merge nodes, for example, or to split them and process them sequentially. All chunks of data can be assumed to have the same size (1000-10000 floats). Within each function/node, no synchronization between threads is required. The functions are fairly simple. – Felix Crazzolara Jan 07 '22 at 14:28
  • Seems like CUDA now has a direct solution for this kind of problem. The Graph API: https://developer.nvidia.com/blog/cuda-graphs/ – Homer512 Jan 07 '22 at 14:30
  • Say that function/node F takes data chunks A,B,C as input and produces D as output. Denoting indices with i, D[i] will be a fairly simple function of A[i],B[i],C[i]. Say some basic arithmetic and maybe a bit of exp/sin/cos, etc. Usually fewer than 5 operations in total. All functions will need to run when processing the graph. I'm interested in the results of the leaf nodes of the tree, so every time the graph is processed, every function must run once. Dependencies between nodes always stay the same. – Felix Crazzolara Jan 07 '22 at 14:32
  • Do all functions process 1 float individually (1 float in -> 1 float out), or does each thread take the whole chunk as input but write 1 float as output? In the first case the most performant solution would be to dynamically generate and compile the CUDA program, after receiving the graph, into one kernel doing the complete processing. Even 1000 to 10000 floats can be stored in shared memory. Do you want a performant solution or just a very flexible one? – Sebastian Jan 07 '22 at 14:40
  • @Sebastian Thanks again! All functions need to process the entire input data chunks. A float D[i] in output data chunk D never depends on a float A[j] of input data chunk A with i != j. I implicitly assumed that within the processing of a function, a single thread would perform all required computations for a given index i. I hope it's clear that under this assumption, threads handling the same function are completely independent of each other. I'd rather prefer a solution which can easily be changed over the fastest one. At the moment, all nodes are processed on the GPU,... – Felix Crazzolara Jan 07 '22 at 14:53
  • but initiated by Python function calls. Also they all happen sequentially. Hence, having a more parallelized solution would already be much of an improvement for my application. – Felix Crazzolara Jan 07 '22 at 14:54

2 Answers


For any of the solutions I would create an array in global memory of 500 node blocks * 10,000 floats (= 20 MB), with each chunk of 10,000 floats stored as one contiguous block. (The number of floats per chunk should preferably be divisible by 32, e.g. padded to 10,016 floats, for memory alignment reasons.)
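
A minimal sketch of that layout (the names, the padding constant, and the helper function are my own assumptions):

```
// Flat layout: NUM_NODES chunks, each padded to a multiple of 32 floats.
constexpr int NUM_NODES    = 500;
constexpr int CHUNK_FLOATS = 10016;   // 10,000 rounded up to a multiple of 32

// Element j of node n's chunk lives at d_chunks[n * CHUNK_FLOATS + j].
__host__ __device__ inline float* chunk(float* d_chunks, int node) {
    return d_chunks + node * CHUNK_FLOATS;
}

float* alloc_chunks() {               // host side, ~20 MB in total
    float* d_chunks = nullptr;
    cudaMalloc(&d_chunks, sizeof(float) * NUM_NODES * CHUNK_FLOATS);
    return d_chunks;
}
```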

Solution 1: Runtime Compilation (sequential, but optimized)

Use Python code to generate a sequential order of the functions according to the graph and emit (by printing the source code into a string) a small program that calls the functions in turn. Each function should read its input from the memory blocks of its predecessors and store its output in its own output block. Python outputs the glue code (as a string) that calls all functions in the correct order.
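
For illustration, the generated string could look roughly like this (the node functions f_2 and f_3, the chunk indices, and the topological order are made up; the whole approach works here because each output element i only depends on input elements with the same index i):

```
// Hypothetical glue code emitted by the Python generator.
__device__ float f_2(float a, float b) { return a + b; }        // node functions would
__device__ float f_3(float a)          { return sinf(a); }      // also be generated

#define CHUNK(n) (d_chunks + (n) * 10016)                        // padded chunk stride

extern "C" __global__ void process_graph(float* d_chunks)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= 10000) return;                                      // floats per chunk

    // One line per node, in the topological order computed on the Python side.
    CHUNK(2)[i] = f_2(CHUNK(0)[i], CHUNK(1)[i]);                 // node 2 reads chunks 0, 1
    CHUNK(3)[i] = f_3(CHUNK(2)[i]);                              // node 3 reads chunk 2
    // ...
}
```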

Use NVRTC (https://docs.nvidia.com/cuda/nvrtc/index.html, https://github.com/NVIDIA/pynvrtc) for runtime compilation and the compiler will optimize a lot.
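
A minimal host-side sketch of that path in C++, assuming the generated source from above and an already-initialized CUDA context (error checking omitted; the kernel name process_graph comes from the sketch above):

```
#include <cuda.h>
#include <nvrtc.h>
#include <string>
#include <vector>

CUfunction compile_graph_kernel(const std::string& src)   // src = generated glue code
{
    // 1. Runtime-compile the generated source to PTX.
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, src.c_str(), "graph.cu", 0, nullptr, nullptr);
    const char* opts[] = { "--gpu-architecture=compute_70" };   // pick your GPU
    nvrtcCompileProgram(prog, 1, opts);

    size_t ptxSize = 0;
    nvrtcGetPTXSize(prog, &ptxSize);
    std::vector<char> ptx(ptxSize);
    nvrtcGetPTX(prog, ptx.data());
    nvrtcDestroyProgram(&prog);

    // 2. Load the PTX with the driver API and look up the kernel by name.
    CUmodule module;
    cuModuleLoadDataEx(&module, ptx.data(), 0, nullptr, nullptr);
    CUfunction kernel;
    cuModuleGetFunction(&kernel, module, "process_graph");
    return kernel;
    // Launch later with cuLaunchKernel(kernel, gridX,1,1, 256,1,1, 0, nullptr, args, nullptr);
}
```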

A further optimization would be to store the intermediate results not in global memory but in local variables (the hardware provides a maximum of 255 registers per thread; the compiler reuses registers for values that are no longer needed and spills the rest to local memory). This of course makes the generated program a bit more complicated. The variables can be freely named, and you can have 500 of them; the compiler will optimize the assignment to registers and their reuse. So have one variable for each node output, e.g. float node352 = f_352(node45, node182, node416);
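
The generated body would then look roughly like this (again using the made-up node functions from the sketch above; only leaf results go back to global memory):

```
extern "C" __global__ void process_graph(const float* __restrict__ in0,
                                         const float* __restrict__ in1,
                                         float* __restrict__ out_leaf)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= 10000) return;

    // Every node result stays in a local variable; the compiler maps them to
    // registers and reuses registers once a value is no longer needed.
    float node0 = in0[i];
    float node1 = in1[i];
    float node2 = f_2(node0, node1);
    float node3 = f_3(node2);
    // ...
    out_leaf[i] = node3;          // only leaf outputs are written to memory
}
```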

Solution 2: Controlled run on device (sequential)

The Python program creates a list with the order in which the functions have to be called. Each individual function knows which memory blocks to read from and which block to write to (either hard-coded, or passed to it in a memory structure).

On the device, a single kernel runs a for loop that walks through the order list sequentially and calls the listed function for each entry.

How to specify which functions to call?

The function pointers in the list can be created on the CPU as shown here: https://leimao.github.io/blog/Pass-Function-Pointers-to-Kernels-CUDA/ (not sure whether this works from Python).

Or, regardless of the host programming language, a separate kernel can build a translation table of device function pointers (an assign_kernel); the list from Python would then contain indices into this table, as sketched below.
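
A sketch of that pattern (the node functions, the table size, and the node_t layout are my own assumptions; it relies on the element-wise dependencies, so no synchronization between threads is needed):

```
// Element-wise node function: up to three inputs, one output.
typedef float (*node_fn)(float, float, float);

__device__ float f_add(float a, float b, float) { return a + b; }
__device__ float f_sin(float a, float, float)   { return sinf(a); }
// ... one __device__ function per node type ...

// Translation table of device function pointers, filled on the device itself.
__device__ node_fn g_table[16];

__global__ void assign_kernel()
{
    g_table[0] = f_add;
    g_table[1] = f_sin;
}

struct node_t {            // produced by the Python side and copied to the GPU
    int fn;                // index into g_table
    int in0, in1, in2;     // input chunk indices (-1 if unused)
    int out;               // output chunk index
};

__global__ void run_graph(const node_t* nodes, int num_nodes,
                          float* d_chunks, int chunk_stride, int chunk_len)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= chunk_len) return;

    // Sequential walk through the topologically sorted node list. Thread i only
    // ever reads values at index i that it wrote itself, so no sync is required.
    for (int n = 0; n < num_nodes; ++n) {
        const node_t nd = nodes[n];
        const float a = d_chunks[nd.in0 * chunk_stride + i];
        const float b = nd.in1 >= 0 ? d_chunks[nd.in1 * chunk_stride + i] : 0.f;
        const float c = nd.in2 >= 0 ? d_chunks[nd.in2 * chunk_stride + i] : 0.f;
        d_chunks[nd.out * chunk_stride + i] = g_table[nd.fn](a, b, c);
    }
}

// Host: assign_kernel<<<1,1>>>(); then
//       run_graph<<<(10000 + 255) / 256, 256>>>(d_nodes, N, d_chunks, 10016, 10000);
```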

Solution 3: Dynamic Parallelism (parallel)

With Dynamic Parallelism kernels themselves start other kernels (grids).

https://developer.nvidia.com/blog/cuda-dynamic-parallelism-api-principles/

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-dynamic-parallelism

There is a maximum nesting depth of 24. The state of a parent grid may have to be swapped out to memory (which could take up to 860 MB per level in the worst case, though probably far less for your program). This could be a limitation.

All this swapping could make the parallel version slower again.

But the advantage would be that nodes can really be run in parallel.
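
A rough level-by-level sketch with classic dynamic parallelism (compile with -rdc=true and link against cudadevrt; node_kernel and the level layout are assumptions). Note that the device-side cudaDeviceSynchronize() used here is deprecated since CUDA 11.6 and removed in CUDA 12, which replaces it with tail launches:

```
__global__ void node_kernel(int node, float* d_chunks)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < 10000) { /* element-wise work of this node: read parent chunks, write own chunk */ }
}

__global__ void scheduler(const int* level_nodes,    // node ids, grouped level by level
                          const int* level_offsets,  // start of each level, num_levels+1 entries
                          int num_levels,
                          float* d_chunks)
{
    if (threadIdx.x != 0 || blockIdx.x != 0) return; // a single scheduler thread

    for (int lvl = 0; lvl < num_levels; ++lvl) {
        // Launch all nodes of this level into their own streams so they can overlap.
        for (int k = level_offsets[lvl]; k < level_offsets[lvl + 1]; ++k) {
            cudaStream_t s;
            cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
            node_kernel<<<(10000 + 255) / 256, 256, 0, s>>>(level_nodes[k], d_chunks);
            cudaStreamDestroy(s);     // the launched grid still runs to completion
        }
        // Wait for the whole level before starting the next one
        // (deprecated device-side sync; see the note above).
        cudaDeviceSynchronize();
    }
}
```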

Solution 4: Use Cuda Streams and Events (parallel)

Each kernel just calls one function. The synchronization and scheduling are done from Python, but the kernels run asynchronously and invoke a callback as soon as they have finished. Kernels that should run in parallel have to be launched on separate streams.
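
In C++ the per-kernel pattern would look roughly like this (node_kernel and the callback are placeholders; the same calls are exposed by the Python CUDA bindings):

```
#include <cuda_runtime.h>
#include <cstdio>

__global__ void node_kernel(int node, float* d_chunks) { /* element-wise work of one node */ }

// Runs on a CUDA-internal CPU thread once everything before it in the stream has finished.
// It must not call CUDA APIs itself; it should only notify the host-side scheduler,
// which then launches the children of this node.
void CUDART_CB node_done(void* userData)
{
    std::printf("node %d finished\n", (int)(size_t)userData);
}

void launch_node(int node, cudaStream_t stream, float* d_chunks)
{
    node_kernel<<<(10000 + 255) / 256, 256, 0, stream>>>(node, d_chunks);
    cudaLaunchHostFunc(stream, node_done, (void*)(size_t)node);   // fires after the kernel
}
```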

Optimization: You can use the CUDA Graph API, with which CUDA records the kernels and their dependencies and can apply additional optimizations when replaying (possibly with different float input data, but the same graph).
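
A sketch of the capture-and-replay variant (the signatures are the CUDA 10/11 ones; schedule_all_nodes stands for the stream/event scheduling code above, and any extra streams it uses must be forked from captureStream via events so that they are captured too):

```
cudaGraph_t     graph;
cudaGraphExec_t graphExec;
cudaStream_t    captureStream;
cudaStreamCreate(&captureStream);

// Record the existing schedule once...
cudaStreamBeginCapture(captureStream, cudaStreamCaptureModeGlobal);
schedule_all_nodes(captureStream);
cudaStreamEndCapture(captureStream, &graph);
cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);

// ...then replay it cheaply for every new batch of input data.
for (int iter = 0; iter < 100; ++iter) {
    cudaGraphLaunch(graphExec, captureStream);
    cudaStreamSynchronize(captureStream);
}
```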

For all methods

You can try different launch configurations, from 32 (or better, 64) threads per block up to 1024 threads per block.

Sebastian

Let's assume that most, or all, of your chunks of data are large, and that you have many distinct functions. If the former does not hold, it's not clear you will even benefit from having them on a GPU in the first place. Let's also assume that the functions are black boxes to you, and that you don't have the ability to identify fine-grained dependencies between individual values in your different buffers with simple, local dependency functions.

Given these assumptions - your workload is basically the typical case of GPU work, which CUDA (and OpenCL) have catered for since their inception.

Traditional plain-vanilla approach

You define multiple streams (queues) of tasks; you schedule kernels on these streams for your various functions; and you schedule event-fires and event-waits corresponding to your functions' inter-dependencies (or the buffer-processing dependencies). The event-waits before kernel launches ensure no buffer is processed until all of its preconditions have been satisfied. Then you have different CPU threads wait on / synchronize with these streams, to get your work going.
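
Sketched concretely in CUDA C++ (node_kernel, the Node struct, and the round-robin stream assignment are placeholders, not a recommendation): one event per node, and one cudaStreamWaitEvent per dependency edge.

```
#include <cuda_runtime.h>
#include <vector>

__global__ void node_kernel(int node, float* d_chunks) { /* work of one function/node */ }

struct Node {                        // host-side description of one function
    std::vector<int> parents;        // indices of the nodes this one depends on
};

void schedule(const std::vector<Node>& nodes,   // assumed to be in topological order
              float* d_chunks, int numStreams = 4)
{
    std::vector<cudaStream_t> streams(numStreams);
    for (auto& s : streams) cudaStreamCreate(&s);

    std::vector<cudaEvent_t> done(nodes.size());
    for (auto& e : done) cudaEventCreateWithFlags(&e, cudaEventDisableTiming);

    for (size_t n = 0; n < nodes.size(); ++n) {
        cudaStream_t s = streams[n % numStreams];   // naive work distribution

        for (int p : nodes[n].parents)              // one wait per dependency edge
            cudaStreamWaitEvent(s, done[p], 0);

        node_kernel<<<(10000 + 255) / 256, 256, 0, s>>>((int)n, d_chunks);

        cudaEventRecord(done[n], s);                // children will wait on this
    }
    for (auto& s : streams) cudaStreamSynchronize(s);
}
```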

Now, as far as the CUDA APIs go - this is bread-and-butter stuff. If you've read the CUDA Programming Guide, or at least the basic sections of it, you know how to do this. You could avail yourself of convenience libraries, like my API wrapper library, or if your workload fits, a higher-level offering such as NVIDIA Thrust might be more appropriate.

The multi-threaded synchronization is a bit less trivial, but this still isn't rocket-science. What is tricky and delicate is choosing how many streams to use and what work to schedule on what stream.

Using CUDA task graphs

With CUDA 10.x, NVIDIA added API functions for explicitly creating task graphs, with kernels and memory copies as nodes and dependencies as edges; and when you've completed the graph-construction API calls, you "schedule the task graph", so to speak, on any stream, and the CUDA runtime essentially takes care of what I've described above, automagically.

For an elaboration on how to do this, please read:

Getting Started with CUDA Graphs

on the NVIDIA developer blog. Or, for a deeper treatment - there's actually a section about them in the programming guide, and a small sample app using them, simpleCudaGraphs.
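
For illustration, explicit construction of a two-node graph in which kernel_B consumes what kernel_A produces could look like this (kernel names, buffers, and launch dimensions are placeholders; the 5-argument cudaGraphInstantiate is the CUDA 10/11 signature):

```
__global__ void kernel_A(const float* a, const float* b, float* c, int n);
__global__ void kernel_B(const float* c, const float* d, float* e, int n);

cudaGraphExec_t build_graph(float* d_a, float* d_b, float* d_c,
                            float* d_d, float* d_e, int n)
{
    cudaGraph_t graph;
    cudaGraphCreate(&graph, 0);

    // Node A: c = f(a, b)
    void* aArgs[] = { &d_a, &d_b, &d_c, &n };
    cudaKernelNodeParams aParams = {};
    aParams.func         = (void*)kernel_A;
    aParams.gridDim      = dim3((n + 255) / 256);
    aParams.blockDim     = dim3(256);
    aParams.kernelParams = aArgs;
    cudaGraphNode_t nodeA;
    cudaGraphAddKernelNode(&nodeA, graph, nullptr, 0, &aParams);

    // Node B: e = g(c, d); its dependency on node A is expressed as a graph edge.
    void* bArgs[] = { &d_c, &d_d, &d_e, &n };
    cudaKernelNodeParams bParams = {};
    bParams.func         = (void*)kernel_B;
    bParams.gridDim      = dim3((n + 255) / 256);
    bParams.blockDim     = dim3(256);
    bParams.kernelParams = bArgs;
    cudaGraphNode_t nodeB;
    cudaGraphAddKernelNode(&nodeB, graph, &nodeA, 1, &bParams);

    // Instantiate once; launch many times with cudaGraphLaunch(exec, stream).
    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);
    return exec;
}
```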

White-box functions

If you actually do know a lot about your functions, then perhaps you can create larger GPU kernels which perform some dependent processing, by keeping parts of intermediate results in registers or in block shared memory, and continuing with the part of a subsequent function applied to those local results. For example, if your first kernel does c[i] = a[i] + b[i] and your second kernel does e[i] = c[i] * d[i], you could instead write a kernel which performs the second action after the first, with inputs a, b, d (no need to write out c). Unfortunately I can't be less vague here, since your question was somewhat vague.
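
A trivial fused version of that example (element-wise, so c never leaves a register):

```
// Fuses c[i] = a[i] + b[i] followed by e[i] = c[i] * d[i] into one kernel;
// the intermediate value c stays in a register and is never written to memory.
__global__ void fused_add_mul(const float* __restrict__ a,
                              const float* __restrict__ b,
                              const float* __restrict__ d,
                              float* __restrict__ e,
                              int n)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    const float c = a[i] + b[i];   // first "kernel"
    e[i] = c * d[i];               // second "kernel"
}
```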

einpoklum