How bad is it to launch many small kernels in CUDA?

Question

I have a grid of rectangles. Each of these rectangles consists of a rectangular grid of points. All points inside the rectangle can be treated by exactly the same instruction sequence in a kernel. I will be able to launch a kernel with 10000s of points to handle, where each thread would handle about 10-50 points. The points on the edges and on the corners of the rectangles however will lead to a large set of different instruction sequences.

From a design point of view, it would be easier to launch a kernel for each set of points with the same instruction sequence. This would mean that some kernel launches would only treat very few points, probably less than 10.

So I would have maybe 4 kernel launches with 10000s of points to handle (10-50 points for each thread), and maybe 30-100 kernel launches with only a few points each (1 point per thread normally).

I have absolutely no idea whether this is acceptable or whether it will completely destroy my performance. I would be glad if you could give me a rough estimate or at least some hints, what to consider to get an estimate.

Jez · Accepted Answer · 2014-11-20T11:57:03.167

There are two factors here, which I'll call Launch overhead and Execution overhead.

Launch overhead: The overhead of launching a kernel is ~10 µs (i.e. 0.01 ms). It might be a bit less, it might be a bit more, and it will depend on your system as a whole as well as the kernel in question. This value assumes you're not running on Windows as a graphics card (i.e. no WDDM).

This launch overhead can be completely hidden if you have a large non-blocking GPU call before the launch. One way to think of it is that you have a queue of tasks ready to be executed on the GPU, and you can add to that queue while something is being executed. The launch overhead is the cost of adding to the queue. As long as the queue has something in it, you won't see launch overheads starving the GPU.
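To get a feel for this on your own system, a minimal sketch along the following lines could help. It is not a rigorous benchmark: `tinyKernel` and the launch count are hypothetical choices made for illustration, and the numbers it prints will depend on the GPU, driver and operating system.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical empty kernel: it does no work, so the timing is dominated by overhead.
__global__ void tinyKernel() {}

int main() {
    const int launches = 1000;  // assumption: enough repetitions to average out noise

    // Warm-up launch, since the first launch pays one-time setup costs.
    tinyKernel<<<1, 1>>>();
    cudaDeviceSynchronize();

    // Case 1: synchronize after every launch, so the queue is always empty
    // and the host pays the full launch + execution cost each time.
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < launches; ++i) {
        tinyKernel<<<1, 1>>>();
        cudaDeviceSynchronize();
    }
    auto t1 = std::chrono::steady_clock::now();

    // Case 2: queue all launches back to back and synchronize once at the end,
    // so the host keeps adding to the queue while the GPU is working.
    for (int i = 0; i < launches; ++i) {
        tinyKernel<<<1, 1>>>();
    }
    cudaDeviceSynchronize();
    auto t2 = std::chrono::steady_clock::now();

    double syncUs = std::chrono::duration<double, std::micro>(t1 - t0).count() / launches;
    double queuedUs = std::chrono::duration<double, std::micro>(t2 - t1).count() / launches;
    printf("per launch, sync after each: %.2f us\n", syncUs);
    printf("per launch, queued:          %.2f us\n", queuedUs);
    return 0;
}
```

The gap between the two figures gives a rough idea of how much of the per-kernel cost can be hidden by keeping the queue non-empty.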

Execution overhead: Once the kernel reaches the front of this queue it is executed. There's a small overhead here as well. I would expect this to be ~3-4 µs, though again, your mileage may vary. This is associated with initialization and moving data from global memory to get the kernel going. It also includes shutdown costs.

This execution overhead can be reduced by using streams. If you place your small kernels in a separate stream to a larger kernel, and have them execute concurrently, this execution overhead can be hidden by other computation on the GPU. You won't have the whole GPU waiting for a tiny problem to pass through it, instead only a small amount of resource will be waiting while the rest of the GPU continues to work on your main problem.
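The stream idea might look roughly like the sketch below. The kernel names (`interiorKernel`, `edgeKernel`), the sizes and the launch configurations are all assumptions made for illustration; they are not taken from the question. Whether the kernels really overlap depends on the device and on how many resources the large kernel leaves free.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical "big" kernel: each thread handles many points via a grid-stride loop.
__global__ void interiorKernel(float* data, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        data[i] = data[i] * 2.0f + 1.0f;
    }
}

// Hypothetical "small" kernel: one point per thread, only a handful of points.
__global__ void edgeKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = -data[i];
}

int main() {
    const int nInterior = 1 << 20;   // assumed size of the interior point set
    const int nEdge = 8;             // assumed points per small kernel
    const int numEdgeKernels = 32;   // assumed number of small launches

    float *interior, *edges;
    cudaMalloc(&interior, nInterior * sizeof(float));
    cudaMalloc(&edges, numEdgeKernels * nEdge * sizeof(float));
    cudaMemset(interior, 0, nInterior * sizeof(float));
    cudaMemset(edges, 0, numEdgeKernels * nEdge * sizeof(float));

    cudaStream_t bigStream, smallStream;
    cudaStreamCreate(&bigStream);
    cudaStreamCreate(&smallStream);

    // The large kernel goes into its own stream...
    interiorKernel<<<64, 256, 0, bigStream>>>(interior, nInterior);

    // ...and the small kernels are queued in a different stream, so they are
    // allowed to run while the large kernel is still executing.
    for (int k = 0; k < numEdgeKernels; ++k) {
        edgeKernel<<<1, 32, 0, smallStream>>>(edges + k * nEdge, nEdge);
    }

    cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaStreamDestroy(bigStream);
    cudaStreamDestroy(smallStream);
    cudaFree(interior);
    cudaFree(edges);
    return 0;
}
```

Kernels within one stream still run in order, so the small kernels here are serialized with respect to each other. Spreading them over several streams is a possible variation if they are independent.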

Thanks for this great answer! But does it also hold if a kernel launch consists of only one or very few threads? — Michael, Nov 21 '14 at 12:27
Yes. The cost of each will vary based on launch parameters, such as number of threads, but not by much. There are lots of other factors involved in launching a kernel which don't depend on number of threads, or can be done in parallel across threads. The above values are based on observed values for very small kernels, and I'd expect you to see similar. — Jez, Nov 21 '14 at 13:07
So, long story short: As long as you keep the device busy with big tasks, it won't cost you much to invoke small kernel launches in parallel. — Michael, Nov 21 '14 at 13:19
"This value assumes you're not running on Windows as a graphics card (ie. no WDDM)." - What if it's the case? — Serge Rogatch, Jun 28 '18 at 19:09

score 5 · Answer 2 · edited May 06 '22 at 13:34

Perhaps this should be an extended comment instead of an answer, but I hope it gives you some orientation anyway.

The performance limitation of launching many small kernels instead of one big kernel is due to the kernel launch overhead. This answer explains a bit about it, and also links to interesting resources.

But there are other ways to perform the task. Assuming you have that big grid of rectangles in your system (RAM) memory, you have to transfer it somehow to the GPU memory. That gives you the chance to hide the time of the small memory transfers by overlapping them with kernel execution, namely with asynchronous transfers. This approach is effective only if your kernel takes long enough to complete the calculation of a rectangle.
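A minimal sketch of such an overlap, assuming the data can be split into independent chunks, is shown here. The kernel `processChunk`, the chunk count and the chunk size are hypothetical. The host buffer is allocated with `cudaMallocHost` because asynchronous copies generally need page-locked (pinned) host memory to overlap with kernels.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical per-chunk kernel standing in for the work on one rectangle.
__global__ void processChunk(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

int main() {
    const int numChunks = 4;          // assumed number of independent chunks
    const int chunkSize = 1 << 18;    // assumed points per chunk
    const size_t chunkBytes = chunkSize * sizeof(float);

    // Pinned host memory, so that cudaMemcpyAsync can run asynchronously.
    float* hostData;
    cudaMallocHost(&hostData, numChunks * chunkBytes);
    for (int i = 0; i < numChunks * chunkSize; ++i) hostData[i] = 0.0f;

    float* devData;
    cudaMalloc(&devData, numChunks * chunkBytes);

    cudaStream_t streams[numChunks];
    for (int c = 0; c < numChunks; ++c) cudaStreamCreate(&streams[c]);

    // Each chunk gets its own stream: copy in, compute, copy out.
    // Work in different streams may overlap, so the copy for one chunk
    // can proceed while the kernel for another chunk is running.
    for (int c = 0; c < numChunks; ++c) {
        float* h = hostData + c * chunkSize;
        float* d = devData + c * chunkSize;
        cudaMemcpyAsync(d, h, chunkBytes, cudaMemcpyHostToDevice, streams[c]);
        processChunk<<<(chunkSize + 255) / 256, 256, 0, streams[c]>>>(d, chunkSize);
        cudaMemcpyAsync(h, d, chunkBytes, cudaMemcpyDeviceToHost, streams[c]);
    }

    cudaDeviceSynchronize();
    printf("status: %s, hostData[0] = %f\n",
           cudaGetErrorString(cudaGetLastError()), hostData[0]);

    for (int c = 0; c < numChunks; ++c) cudaStreamDestroy(streams[c]);
    cudaFree(devData);
    cudaFreeHost(hostData);
    return 0;
}
```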

If your whole grid fits in your GPU main memory at once, then you can launch multiple child kernels from a master kernel. Here you can find more about the topic (Dynamic parallelism) and here is another interesting question about the slow-down of the approach. This approach may not yield any performance gain, since it also takes some time to launch those kernels, but it is an alternative to your proposal and keeps things simple by hiding some complexity from your main code.
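As a rough illustration of the master/child idea, the sketch below launches one child kernel per rectangle from the device. The names `masterKernel` and `childKernel` and all sizes are assumptions. Dynamic parallelism needs a device of compute capability 3.5 or higher and relocatable device code (compile with `nvcc -rdc=true`; some toolkit versions may also need `-lcudadevrt`).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical child kernel: handles the points of one rectangle.
__global__ void childKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Hypothetical master kernel: one thread per rectangle, each launching a child grid.
__global__ void masterKernel(float* data, int pointsPerRect, int numRects) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r < numRects) {
        int blocks = (pointsPerRect + 127) / 128;
        childKernel<<<blocks, 128>>>(data + r * pointsPerRect, pointsPerRect);
    }
}

int main() {
    const int numRects = 16;          // assumed number of rectangles
    const int pointsPerRect = 1024;   // assumed points per rectangle

    float* data;
    cudaMalloc(&data, numRects * pointsPerRect * sizeof(float));
    cudaMemset(data, 0, numRects * pointsPerRect * sizeof(float));

    // A single host-side launch; the remaining launches happen on the device.
    masterKernel<<<1, numRects>>>(data, pointsPerRect, numRects);

    // The parent grid is not considered complete until its child grids
    // have finished, so this waits for all of the work.
    cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(data);
    return 0;
}
```

Device-side launches have their own overhead, so it is worth measuring this against plain host-side launches before committing to it.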

As general advice, prefer a few big data transfers over a large number of smaller ones, and the same applies to kernels, in order to minimise the overhead.
