Maximum number of concurrent kernels & virtual code architecture

Question

Maximum number of resident grids per device (Concurrent Kernel Execution)

and for each compute capability it says a number of concurrent kernels, which I assume to be the maximum number of concurrent kernels.

I am getting a GTX 1060 delivered, which according to this NVIDIA CUDA resource has a compute capability of 6.1. From what I have learned about CUDA so far, you can specify the virtual compute capability of your code at compile time in NVCC with the flag -arch=compute_XX.

So will my GPU be hardware-constrained to 32 concurrent kernels, or is it capable of 128 with the -arch=compute_60 flag?

Robert Crovella · Accepted Answer · 2016-12-11T23:02:54.397

3

According to Table 13 in the NVIDIA CUDA programming guide, compute capability 6.1 devices have a maximum of 32 resident grids, i.e. 32 concurrent kernels.

Even if you use the -arch=compute_60 flag, you will be limited to the hardware limit of 32 concurrent kernels. Choosing particular architectures to compile for does not allow you to exceed the hardware limits of the machine.
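As a rough illustration of the point above, here is a minimal host-only C++ sketch (it needs no GPU, and should also compile under nvcc). The function names `maxResidentGrids` and `effectiveKernelLimit` are hypothetical helpers, not part of any CUDA API, and the table contains only the two values quoted in this thread (128 for compute capability 6.0, 32 for 6.1). The compile target is deliberately ignored, since only the physical device's compute capability determines the limit.

```cpp
// Hypothetical lookup: maximum resident grids per device, keyed by the
// device's real compute capability. Only the two values mentioned in this
// thread are included; anything else returns -1 (not covered here).
constexpr int maxResidentGrids(int major, int minor) {
    return (major == 6 && minor == 0) ? 128
         : (major == 6 && minor == 1) ? 32
         : -1;
}

// Hypothetical helper: the virtual architecture passed to -arch=compute_XX
// is accepted but unused, because it does not change the hardware limit.
constexpr int effectiveKernelLimit(int deviceMajor, int deviceMinor,
                                   int /*compiledArchMajor*/,
                                   int /*compiledArchMinor*/) {
    return maxResidentGrids(deviceMajor, deviceMinor);
}
```

For example, `effectiveKernelLimit(6, 1, 6, 0)` models a GTX 1060 running code built with -arch=compute_60 and still yields 32.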

edited Dec 11 '16 at 23:02

answered Dec 11 '16 at 22:55

Robert Crovella


1

On a side note, 32 concurrent kernels are plenty, and it is essentially impossible to ever hit this limit. As the GTX 1060 has either 9 or 10 SMs, even in the extreme case where you launch a long series of single-block kernels of which 3 fit onto an SM, you still only reach 30 concurrent kernels. – tera Dec 11 '16 at 23:58
@tera, does the maximum concurrent kernel limit apply the same when using dynamic parallelism and nested kernels? – user2255757 Dec 12 '16 at 14:11
@tera I'm wondering why only 3 kernels can fit onto 1 SM. Are there any documents for this? – zingdle Oct 06 '19 at 12:43
3 blocks per SM is not a hard limit, but a typical value. Use the [occupancy calculator spreadsheet](https://docs.nvidia.com/cuda/cuda-occupancy-calculator/index.html) or the [occupancy calculator API](https://devblogs.nvidia.com/cuda-pro-tip-occupancy-api-simplifies-launch-configuration/) to find out how many blocks of a given kernel can fit onto one SM at the same time. – tera Oct 07 '19 at 09:10
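The arithmetic in tera's comments can be sketched as below, in host-only C++ that needs no GPU. The function name `concurrentSingleBlockKernelBound` is a hypothetical helper, and the model is an assumption: every kernel consists of a single block, so the number of kernels running at once is bounded by the total number of block slots across all SMs, capped by the hardware limit on resident grids.

```cpp
// Hypothetical helper: upper bound on concurrently running single-block
// kernels. numSMs * blocksPerSM is the number of block slots on the device;
// hwLimit is the hardware cap on resident grids (32 for compute capability 6.1).
constexpr int concurrentSingleBlockKernelBound(int numSMs, int blocksPerSM,
                                               int hwLimit) {
    return (numSMs * blocksPerSM < hwLimit) ? numSMs * blocksPerSM : hwLimit;
}
```

With 3 blocks per SM, `concurrentSingleBlockKernelBound(9, 3, 32)` gives 27 and `concurrentSingleBlockKernelBound(10, 3, 32)` gives 30, both below the limit of 32. Since 3 blocks per SM is a typical value rather than a hard limit, a kernel that fits 4 blocks per SM on a 10-SM device would give 40 slots, and the bound would then be the hardware cap of 32.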

yanbc · Answer 2 · 2023-04-20T02:01:42.993

1

Adding to the accepted answer: as of April 2023, with CUDA 12.1 being the latest version, this is Table 15 in the NVIDIA CUDA C Programming Guide. Alternatively, you can search for "Technical Specifications per Compute Capability" in the docs.

edited Apr 20 '23 at 02:01

answered Apr 20 '23 at 02:00

yanbc

