What factors affect the overhead of dynamic parallelism kernel launches?

Question

When you launch a secondary kernel from within a primary one on a GPU, there is some overhead. Which factors contribute to, or otherwise affect, the amount of this overhead? For example: the size of the kernel code, the occupancy of the SM where the kernel is being launched, the size of the kernel's arguments, etc.

For the sake of this question, let's be inclusive and define "overhead" as the sum of the following time intervals:

Start: An SM sees the launch instruction
End: An SM starts executing an instruction of the sub-kernel

plus

Start: The last instruction of the sub-kernel is executed on any SM (or perhaps the last write by a sub-kernel instruction is committed to the relevant memory space)
End: Execution of the next instruction of the parent after the sub-kernel launch.
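
To make the first interval concrete, here is a minimal sketch of how one might measure it. Nothing in it comes from actual measurements: the kernel names, the block counts, and the 128-thread child block size are all hypothetical choices for illustration. It reads the PTX `%globaltimer` register, whose update granularity is device-dependent and may be coarse, so treat the numbers as rough. It assumes a device of compute capability 5.0 or higher (for the 64-bit `atomicMin`). It does not cover the second interval, since newer CUDA toolkits no longer allow a parent to wait on its child with a device-side `cudaDeviceSynchronize()`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Read the GPU-wide nanosecond timer (shared by all SMs, unlike clock64()).
__device__ unsigned long long global_timer_ns() {
    unsigned long long t;
    asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(t));
    return t;
}

// Child: one thread per block records its start time; the minimum is kept,
// which approximates "an SM starts executing an instruction of the sub-kernel".
__global__ void child(unsigned long long *t_first_start) {
    if (threadIdx.x == 0)
        atomicMin(t_first_start, global_timer_ns());
}

// Parent: a single thread timestamps the launch point, then launches the child.
__global__ void parent(unsigned long long *t_launch,
                       unsigned long long *t_first_start,
                       int child_blocks, int child_threads) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        *t_launch = global_timer_ns();
        child<<<child_blocks, child_threads>>>(t_first_start);
    }
}

int main() {
    // d_t[0] = time just before the launch, d_t[1] = earliest child start.
    unsigned long long *d_t;
    if (cudaMalloc(&d_t, 2 * sizeof(unsigned long long)) != cudaSuccess)
        return 1;

    // Hypothetical sweep over one candidate factor; the first entry is a warm-up.
    const int block_counts[] = {1, 1, 64, 4096};
    for (int blocks : block_counts) {
        // 0xFF in every byte gives the maximum value, so atomicMin works.
        cudaMemset(d_t, 0xFF, 2 * sizeof(unsigned long long));
        parent<<<1, 32>>>(d_t, d_t + 1, blocks, 128);

        // The parent grid only completes once its child grid has completed.
        cudaError_t err = cudaDeviceSynchronize();
        if (err != cudaSuccess) {
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
            return 1;
        }

        unsigned long long h_t[2];
        cudaMemcpy(h_t, d_t, sizeof(h_t), cudaMemcpyDeviceToHost);
        printf("child blocks = %4d: launch to first child instruction ~ %llu ns\n",
               blocks, h_t[1] - h_t[0]);
    }

    cudaFree(d_t);
    return 0;
}
```

Dynamic parallelism needs relocatable device code, so something like `nvcc -rdc=true -arch=sm_70 cdp_overhead.cu -lcudadevrt` should build it (the file name and architecture are placeholders). The same skeleton could be rerun while varying other suspected factors, such as the parent's register usage or the size of the child's arguments.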

Does the "overhead" you mentioned include only the period from the launch call to when the child kernel starts? – xhg, Mar 21 '17 at 04:57

score 1 · Answer 1 · answered Mar 22 '17 at 20:25

This answer is not based on experiments or on knowledge of the device-side runtime implementation; it is rather a thought on what needs to be done to perform the operation.

I assume the grid configuration and register usage of the launching kernel have some effect, as its state needs to be saved somewhere for the SM to move on to another kernel. The number of blocks launched may also have some impact, as I don't see how the device runtime could handle all configurations at the same cost. On the other hand, I don't see why the callee's register usage or code size would have a large impact.

Again, no test/experiment is here to prove any of the above.

Florent DUGUET

I would tend to disagree - again, not based on empirical data - with at least two suggested factors. Number of blocks: The device would just keep that as a number somewhere, and dispatch more blocks as needed; there's no pre-allocation of anything for all the blocks. Number of registers: There's no initialization of register values before the launch of a new kernel, so I don't see why that would be an issue. Of course, I could be wrong, but I do actually want some hard facts on this... – einpoklum Mar 22 '17 at 20:31
@einpoklum: Re the number of blocks and registers: The state of the running kernel needs to be saved somewhere so that execution can continue once the callee kernel is done running. Saving this state requires resources that depend on the number of registers whose values must be restored, and this applies to each block. Maybe there is some secret way of storing register values that does not rely on main memory; if so, it would be great to have access to it... – Florent DUGUET May 02 '17 at 12:06
