When you launch a secondary kernel from within a primary one on a GPU, there's some overhead. What factors contribute to, or otherwise affect, the amount of this overhead? For example: the size of the kernel code, the occupancy of the SM from which the kernel is being launched, the size of the kernel's arguments, etc.
For the sake of this question, let's be inclusive and define "overhead" as the sum of the following two time intervals:
Start: An SM sees the launch instruction
End: An SM starts executing an instruction of the sub-kernel
plus
Start: The last SM still running the sub-kernel executes its final sub-kernel instruction (or perhaps the last write by a sub-kernel instruction is committed to the relevant memory space)
End: Execution of the next instruction of the parent after the sub-kernel launch.
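To make the first interval concrete, here is a rough sketch of how I imagine it could be measured. It assumes a device that supports dynamic parallelism and a build with relocatable device code (something like `nvcc -rdc=true -lcudadevrt`). The names `parent`, `child` and `global_timer_ns` are made up for illustration. It reads the `%globaltimer` PTX register because `clock64()` values are per-SM and not comparable across SMs; the timer's resolution may be coarse, the first launch likely includes one-time setup costs, and the second interval is not covered here at all.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: read the GPU-wide nanosecond timer via PTX.
// Unlike clock64(), this value is comparable across SMs.
__device__ unsigned long long global_timer_ns() {
    unsigned long long t;
    asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(t));
    return t;
}

// Sub-kernel: records a timestamp as early as it can.
__global__ void child(unsigned long long *child_start) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        *child_start = global_timer_ns();
    }
}

// Parent kernel: records a timestamp right before the device-side launch.
__global__ void parent(unsigned long long *launch_seen,
                       unsigned long long *child_start) {
    *launch_seen = global_timer_ns();
    child<<<1, 1>>>(child_start);
}

int main() {
    unsigned long long *t = nullptr;
    // Managed memory so the host can read both timestamps afterwards.
    cudaMallocManaged(&t, 2 * sizeof(*t));
    t[0] = 0;
    t[1] = 0;

    parent<<<1, 1>>>(t, t + 1);
    // The host-side sync waits for the parent and its child grid.
    cudaDeviceSynchronize();

    printf("launch seen -> first child instruction: %llu ns\n", t[1] - t[0]);
    cudaFree(t);
    return 0;
}
```

Varying the grid size, the argument size, or the amount of code in `child` across runs would be one way to probe which of the factors listed above actually matter.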