If I am launching multiple CUDA kernels in the same context, and there are dependencies between the kernels (the output of the first one is an input to the second one, etc.), does control go back to the host after each kernel finishes its execution? If not, can you please briefly describe how the "kernel enqueue" mechanism works on CUDA cards?
Yes, it does. Unless you call kernels asynchronously (with CUDA streams), it will launch the first kernel, wait until it is finished, and then launch the second, etc. I am not sure what you mean by "control goes back to host", since the host always has control (as far as I understand; I am not an expert). – Mikhail Genkin Feb 18 '15 at 00:44
1 Answer
http://on-demand.gputechconf.com/gtc-express/2011/presentations/StreamsAndConcurrencyWebinar.pdf
Look at slides 9 and 10.
With audio: https://developer.nvidia.com/gpu-computing-webinars
Look for "CUDA Concurrency & Streams".

Christian Sarofeen
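
To make the enqueue behaviour concrete, here is a minimal sketch (my illustration, not taken from the linked slides; the produce/consume kernels and sizes are made up). Both launches return to the host immediately, they are only enqueued; because they are issued into the same (default) stream, the GPU runs them in order, so the second kernel sees the output of the first without the host doing anything in between.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void produce(float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = static_cast<float>(i);      // first kernel writes data
    }

    __global__ void consume(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] * 2.0f;               // second kernel reads it
    }

    int main() {
        const int n = 1 << 20;
        float *a, *b;
        cudaMalloc(&a, n * sizeof(float));
        cudaMalloc(&b, n * sizeof(float));

        dim3 block(256), grid((n + 255) / 256);

        // Both launches return to the host immediately; they are only enqueued.
        // Because they go to the same (default) stream, the GPU guarantees that
        // consume does not start until produce has finished.
        produce<<<grid, block>>>(a, n);
        consume<<<grid, block>>>(a, b, n);

        // The host only blocks here, when it explicitly synchronizes.
        cudaDeviceSynchronize();

        float result;
        cudaMemcpy(&result, b + 10, sizeof(float), cudaMemcpyDeviceToHost);
        printf("b[10] = %f\n", result);   // expected: 20.0

        cudaFree(a);
        cudaFree(b);
        return 0;
    }

If the kernels were independent, you could instead create separate streams with cudaStreamCreate and pass them as the fourth launch parameter, which is what the linked webinar covers for overlapping work.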