Is there any way to fire an event (for benchmarking purposes, similar to cudaEvents in the CPU code) from a device kernel in CUDA?
E.g. suppose I would like to measure the time passed from kernel start to the first thread ever that starts a computation and the time passed from the last thread that leaves the computation to the CPU return.
Can I do that?