I have a function that looks like this:
int doSomething() {
<C++ host code>
<CUDA device code>
<C++ host code>
<...>
}
I would like to measure the running time of this function with high precision (at least millisecond resolution) on both Linux and Windows.
I know how to measure the running time of CUDA code with events, and I have found very accurate libraries for measuring the CPU time used by my process, but I want to measure the overall (wall-clock) running time. I can't measure the two times separately and add them together, because device code and host code can run in parallel.
I want to use as few external libraries as possible, but I am interested in any good solution.
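To make the goal concrete, here is a minimal sketch of the kind of wall-clock measurement I mean, using only the standard `std::chrono::steady_clock`. The helper name `wallClockMs` is hypothetical, and the body of `doSomething` is a stand-in that just sleeps. With the real function, I assume a `cudaDeviceSynchronize()` would be needed before the clock is stopped, since kernel launches return before the device work finishes.

```cpp
#include <chrono>
#include <thread>

// Hypothetical stand-in for the real function: the sleep simulates the
// combined host and device work. In the real code, the device would have to
// be synchronized (e.g. cudaDeviceSynchronize()) before returning, otherwise
// pending kernels would not be included in the measurement.
int doSomething() {
    std::this_thread::sleep_for(std::chrono::milliseconds(20));
    return 0;
}

// Hypothetical helper: runs f once and returns the elapsed wall-clock time
// in milliseconds. steady_clock is monotonic, so the result is not affected
// by system clock adjustments.
template <typename F>
double wallClockMs(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Example use:
//   double ms = wallClockMs([] { doSomething(); });
```

Because the clock is read once before and once after the whole call, overlapping host and device work is counted only once, which is what summing separate CPU and GPU timings cannot give me. The actual resolution of `steady_clock` depends on the platform and standard library.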