I am getting a lot of profiling overhead when trying to profile my code using nvvp (or with nvprof):
Overall time is 98 ms and I'm getting 85 ms of "Instrumentation" in the first kernel launch.
How can I reduce this profiling overhead, or otherwise zoom in on just the part that I'm interested in?
Background
I am running this with "Start execution with profiling enabled" unchecked, and I've limited the profiling using cudaProfilerStart/cudaProfilerStop like so:
/* --- generate data etc --- */
// Call the function once to warm up the FFT plan cache
applyConvolution( T, N, stride, plans, yData, phiW, fData, y_dwt );
gpuErrchk( cudaDeviceSynchronize() );
// Call it once for profiling
cudaProfilerStart();
applyConvolution( T, N, stride, plans, yData, phiW, fData, y_dwt );
gpuErrchk( cudaDeviceSynchronize() );
cudaProfilerStop();
where applyConvolution() is the function that I'm profiling.
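For the nvprof runs, the command-line counterpart of unchecking "Start execution with profiling enabled" should be the --profile-from-start off option, so that collection only begins at cudaProfilerStart(). A sketch of the invocation, where ./my_app is just a placeholder for the actual binary name:

```
# ./my_app is a hypothetical executable name; substitute the real binary.
# --profile-from-start off defers collection until cudaProfilerStart() is called.
nvprof --profile-from-start off ./my_app
```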
I am using CUDA Toolkit 8.0 on Ubuntu 16.04 with a GTX 1080.