I wrote a shared library in CUDA C that I wrapped with Cython and that is called from within a larger Python project. I would like to get some information on what is going on in the GPU while the shared library runs, such as achieved occupancy, memory throughput, etc.
What I have in mind is either to start and stop profiling from within the CUDA or Python code, or to start some continuous GPU monitor (similar to top, for instance) before running the code.
It seems that I cannot use nvprof or nvvp for this.
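For the first option, here is a minimal sketch of what I am imagining on the CUDA side, using the profiler control API from `cuda_profiler_api.h` (`cudaProfilerStart`/`cudaProfilerStop`) to limit capture to the region of interest; the kernel and the function name `run_profiled_section` are just placeholders for my library code:

```cuda
#include <cuda_runtime.h>
#include <cuda_profiler_api.h>

// Placeholder kernel standing in for the real work in my library.
__global__ void dummy_kernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

// Exposed to Cython; only the code between the profiler calls
// should be captured when the profiler is launched with
// data collection initially disabled.
extern "C" void run_profiled_section(float *d_data, int n)
{
    cudaProfilerStart();                              // begin capture
    dummy_kernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();
    cudaProfilerStop();                               // end capture
}
```

My understanding (which may be wrong) is that this only marks the capture range and still needs an external profiler attached, e.g. something like `nsys profile --capture-range=cudaProfilerApi python my_script.py` or `ncu --profile-from-start off python my_script.py`. For the second option, I am aware of `nvidia-smi dmon` as a top-like monitor, but I am not sure it reports metrics such as achieved occupancy.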