I've written a small PyTorch program which just does one 2D convolution on the GPU.
Now I would like to observe the kernel calls and their runtimes with nvprof. Is this possible, and if so, how?
I am used to running nvprof on my combined C++/CUDA program, but not on a Python script.
Can this be done?
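
For reference, here is a minimal sketch of the kind of script I mean. The function name `run_conv2d`, the layer sizes, the input shape, and the CPU fallback are all placeholders I picked for illustration, not my actual program.

```python
import torch
import torch.nn as nn


def run_conv2d():
    """Run a single 2D convolution and return the output tensor."""
    # Assumption: fall back to the CPU when no GPU is visible, so the sketch still runs.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Hypothetical layer and input sizes, chosen only for illustration.
    conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1).to(device)
    x = torch.randn(1, 3, 32, 32, device=device)

    with torch.no_grad():
        y = conv(x)

    # CUDA kernels launch asynchronously; wait for them to finish before exiting.
    if device.type == "cuda":
        torch.cuda.synchronize()
    return y


if __name__ == "__main__":
    out = run_conv2d()
    print(out.shape)
```

By analogy with my C++ workflow, I would guess the call looks something like `nvprof python conv2d_sketch.py` (the file name is hypothetical), but I have not verified that this works.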