4

I know of the existence of nvvp and nvprof, of course, but for various reasons nvprof does not want to work with my app that involves lots of shared libraries. nvidia-smi can hook into the driver to find out what's running, but I cannot find a nice way to get nvprof to attach to a running process.

There is a flag --profile-all-processes which does actually give me a message "NVPROF is profiling process 12345", but nothing further prints out. I am using CUDA 8.

How can I get a detailed performance breakdown of my CUDA kernels in this situation?

Ken Y-N
  • Try using the profiler from Nsight (Eclipse or Visual Studio edition). I've never tried the Visual Studio edition, but the Eclipse edition has a lot of tools which can profile all the running kernels. – joão gabriel s.f. May 18 '18 at 11:12
  • 6
    As far as I am aware, none of the CUDA profilers supports attaching to an already-running process. `nvprof` can profile independent processes, as you've already discovered with `--profile-all-processes`, but this `nvprof` command must be issued (so that it is running in the background) before those other processes begin, and it must still be running when the other processes end. With that proviso, you should be able to profile a separate process using `nvprof`. With `--profile-all-processes`, nothing further will print out, but a profiler file will be written when those other processes terminate. – Robert Crovella May 18 '18 at 13:21
  • 1
    @RobertCrovella I can run `--profile-all-processes` before starting the processes, but I cannot work out where the profiler file is being written to! If I use `--log-file log%p.txt`, for instance, the log files appear for each process in the same directory as I am running `nvprof`, but with nothing more than just the single "Profile process ..." line. – Ken Y-N May 21 '18 at 02:31
  • 1
    You do need to specify an output file name, and I'm pretty sure you must include the %p so it can identify the file name with a process number. Beyond that, you need to let the application finish before any profiling result will be written (unless you call `cudaProfilerStop()` earlier in the app), and even after the application finishes, if there is a huge amount of profiling data to be processed, it can take minutes or more to write the files. – Robert Crovella May 21 '18 at 02:40
  • 3
    Here are the steps I followed: 1. I started the background logging with `nvprof --profile-all-processes --log-file log%p.log &`. The ampersand is optional if you want to use a separate process to launch things. At this point an (empty) log file appeared, `log5699.log`. 2. I ran the numba cuda app I had been working on with `python t4.py`. That took about 10 seconds to run. When it had finished, a *new* log file appeared called `log5704.log`, and this file had the expected profiler output from process 5704, which was the python process. 3. I terminated background profiling with `kill 5699`. – Robert Crovella May 21 '18 at 02:53
  • I'll give that a go - `nvvp` is working for me today, though... I suspect this has been a PEBCAK, so I may delete this shortly. – Ken Y-N May 21 '18 at 03:03
  • @Ken Y-N Have you solved your problem? I'm having exactly the same problem. Robert Crovella, I don't think this is related to the log-file flag, since without this flag the output goes to stderr by default. Either way, the output remains empty. – phlegmax May 28 '18 at 08:56
  • 1
    Please do not delete this. RC's comments about how to run with `--profile-all-processes` are useful. – interestedparty333 Sep 13 '19 at 06:06

2 Answers

2

As the comments suggest, you simply have to make sure to start the CUDA profiler (these days Nsight Systems or Nsight Compute rather than nvprof) before the processes you want to profile. You could, for example, configure it to run on system startup.

Your inability to profile your application has nothing to do with it being an "app that involves lots of shared libraries" - the profiling tools profile such applications just fine.
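For the nvprof case discussed in the comments, the required ordering (profiler first, then the application, then stop the profiler) can be sketched as a small Python wrapper. This is only a sketch under assumptions: the helper names `build_nvprof_command` and `profile_app` are hypothetical, `nvprof` is assumed to be on the `PATH`, and treating `%p` as mandatory follows the comment above rather than anything I have verified. Only the `--profile-all-processes` and `--log-file` flags come from this thread.

```python
import subprocess


def build_nvprof_command(log_pattern="log%p.log"):
    """Return the argv for a background nvprof session (hypothetical helper).

    The %p placeholder gives every profiled process its own log file;
    requiring it here is an assumption taken from the comments.
    """
    if "%p" not in log_pattern:
        raise ValueError("log_pattern should contain %p (one file per process)")
    return ["nvprof", "--profile-all-processes", "--log-file", log_pattern]


def profile_app(app_argv, log_pattern="log%p.log"):
    """Start nvprof, run the application to completion, then stop nvprof.

    Assumes nvprof is installed. There is no handshake here, so nvprof may
    need a moment to start before the application launches.
    """
    # Step 1: the profiler must be running before the target process begins.
    profiler = subprocess.Popen(build_nvprof_command(log_pattern))
    try:
        # Step 2: run the application and let it finish, since results are
        # only written once the profiled process ends.
        return subprocess.run(app_argv).returncode
    finally:
        # Step 3: stop the background profiler (SIGTERM, like a plain `kill`).
        profiler.terminate()
        profiler.wait()
```

Calling `profile_app(["python", "t4.py"])` would mirror the three steps from the comments. If a large amount of profiling data is being collected, it may be safer to wait for the log file to appear before stopping the profiler.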

einpoklum
0

I've been looking for the process attach solution too but found no existing tool.

One possible direction is to use the lower-level CUDA profiling API, CUPTI, to build your own tool or to integrate profiling into an existing one. See the CUPTI documentation on dynamic detach: https://docs.nvidia.com/cupti/r_main.html#r_dynamic_detach

eval