I'm looking to collect a trace of the events that take place at the device level on a GPU.

Background / Analogy on CPU:

On a CPU, a running process A may be interrupted by another user-level process B, by system/kernel processes, or by various kinds of interrupts (hardware interrupts, network interrupts, hypervisor-related interrupts, and so on). To measure these, I would ideally patch the kernel to record the start and end times of every process and interrupt in the scheduler and the interrupt handlers, expose those kernel data structures to user level, and then read them repeatedly from a user-level program.

I want to do something similar for the GPU. How do I capture the timestamps of these interrupts and background processes? In the literature I saw that nvidia-smi can be used for gathering timestamps, but I'm unclear on how to actually instrument the GPU to get what I need.

Can anybody point me to references or explain how to instrument the GPU to get timestamps? More specifically, can nvprof or cuda-memcheck be used for this purpose?

complextea

1 Answer

You can get timestamps inside a kernel using the clock() or clock64() device functions, which read a per-multiprocessor cycle counter. You can use these, for example, to capture the start and end times of blocks and learn how the block scheduler works.
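As a rough illustration of the block start/end idea, here is a minimal sketch. The names `BlockTrace`, `current_sm_id`, `traced_kernel` and `collect_block_trace` are made up for this example, the arithmetic loop is only a placeholder workload, and error checking is kept to a minimum. It also records the SM ID via the `%smid` PTX special register, since clock64() counters are per SM and timestamps taken on different SMs are not guaranteed to be comparable.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// One record per thread block (hypothetical layout for this sketch).
struct BlockTrace {
    unsigned int sm_id;  // SM that executed the block
    long long start;     // clock64() ticks when thread 0 entered the kernel
    long long end;       // clock64() ticks after the whole block finished
};

// Read the %smid PTX special register: the ID of the SM running this thread.
__device__ unsigned int current_sm_id() {
    unsigned int id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

__global__ void traced_kernel(BlockTrace *trace, float *out, int iters) {
    long long start = clock64();

    // Placeholder workload; replace with the real kernel body.
    float x = (float)threadIdx.x;
    for (int i = 0; i < iters; ++i) x = x * 1.0001f + 0.5f;
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;

    __syncthreads();  // make sure every thread of the block has finished
    if (threadIdx.x == 0) {
        trace[blockIdx.x].sm_id = current_sm_id();
        trace[blockIdx.x].start = start;
        trace[blockIdx.x].end = clock64();
    }
}

// Launch the kernel and copy the per-block trace back to the host.
std::vector<BlockTrace> collect_block_trace(int num_blocks, int threads, int iters) {
    BlockTrace *d_trace = nullptr;
    float *d_out = nullptr;
    cudaMalloc(&d_trace, num_blocks * sizeof(BlockTrace));
    cudaMalloc(&d_out, (size_t)num_blocks * threads * sizeof(float));

    traced_kernel<<<num_blocks, threads>>>(d_trace, d_out, iters);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));

    std::vector<BlockTrace> h_trace(num_blocks);
    cudaMemcpy(h_trace.data(), d_trace, num_blocks * sizeof(BlockTrace),
               cudaMemcpyDeviceToHost);
    cudaFree(d_trace);
    cudaFree(d_out);
    return h_trace;
}

int main() {
    std::vector<BlockTrace> trace = collect_block_trace(64, 128, 10000);
    for (size_t b = 0; b < trace.size(); ++b)
        printf("block %zu  sm %u  start %lld  end %lld\n",
               b, trace[b].sm_id, trace[b].start, trace[b].end);
    return 0;
}
```

Sorting the records by SM ID and start time should show which blocks ran concurrently on an SM and which were scheduled only after an earlier block retired. Only compare timestamps that share the same SM ID.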

You can also instrument your code to time specific parts of your kernels. This can be used to gain a surprising amount of insight into the internal workings of the GPU.
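Timing one section of a kernel could look roughly like the sketch below. The names `timed_section` and `measure_section_ticks` are hypothetical, and the loop stands in for whatever code you want to time. The compiler and hardware may reorder or overlap instructions around the clock64() reads, so treat the numbers as approximate.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void timed_section(const float *in, float *out,
                              long long *ticks, int n, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;

    long long t0 = clock64();
    // Section under measurement (placeholder arithmetic).
    for (int k = 0; k < iters; ++k) v = v * 1.0001f + 0.5f;
    long long t1 = clock64();

    if (i < n) out[i] = v;  // keep the result live so the loop is not removed
    // One measurement per block (taken by thread 0) keeps the output small.
    if (threadIdx.x == 0) ticks[blockIdx.x] = t1 - t0;
}

// Run the kernel and return the elapsed ticks seen by thread 0 of each block.
std::vector<long long> measure_section_ticks(int blocks, int threads, int iters) {
    int n = blocks * threads;
    float *d_in = nullptr, *d_out = nullptr;
    long long *d_ticks = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMalloc(&d_ticks, blocks * sizeof(long long));
    cudaMemset(d_in, 0, n * sizeof(float));

    timed_section<<<blocks, threads>>>(d_in, d_out, d_ticks, n, iters);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));

    std::vector<long long> h_ticks(blocks);
    cudaMemcpy(h_ticks.data(), d_ticks, blocks * sizeof(long long),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    cudaFree(d_ticks);
    return h_ticks;
}

int main() {
    std::vector<long long> ticks = measure_section_ticks(16, 128, 10000);
    for (size_t b = 0; b < ticks.size(); ++b)
        printf("block %zu: %lld ticks\n", b, ticks[b]);
    return 0;
}
```

The values are in SM clock ticks. Converting them to seconds requires the SM clock rate, which can change at run time, so comparing tick counts between code variants on the same device is usually more reliable than quoting absolute times.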

In the early days of CUDA I used this a lot when tuning code. Nowadays, however, the NVIDIA Visual Profiler (nvvp) is so good that manual code instrumentation is rarely needed.

Note, however, that SMs (streaming multiprocessors) don't have interrupts in the same way as CPUs do. Newer GPUs are able to suspend long-running kernels to allow the GUI to remain interactive, especially during debugger sessions. But there are no interrupts to handle I/O or scheduling, because I/O hardware is all managed by the host and scheduling is performed entirely in hardware. Similarly, there are no background processes, because such tasks are much better handled by the CPU.

tera