
I'm trying to identify bottlenecks in GPU execution performance for deep learning models on a Titan V / V100. Based on https://devblogs.nvidia.com/parallelforall/programming-tensor-cores-cuda-9/, I understand that certain requirements must be met for the underlying kernel execution to be performed on Tensor Cores.

"nvprof" provides an easy way to dump all the kernel executions on GPU, but it does not seem to say whether Tensor Cores were actually used or not. Is that a way to capture such info?

n00b
  • Currently `nvprof` doesn't offer the functionality you are suggesting. At the moment, I'm not aware of much in the way of monitoring/profiling tools for the TensorCore directly. However, it should be possible in the visual profiler to witness TensorCore instructions [directly in the instruction stream](http://docs.nvidia.com/cuda/profiler-users-guide/index.html#source-assembly-view). – Robert Crovella Dec 20 '17 at 20:59
  • Thanks @RobertCrovella. I didn't know about the instruction stream drill-down. That's pretty low level, but good to know. I've filed a ticket on the NVIDIA Developer site hoping Tensor Core info can be added to the list of supported metrics in the future. – n00b Dec 21 '17 at 04:16

2 Answers


According to these slides presented by NVIDIA, "Training Neural Networks with Mixed Precision", you can use nvprof to see whether or not Tensor Cores were used.

Page 12 of the slides essentially says to run the program with nvprof and look for "884" kernels.

E.g.

$ nvprof python test.py
...
37.024us 1 37.024us 37.024us 37.024us volta_fp16_s884gemm_fp16…
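
If the trace is long, you can filter the kernel names directly. As a quick sketch (test.py stands in for your own script; the redirect is needed because nvprof writes its report to stderr):

$ nvprof --print-gpu-trace python test.py 2>&1 | grep 884
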
Jacob Beauchamp
  • That applies to the deep learning framework discussed in the slide deck. It isn't general. There is no general application to any other code. – talonmies Oct 23 '18 at 18:23
  • @talonmies You're wrong. No matter the framework, `nvprof` will monitor the GPU calls made by `test.py`, so this solution is general. This [doc](https://docs.nvidia.com/deeplearning/dgx/integrate-tf-trt/index.html) says that _To verify whether Tensor Cores are being used in your inference, you can profile your inference run with nvprof and check if all the GEMM CUDA kernels (GEMM is used by MatMul and convolution) have 884 in their name._ – MeanStreet Jan 15 '19 at 16:26
  • `nvprof` cannot be used on RTX 30-series cards. What is the solution now? – Mehrdad Sep 27 '21 at 13:53

According to the docs, NVIDIA is now adding a new set of metrics called GPM (GPU performance metrics, I guess?) to NVML. Example metrics include Tensor Core utilization (e.g. NVML_GPM_METRIC_ANY_TENSOR_UTIL) and CUDA core utilization (e.g. NVML_GPM_METRIC_FP64_UTIL).

Unfortunately, when I tried to query these metrics on my machine they were not available. I have a GeForce RTX 3090, so I guess you may need an Ada or even a Hopper device to take advantage of these cool new features.
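
For reference, here is a minimal C sketch of how the GPM API is meant to be used, pieced together from the GPM declarations in the NVML docs. The struct and field names (nvmlGpmMetricsGet_t, NVML_GPM_METRICS_GET_VERSION, etc.) are taken from recent nvml.h headers and may change between driver versions, and it will only report values on a GPM-capable device:

// gpm_tensor_util.c - query NVML_GPM_METRIC_ANY_TENSOR_UTIL via the NVML GPM API.
// Build (assuming a recent driver's nvml.h is installed):
//   gcc gpm_tensor_util.c -o gpm_tensor_util -lnvidia-ml
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <nvml.h>

int main(void)
{
    nvmlDevice_t dev;
    if (nvmlInit_v2() != NVML_SUCCESS ||
        nvmlDeviceGetHandleByIndex_v2(0, &dev) != NVML_SUCCESS) {
        fprintf(stderr, "failed to initialise NVML\n");
        return 1;
    }

    // First check whether this GPU supports GPM at all.
    nvmlGpmSupport_t support;
    memset(&support, 0, sizeof(support));
    support.version = NVML_GPM_SUPPORT_VERSION;
    if (nvmlGpmQueryDeviceSupport(dev, &support) != NVML_SUCCESS ||
        !support.isSupportedDevice) {
        fprintf(stderr, "GPM metrics not supported on this device\n");
        nvmlShutdown();
        return 1;
    }

    // GPM metrics are derived from two samples taken some time apart;
    // the utilization is averaged over that window.
    nvmlGpmSample_t s1, s2;
    nvmlGpmSampleAlloc(&s1);
    nvmlGpmSampleAlloc(&s2);
    nvmlGpmSampleGet(dev, s1);
    sleep(1);                     /* run the workload of interest here */
    nvmlGpmSampleGet(dev, s2);

    nvmlGpmMetricsGet_t mg;
    memset(&mg, 0, sizeof(mg));
    mg.version    = NVML_GPM_METRICS_GET_VERSION;
    mg.numMetrics = 1;
    mg.sample1    = s1;
    mg.sample2    = s2;
    mg.metrics[0].metricId = NVML_GPM_METRIC_ANY_TENSOR_UTIL;

    if (nvmlGpmMetricsGet(&mg) == NVML_SUCCESS &&
        mg.metrics[0].nvmlReturn == NVML_SUCCESS) {
        printf("Tensor Core utilization over the sample window: %.2f\n",
               mg.metrics[0].value);
    } else {
        fprintf(stderr, "could not compute NVML_GPM_METRIC_ANY_TENSOR_UTIL\n");
    }

    nvmlGpmSampleFree(s1);
    nvmlGpmSampleFree(s2);
    nvmlShutdown();
    return 0;
}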

yanbc