I'm trying to identify bottlenecks in GPU execution performance for deep learning models on a Titan V / V100. Based on https://devblogs.nvidia.com/parallelforall/programming-tensor-cores-cuda-9/, I understand that certain requirements must be met for the underlying kernel to actually execute on the Tensor Cores.
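For context, this is the kind of kernel that post describes — a minimal single-warp 16x16x16 FP16 multiply-accumulate through the WMMA API (just a sketch of an eligible kernel, not my actual workload):

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// Single-warp 16x16x16 FP16 matrix multiply-accumulate via the WMMA API,
// as described in the linked post; on sm_70 (Titan V / V100) this is the
// kind of operation that should map onto Tensor Cores.
__global__ void wmma_gemm_16x16x16(const half *a, const half *b, float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);               // C = 0
    wmma::load_matrix_sync(a_frag, a, 16);           // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // C += A * B
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
```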
"nvprof" provides an easy way to dump all the kernel executions on GPU, but it does not seem to say whether Tensor Cores were actually used or not. Is that a way to capture such info?