
I am interested in a way to measure the detailed performance of a custom TensorFlow Op when run on a GPU.

So far I have tried the approach of this post using a Timeline, as well as the internal TensorFlow profiler (tf.profiler.Profiler). Both deliver very similar results, which are fine if I want to investigate a whole network, but for profiling a single Op the output is too coarse and doesn't break down the intra-op calculations (at least I couldn't find a way to do that). My next try was the CUDA profiler nvprof (or nvvp, for that matter), which is a step in the right direction and shows individual CUDA kernel launches and memory allocations. But now the CPU-side calculations are not included. I tried running nvprof with --cpu-profiling on, but then the profiler never finishes (see here).
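
For reference, the Timeline-based approach I used looks roughly like this sketch (TensorFlow 1.x; my_op_output is just a placeholder for the output tensor of my custom Op):

import tensorflow as tf
from tensorflow.python.client import timeline

# Enable full tracing for a single session.run and dump a Chrome trace
# that can be inspected at chrome://tracing.
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # my_op_output stands in for the custom Op's output tensor
    sess.run(my_op_output, options=run_options, run_metadata=run_metadata)

    tl = timeline.Timeline(run_metadata.step_stats)
    with open('timeline.json', 'w') as f:
        f.write(tl.generate_chrome_trace_format())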

My scenario is the following: I wrote a custom Op that is very similar to a 2D convolution and should not take much longer to compute. In a network, my custom Op's performance is about 3 times worse than that of tf.nn.conv2d. Using tf.profiler.Profiler I get the following:

Profile:
node name                    | requested bytes             | total execution time          | accelerator execution time    | cpu execution time
CustomConv2DBackpropInput     72.09MB (100.00%, 7.04%),     194.36ms (100.00%, 38.05%),     49.82ms (100.00%, 17.61%),      144.54ms (100.00%, 63.44%)
CustomConv2D                  65.54MB (92.96%, 6.40%),      95.41ms (61.95%, 18.68%),       45.16ms (82.39%, 15.96%),       50.25ms (36.56%, 22.06%)
CustomConv2DBackpropFilter    134.48MB (86.55%, 13.14%),    72.39ms (43.27%, 14.17%),       41.22ms (66.44%, 14.56%),       31.17ms (14.50%, 13.68%)
Conv2DBackpropFilter          294.68MB (73.41%, 28.79%),    63.39ms (29.10%, 12.41%),       62.80ms (51.87%, 22.19%),       594us (0.82%, 0.26%)
Conv2DBackpropInput           230.97MB (44.62%, 22.57%),    48.77ms (16.69%, 9.55%),        48.16ms (29.68%, 17.02%),       610us (0.56%, 0.27%)
Conv2D                        225.74MB (22.06%, 22.06%),    36.50ms (7.15%, 7.15%),         35.84ms (12.66%, 12.66%),       664us (0.29%, 0.29%)

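For completeness, the profile above was produced with something along these lines (a sketch, not exact code; sess and train_op stand in for my actual session and training step):

profiler = tf.profiler.Profiler(sess.graph)

# Run one step with full tracing so the profiler sees per-Op timings.
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_meta = tf.RunMetadata()
sess.run(train_op, options=run_options, run_metadata=run_meta)
profiler.add_step(0, run_meta)

# Report requested bytes plus total / accelerator / CPU execution time per Op.
opts = (tf.profiler.ProfileOptionBuilder(
            tf.profiler.ProfileOptionBuilder.time_and_memory())
        .order_by('micros')
        .build())
profiler.profile_operations(options=opts)
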
So it seems that my custom Ops take about as much time on the GPU as the built-in ones, but more than an order of magnitude longer on the CPU. For an Op that is supposed to run on the GPU, that is not acceptable, and I'd like to find out where my Ops spend this time on the CPU. What additionally puzzles me is that my Ops seem to allocate only about a third of the GPU memory that the original Conv Ops do.

Is there a way to get a detailed profile of my custom Op (one that includes both CPU and GPU usage) that can show me what I did wrong and help me fix my mistakes?

Christoph Pohl
  • I too would be interested in seeing a more detailed profiling view. What did you look into in the end? My assumption is that Conv2D is using the Winograd implementation, which requires significantly larger memory consumption (precomputing intermediate values using the conv weights). – Roy Sep 06 '20 at 11:57

0 Answers