I am interested in a way to measure the detailed performance of a custom Tensorflow Op when run on a GPU.
So far I have tried the approach of this post using a Timeline, as well as the internal TensorFlow profiler (tf.profiler.Profiler). Both deliver very similar results, which are fine if I want to investigate a network, but for profiling a single Op the output is too coarse and doesn't include intra-Op calculations (at least I couldn't find a way to get them). My next try was the CUDA profiler nvprof (or nvvp for that matter), which goes more in the right direction and displays individual calls to CUDA kernels and memory allocations. But now the CPU calculations are not included. I tried running nvprof --cpu-profiling on, but then the profiler never finishes (see here).
My scenario is the following: I wrote a custom Op that is very similar to a convolution in 2D and should not take much more time to compute. In a network, my custom Op's performance is about 3 times worse than that of tf.nn.conv2d. Using the tf.profiler.Profiler I get the following:
Profile:
node name | requested bytes | total execution time | accelerator execution time | cpu execution time
CustomConv2DBackpropInput 72.09MB (100.00%, 7.04%), 194.36ms (100.00%, 38.05%), 49.82ms (100.00%, 17.61%), 144.54ms (100.00%, 63.44%)
CustomConv2D 65.54MB (92.96%, 6.40%), 95.41ms (61.95%, 18.68%), 45.16ms (82.39%, 15.96%), 50.25ms (36.56%, 22.06%)
CustomConv2DBackpropFilter 134.48MB (86.55%, 13.14%), 72.39ms (43.27%, 14.17%), 41.22ms (66.44%, 14.56%), 31.17ms (14.50%, 13.68%)
Conv2DBackpropFilter 294.68MB (73.41%, 28.79%), 63.39ms (29.10%, 12.41%), 62.80ms (51.87%, 22.19%), 594us (0.82%, 0.26%)
Conv2DBackpropInput 230.97MB (44.62%, 22.57%), 48.77ms (16.69%, 9.55%), 48.16ms (29.68%, 17.02%), 610us (0.56%, 0.27%)
Conv2D 225.74MB (22.06%, 22.06%), 36.50ms (7.15%, 7.15%), 35.84ms (12.66%, 12.66%), 664us (0.29%, 0.29%)
So it seems that my custom Ops take about as much time on the GPU as the built-in convolution Ops, but more than an order of magnitude longer on the CPU. For a GPU Op that is not acceptable, and I'd like to find out where my Ops spend this time on the CPU. What additionally puzzles me is that my Ops seem to allocate only one third of the GPU memory that the original conv Ops do.
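To narrow down the CPU overhead, I've also been measuring the Op in isolation with a plain wall-clock micro-benchmark instead of inside the full network. This is a framework-agnostic sketch; `benchmark` is my own helper, and the commented usage line assumes a TF1-style session and fetch tensor:

```python
import time

def benchmark(fn, warmup=10, iters=100):
    """Average wall-clock seconds per call of fn, after warm-up runs.

    The warm-up runs exclude one-time costs (kernel autotuning,
    allocator growth, graph optimization) that would otherwise
    dominate the measurement.
    """
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters

# Hypothetical usage (assumes a TF1-style session and feed dict):
# avg_s = benchmark(lambda: sess.run(custom_conv_out, feed_dict=feed))
```

Comparing the averaged call time of my Op against tf.nn.conv2d on identical inputs confirms the roughly 3x gap, but still doesn't tell me which part of the CPU-side code is responsible.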
Is there a way to get a detailed profile of my custom Op (which includes CPU and GPU usage) that can explain to me what I did wrong and help me fix my mistakes?