Can I measure the execution time of individual operations with TensorFlow?

Question

I know I can measure the execution time of a call to sess.run(), but is it possible to get a finer granularity and measure the execution time of individual operations?

Olivier Moindrot · Answer 1 · 2016-12-26T17:03:12.930

111

I have used the Timeline object to get the time of execution for each node in the graph:

Call sess.run() as usual, but also pass the optional arguments options and run_metadata.
Then create a Timeline object from the run_metadata.step_stats data.

Here is an example program that measures the performance of a matrix multiplication:

import tensorflow as tf
from tensorflow.python.client import timeline

x = tf.random_normal([1000, 1000])
y = tf.random_normal([1000, 1000])
res = tf.matmul(x, y)

# Run the graph with full trace option
with tf.Session() as sess:
    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()
    sess.run(res, options=run_options, run_metadata=run_metadata)

    # Create the Timeline object, and write it to a json
    tl = timeline.Timeline(run_metadata.step_stats)
    ctf = tl.generate_chrome_trace_format()
    with open('timeline.json', 'w') as f:
        f.write(ctf)

You can then open Google Chrome, go to the page chrome://tracing, and load the timeline.json file to see when each op ran and how long it took.
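If you would rather read the numbers than browse the trace, the file can also be summarized in plain Python. The following is a minimal sketch, assuming the file follows the Chrome Trace Event JSON object format (a top-level "traceEvents" list in which complete events have "ph" equal to "X" and a "dur" in microseconds). The function name total_duration_by_op and the sample trace are hypothetical and are not part of TensorFlow.

```python
import json
from collections import defaultdict


def total_duration_by_op(trace_json):
    """Sum the durations (microseconds) of complete events, grouped by event name."""
    trace = json.loads(trace_json)
    totals = defaultdict(int)
    for event in trace.get("traceEvents", []):
        # "X" marks a complete event, which carries a start ("ts") and a duration ("dur").
        if event.get("ph") == "X":
            totals[event["name"]] += event.get("dur", 0)
    # Largest total duration first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical, hand-written trace used only to exercise the function.
sample_trace = json.dumps({
    "traceEvents": [
        {"ph": "M", "name": "process_name", "pid": 0, "args": {"name": "/cpu:0"}},
        {"ph": "X", "name": "MatMul", "pid": 0, "tid": 0, "ts": 100, "dur": 900},
        {"ph": "X", "name": "RandomStandardNormal", "pid": 0, "tid": 0, "ts": 0, "dur": 40},
        {"ph": "X", "name": "RandomStandardNormal", "pid": 0, "tid": 1, "ts": 0, "dur": 45},
    ]
})

for op_name, micros in total_duration_by_op(sample_trace):
    print(f"{op_name}: {micros} us")
```

To use it on a real trace, read timeline.json into a string and pass that string to the function in place of sample_trace.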

edited Dec 26 '16 at 17:03

answered Jun 12 '16 at 12:52

Olivier Moindrot

Hi! I tried creating a Timeline for my Network training, but unfortunately doing it as you showed only produces a timeline for the last invocation of session.run. Is there a way to aggregate the timeline over all sessions? – fat-lobyte Jul 22 '16 at 10:01
Using TensorFlow 0.12.0-rc0, I found that I needed to make sure that libcupti.so/libcupti.dylib was in the library path in order for this to work. For me (on Mac), I added `/usr/local/cuda/extras/CUPTI/lib` to the `DYLD_LIBRARY_PATH`. – Daniel Trebbien Dec 12 '16 at 16:41
Or `LD_LIBRARY_PATH=/usr/local/cuda/extras/CUPTI/lib64:${LD_LIBRARY_PATH}` on Ubuntu – Justin Harris Apr 30 '18 at 23:40
Why is there an add operator here? – user2991421 Aug 07 '18 at 00:35
Because when calling `tf.random_normal`, TensorFlow first creates a random tensor with mean 0 and variance 1. It then multiplies by the standard deviation (1 here) and adds the mean (0 here). – Olivier Moindrot Aug 07 '18 at 10:03
My `.json` is around 1.7 GB and I'm unable to open it in Chrome. Is there any other way to open it like in tensorboard ? – Vedanshu Feb 25 '20 at 09:06

score 28 · Accepted Answer · answered Dec 15 '15 at 19:41

There is not yet a way to do this in the public release. We are aware that it's an important feature and we are working on it.

answered Dec 15 '15 at 19:41

Ian Goodfellow

Is it possible that there is an update to this answer? Because https://github.com/tensorflow/tensorflow/issues/899 seems as if one could probably calculate the FLOPs for individual operations which could give insights into the execution time. – Martin Thoma Dec 20 '16 at 22:07

score 23 · Answer 3 · answered Dec 16 '17 at 00:18

Since this ranks high when googling for "Tensorflow Profiling", note that the current (late 2017, TensorFlow 1.4) way of getting the Timeline is to use a ProfilerHook. This works with the MonitoredSession used by tf.estimator.Estimator, where tf.RunOptions are not available.

estimator = tf.estimator.Estimator(model_fn=...)
hook = tf.train.ProfilerHook(save_steps=10, output_dir='.')
estimator.train(input_fn=..., steps=..., hooks=[hook])

score 14 · Answer 4 · edited Apr 04 '18 at 20:24

You can extract this information using runtime statistics (https://www.tensorflow.org/programmers_guide/graph_viz#runtime_statistics). You will need to do something like this (check the full example at that link):

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
sess.run(<values_you_want_to_execute>, options=run_options, run_metadata=run_metadata)
your_writer.add_run_metadata(run_metadata, 'step%d' % i)

Rather than just printing it, you can view it in TensorBoard.

Additionally, clicking on a node will display the exact total memory, compute time, and tensor output sizes.

edited Apr 04 '18 at 20:24

benjaminplanche

answered Apr 29 '17 at 05:57

Salvador Dali

The link (https://www.tensorflow.org/programmers_guide/graph_viz#runtime_statistics) has been updated. – benjaminplanche Apr 04 '18 at 17:09

score 11 · Answer 5 · answered Jun 07 '16 at 01:13

To update this answer, we do have some functionality for CPU profiling, focused on inference. If you look at https://github.com/tensorflow/tensorflow/tree/master/tensorflow/tools/benchmark you'll see a program you can run on a model to get per-op timings.

answered Jun 07 '16 at 01:13

Pete Warden

What about getting GPU benchmarks on raw Tensorflow operations? – brethvoice Aug 19 '21 at 17:10

score 2 · Answer 6 · answered Oct 12 '17 at 21:42

Regarding fat-lobyte's comment under Olivier Moindrot's answer: if you want to gather the timeline over all calls to sess.run(), you can change "open('timeline.json', 'w')" to "open('timeline.json', 'a')".
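Note that appending writes several JSON documents one after another into the same file, which a JSON parser will not read as a single document. As a possible alternative, here is a minimal sketch that merges the traces in Python before writing one file. It assumes each trace string is a JSON object with a top-level "traceEvents" list, as in the Chrome Trace Event format. The function name merge_chrome_traces and the sample traces are hypothetical, and duplicate metadata events are kept as they are.

```python
import json


def merge_chrome_traces(trace_jsons):
    """Combine several Chrome-trace JSON strings into one JSON document."""
    merged_events = []
    for trace_json in trace_jsons:
        # Collect the events of every run into a single list.
        merged_events.extend(json.loads(trace_json).get("traceEvents", []))
    return json.dumps({"traceEvents": merged_events})


# Hypothetical traces standing in for the strings produced by two separate runs.
step_0 = json.dumps({"traceEvents": [{"ph": "X", "name": "MatMul", "ts": 0, "dur": 900}]})
step_1 = json.dumps({"traceEvents": [{"ph": "X", "name": "MatMul", "ts": 2000, "dur": 870}]})

merged = merge_chrome_traces([step_0, step_1])
print(len(json.loads(merged)["traceEvents"]))
```

In a training loop, you would collect the string returned by generate_chrome_trace_format() after each traced run, then write the merged result to timeline.json once at the end.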

answered Oct 12 '17 at 21:42

Cheney

score 1 · Answer 7 · answered May 02 '18 at 12:26

As of TensorFlow 1.8, there's a really good example of using tf.profiler.Profiler here.

answered May 02 '18 at 12:26

nitred

The link is dead, is there an updated version ? (still for TF 1.x) – Astariul Feb 22 '21 at 03:35

score 1 · Answer 8 · answered Jan 13 '20 at 10:14

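TensorFlow 2.0 compatible answer: you can use the profiling support built into the Keras TensorBoard callback.

The code for that is:

from datetime import datetime

import tensorflow as tf

log_dir = "logs/profile/" + datetime.now().strftime("%Y%m%d-%H%M%S")

# profile_batch=3 profiles batch number 3 of training.
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1, profile_batch=3)

# model and train_data are assumed to be defined elsewhere.
model.fit(train_data,
          steps_per_epoch=20,
          epochs=5,
          callbacks=[tensorboard_callback])

For more details on profiling, refer to the TensorBoard profiling documentation.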

Michael Litvin · Answer 9 · 2022-03-09T13:25:06.980

This works with Tensorflow 2 (tested with TF 2.5 and 2.8):

import tensorflow as tf

tf.profiler.experimental.start(r'/path/to/logdir')
with tf.profiler.experimental.Trace("My cool model", _r=1):
    run_model_that_you_want_to_profile()
tf.profiler.experimental.stop()

Then you can see the trace in Tensorboard (tensorboard --logdir /path/to/logdir, then open http://localhost:6006/#profile in the browser).

Might also be useful:

Guide: Optimize TensorFlow performance using the Profiler
tf.summary.trace_on() (didn't try it myself)
This colab tutorial on using the Tensorboard profiler

score -1 · Answer 10 · edited Apr 25 '21 at 18:09

The SBNet custom op library recently released by Uber (http://www.github.com/uber/sbnet) has an implementation of CUDA-event-based timers, which can be used in the following manner:

with tf.control_dependencies([input1, input2]):
    dt0 = sbnet_module.cuda_timer_start()
with tf.control_dependencies([dt0]):
    input1 = tf.identity(input1)
    input2 = tf.identity(input2)

### portion of subgraph to time goes in here

with tf.control_dependencies([result1, result2, dt0]):
    cuda_time = sbnet_module.cuda_timer_end(dt0)
with tf.control_dependencies([cuda_time]):
    result1 = tf.identity(result1)
    result2 = tf.identity(result2)

py_result1, py_result2, dt = session.run([result1, result2, cuda_time])
print("Milliseconds elapsed=", dt)

Note that because any portion of the subgraph can be asynchronous, you should be very careful to specify all the input and output dependencies for the timer ops. Otherwise, the timer might be inserted into the graph out of order and you can get an erroneous time. I found both the timeline and time.time() timing to be of very limited utility for profiling TensorFlow graphs. Also note that the cuda_timer APIs synchronize on the default stream, which is currently by design because TF uses multiple streams.

Having said this, I personally recommend switching to PyTorch :) Development iteration is faster, code runs faster and everything is a lot less painful.

Another somewhat hacky and arcane approach to subtracting the overhead of tf.Session (which can be enormous) is to replicate the graph N times, run it for two different values of N, and solve for the unknown fixed overhead. That is, you measure the time around session.run() with N1=10 and N2=20, where the time of one copy of the graph is t and the fixed overhead is x. So something like

N1*t + x = t1
N2*t + x = t2

Solve for x and t. The downside is that this might require a lot of memory and is not necessarily accurate :) Also make sure that your inputs are completely different/random/independent, otherwise TF will fold the entire subgraph and not run it N times... Have fun with TensorFlow :)
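The two-measurement idea can be worked through in a few lines. This is a minimal sketch, taking t as the time of one copy of the graph and x as the fixed overhead, so that a run of N copies is assumed to cost N*t + x. The function name split_overhead and the timings are hypothetical and are not measured values.

```python
def split_overhead(n1, t1, n2, t2):
    """Solve n1*t + x = t1 and n2*t + x = t2 for per-copy time t and fixed overhead x."""
    if n1 == n2:
        raise ValueError("n1 and n2 must differ to separate t from x")
    # Subtracting the two equations eliminates x.
    t = (t2 - t1) / (n2 - n1)
    # Substitute t back into the first equation to recover x.
    x = t1 - n1 * t
    return t, x


# Hypothetical measurements: 10 copies took 0.25 s and 20 copies took 0.40 s.
per_copy, overhead = split_overhead(10, 0.25, 20, 0.40)
print(per_copy, overhead)
```

With only two measurements any timing noise goes straight into the estimates, so repeating each measurement and averaging is likely to give more stable values.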

This example is lacking a complete set of variables or suggestion on how to create them. When I clicked on the sbnet repo in Github, it appears to be 3-4 years stale anyway. — brethvoice, Aug 19 '21 at 17:07
