TensorFlow - Profiling using timeline - Understand what is limiting the system

Question

I am trying to understand why each training iteration takes approximately 1.5 seconds. I used the tracing method described here. I am working on a Titan X Pascal GPU. My results look very strange: every operation seems relatively fast, and the system is idle most of the time between operations. How can I tell from this what is limiting the system? I did notice, however, that when I drastically reduce the batch size the gaps close, as can be seen here.

Unfortunately, the code is very complicated and I can't post a small version of it that has the same problem.

Is there a way to understand from the profiler what is taking the space in the gaps between operations?

Thanks!

EDIT:

On CPU only I do not see this behavior:

I am running a

BTW, there is no need to use timeline now. Take a look at [my answer here](http://stackoverflow.com/a/43692312/1090562) to see how you can debug your model via tensorboard. — Salvador Dali, May 07 '17 at 08:58
Thanks, but for some reason I don't see the Node Stats in my TB... — aarbelle, May 07 '17 at 09:54
Some thoughts: some things could be not reflected in timeline -- time spent transferring data through feed dict, grpc latency. Do you have similar gaps if you run on CPU only? Could stuff be waiting on some dequeue operations? You can also insert tf.Print nodes and look at the timestamps generated there. — Yaroslav Bulatov, May 07 '17 at 17:00
I tried it. It is a bit difficult to insert all those tf.Prints and to understand exactly what happens when... Is there maybe another option? — aarbelle, May 11 '17 at 05:10
What is happening in between the training iterations and could this be slowing things down? Where is your batched data? Is it local or remote? What kind of data is it; i.e. is it high-dimensional? Is there any processing/transformation of the data taking place in between training iterations? And roughly what species of network architecture are you using? You might also find some helpful suggestions here: https://www.tensorflow.org/performance/performance_guide, particularly related to queueing data. — unsupervised_learner, May 17 '17 at 19:53
Hi, nothing happens between iterations. There is a for loop on the sess.run(train_step) only. Everything is in queues, and from TensorBoard it seems that all queues are full at all times, so I don't think it's data flow that is slowing things down. The queue capacity is 10*batch_size. I am using a relatively small fully convolutional network with only 3x3 and 1x1 kernels. — aarbelle, May 18 '17 at 05:07

score 0 · Answer 1 · answered May 12 '17 at 13:35

Here are a few guesses, but it's hard to say without a self-contained reproduction that I can run and debug.

1. Is it possible you are running out of GPU memory? One signal of this is log messages of the form "Allocator ... ran out of memory" during training. If you run out of GPU memory, the allocator backs off and waits in the hope that more becomes available. This might explain the large inter-operator gaps that go away if you reduce the batch size.
2. As Yaroslav suggests in a comment above, what happens if you run the model on CPU only? What does the timeline look like?
3. Is this a distributed training job or a single-machine job? If it's a distributed job, does a single-machine version show the same behavior?
4. Are you calling session.run() or eval() many times, or just once per training step? Every run() or eval() call will drain the GPU pipeline, so for efficiency you usually need to express your computation as one big graph with only a single run() call. (I doubt this is your problem, but I mention it for completeness.)
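
To narrow down where the idle time sits, it may help to measure the gaps directly from the saved timeline file rather than reading them off the viewer. The following is a minimal sketch, not part of the original answer, and it assumes the trace was saved in Chrome trace format (a JSON object with a `traceEvents` list whose complete events have `ph == "X"` and `ts`/`dur` in microseconds). The names `find_gaps`, `load_trace`, and `min_gap_us`, and the file name `timeline.json`, are hypothetical. It only reports which ops sit on either side of each gap; it cannot say what the device was waiting for.

```python
import json
from collections import defaultdict


def load_trace(path):
    """Load a Chrome-trace-format JSON file (path is an assumption)."""
    with open(path) as f:
        return json.load(f)


def find_gaps(trace, min_gap_us=1000):
    """Return idle gaps of at least min_gap_us on each (pid, tid) lane.

    Only complete events (ph == "X") are considered. Overlapping events
    on the same lane are handled by tracking the latest end time seen.
    """
    lanes = defaultdict(list)
    for ev in trace.get("traceEvents", []):
        if ev.get("ph") == "X":
            lanes[(ev.get("pid"), ev.get("tid"))].append(ev)

    gaps = []
    for lane, events in lanes.items():
        events.sort(key=lambda e: e["ts"])
        first = events[0]
        busy_until = first["ts"] + first.get("dur", 0)
        last_op = first.get("name", "?")
        for ev in events[1:]:
            idle = ev["ts"] - busy_until
            if idle >= min_gap_us:
                gaps.append({
                    "lane": lane,
                    "after": last_op,
                    "before": ev.get("name", "?"),
                    "idle_us": idle,
                })
            end = ev["ts"] + ev.get("dur", 0)
            if end > busy_until:
                busy_until = end
                last_op = ev.get("name", "?")

    # Largest gaps first, since those dominate the step time.
    gaps.sort(key=lambda g: g["idle_us"], reverse=True)
    return gaps


# Tiny hand-made trace for illustration (made-up numbers, not real data).
example = {"traceEvents": [
    {"ph": "M", "name": "process_name", "pid": 1, "args": {"name": "gpu"}},
    {"ph": "X", "name": "Conv2D", "pid": 1, "tid": 0, "ts": 0, "dur": 200},
    {"ph": "X", "name": "Relu", "pid": 1, "tid": 0, "ts": 250, "dur": 50},
    {"ph": "X", "name": "Conv2D", "pid": 1, "tid": 0, "ts": 5300, "dur": 200},
]}

for g in find_gaps(example):
    print(f"{g['idle_us']} us idle on lane {g['lane']} "
          f"between {g['after']} and {g['before']}")
```

With a real trace, something like `find_gaps(load_trace("timeline.json"))` would list the largest gaps first. If the same op consistently follows the long gaps, that op's inputs are a reasonable place to look next; if the gaps are spread evenly, the cause is more likely outside the graph, for example in the things mentioned in the comments that the timeline does not record.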

1. I am not running out of memory. I am using only about 10% of my GPU memory. 2. This does not happen on CPU only. I added the timeline to the original question. 3. This is a single-machine job. 4. I am calling sess.run() once for each training step. Thanks! — aarbelle, May 14 '17 at 09:12
