Since there is currently no easy way to profile TensorFlow operations (see: Can I measure the execution time of individual operations with TensorFlow?), can anyone help me understand the benefits of using segment operations (e.g. `segment_sum`) compared to using multiple operations on pre-segmented tensors? Would `segment_sum` be more efficient than `dynamic_partition` or `gather` followed by multiple `reduce_sum` calls? Would `segment_sum` be equally parallelizable?
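For concreteness, here is a minimal sketch of the two approaches being compared. This is illustrative only: it uses TF 1.x graph-mode APIs, and the toy data and segment layout are invented for the example.

```python
import tensorflow as tf

# Toy data: six values in three (sorted) segments.
data = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
segment_ids = tf.constant([0, 0, 1, 1, 1, 2])

# Approach 1: a single fused segment op.
seg = tf.segment_sum(data, segment_ids)

# Approach 2: pre-segment the tensor, then reduce each part separately.
parts = tf.dynamic_partition(data, segment_ids, num_partitions=3)
multi = tf.stack([tf.reduce_sum(p) for p in parts])

with tf.Session() as sess:
    print(sess.run(seg))    # [ 3. 12.  6.]
    print(sess.run(multi))  # [ 3. 12.  6.]
```

The fused op avoids materializing the partitioned intermediates, which is the intuitive argument in its favor; whether that translates into a measurable win is exactly what this question is asking.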

1 Answer
I've updated the SO question you link to with some information about the CPU inference profiling tools we've recently released: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/tools/benchmark
Unfortunately the overall question is a lot harder to answer, since it depends on:
Whether you're focused on training, or inference.
If you're using a GPU, and if so what kind and how many.
Whether you're running distributed.
What your data looks like, and where the bottlenecks are.
What I usually end up doing is building small sub-graphs that are representative of the sort of ops I'm considering, and then timing how long they take on the sort of data I'll be feeding in. I know that isn't immediately helpful, since the experimentation can be time-consuming, but it is the best way to get an intuitive understanding of the optimal solution for your particular circumstances.
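As a concrete illustration of that workflow, a minimal timing sketch might look like the following (TF 1.x style; the tensor size, segment layout, and run count are placeholders you would match to your real data):

```python
import time
import numpy as np
import tensorflow as tf

# Placeholder problem size: one million values in one thousand segments.
n, num_segments = 1000000, 1000
data = tf.constant(np.random.randn(n).astype(np.float32))
segment_ids = tf.constant(np.repeat(np.arange(num_segments), n // num_segments))

op = tf.segment_sum(data, segment_ids)

with tf.Session() as sess:
    sess.run(op)  # warm-up run: excludes one-time graph/kernel setup costs
    runs = 100
    start = time.time()
    for _ in range(runs):
        sess.run(op)
    print('mean wall time: %.5f s' % ((time.time() - start) / runs))
```

Note that each `sess.run` also copies the result back to the host, so on a GPU this measures end-to-end cost rather than the kernel alone; swapping in the `dynamic_partition`/`reduce_sum` variant from the question gives the comparison.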

- Just wondering why you mention training or inference as a factor? I'm asking about the properties of an operation, and I'm not actually using TF to implement CNNs (or other NNs). Regarding your other points, can TF automatically parallelize an operation (or even a subgraph) across multiple GPUs? I thought it cannot. Let's assume for the sake of this question that we are running each of the two cases on a single GPU (i.e. I am not going to compare the single-op solution to a multi-op solution distributed across multiple GPUs). – Andrzej Pronobis Jun 07 '16 at 01:38