My understanding is that TensorFlow creates two thread pools for each device: one for intra-op parallelism and one for inter-op parallelism.
Suppose there are 3 independent ops A, B, and C placed on `/gpu:0`, and `intra_op_parallelism_threads=5`. If A and B each have a single-threaded GPU kernel implementation while C has a multi-threaded one, does that mean all three can run in parallel on the same device, with A and B each using a single GPU thread and C using up to the remaining 3?
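Concretely, I'm assuming the limit is set per session via `tf.ConfigProto` (a minimal sketch using the TF 1.x API):

```python
import tensorflow as tf  # TF 1.x API

# Minimal sketch of the setup described above: cap the intra-op
# thread pool at 5 threads for this session.
config = tf.ConfigProto(intra_op_parallelism_threads=5)
sess = tf.Session(config=config)
```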
Now suppose `inter_op_parallelism_threads=2`. Does that mean only 2 of the 3 ops can be evaluated simultaneously on `/gpu:0`, so in the example above it might be A+B, B+C, or A+C, depending on which ops get scheduled first?
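To make the scenario concrete, here is a sketch of the graph I have in mind; the three `tf.matmul` ops are hypothetical stand-ins for A, B, and C, with no data dependencies among them:

```python
import tensorflow as tf  # TF 1.x API

# Hypothetical stand-ins for the independent ops A, B, and C,
# all placed on the same GPU.
with tf.device('/gpu:0'):
    a = tf.matmul(tf.random_normal([512, 512]), tf.random_normal([512, 512]))  # op A
    b = tf.matmul(tf.random_normal([512, 512]), tf.random_normal([512, 512]))  # op B
    c = tf.matmul(tf.random_normal([512, 512]), tf.random_normal([512, 512]))  # op C

config = tf.ConfigProto(intra_op_parallelism_threads=5,
                        inter_op_parallelism_threads=2)
with tf.Session(config=config) as sess:
    # On my reading, at most 2 of {a, b, c} would be dispatched
    # concurrently here. Is that correct?
    sess.run([a, b, c])
```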
Note: I'm trying to make sense of @mrry's answer to this question: "Tensorflow: executing an ops with a specific core of a CPU".