Assume this code:
```python
w1 = tf.get_variable(...)
w2 = tf.get_variable(...)
x = ...
y1 = tf.matmul(x, w1)
y2 = tf.matmul(x, w2)
session.run([y1, y2], ...)
```
TensorFlow can potentially run ops in parallel (controlled via the option `inter_op_parallelism_threads`).
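For reference, a minimal sketch of how that option is set (TF1-style `ConfigProto` API; the thread counts are just placeholder values):

```python
import tensorflow as tf

config = tf.ConfigProto(
    inter_op_parallelism_threads=2,   # executor threads that can run independent ops in parallel
    intra_op_parallelism_threads=4)   # threads used inside a single op (e.g. Eigen CPU kernels)
session = tf.Session(config=config)
```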
My question: Will it actually do that in this case (`matmul`)? And, extending on that: does it do so for all kinds of GPU ops? I think to do that, it would need to create multiple CUDA streams, right? Does it do that automatically (and how)? Or will the ops be executed sequentially on the GPU?
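(One way I could imagine checking this empirically, sketched here under the assumption that the standard tracing API captures GPU kernel times: record a step with `RunMetadata` and inspect the timeline to see whether the two matmul kernels overlap.)

```python
from tensorflow.python.client import timeline

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
session.run([y1, y2], options=run_options, run_metadata=run_metadata)

# Dump a Chrome-trace file; open it in chrome://tracing and check
# whether the two matmul kernels overlap in time on the GPU.
tl = timeline.Timeline(run_metadata.step_stats)
with open('timeline.json', 'w') as f:
    f.write(tl.generate_chrome_trace_format())
```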
(Note that for this simple example, you could also rewrite the code by concatenating `w1` and `w2`, then doing a single `matmul`, and then splitting afterwards, as sketched below. But that is not my question.)
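For completeness, that rewrite would look roughly like this (assuming `w1` and `w2` have the same shape):

```python
w = tf.concat([w1, w2], axis=1)   # stack the weight matrices column-wise
y = tf.matmul(x, w)               # one big matmul instead of two smaller ones
y1, y2 = tf.split(y, 2, axis=1)   # split the result back into the two outputs
session.run([y1, y2])
```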
(Related is this question, whose answer basically says that TensorFlow always uses a single CUDA stream for all GPU ops, and thus that this will not run in parallel. I'm not sure whether that is up-to-date, though.)