Assume this code:
```python
w1 = tf.get_variable(...)
w2 = tf.get_variable(...)
x = ...
y1 = tf.matmul(x, w1)
y2 = tf.matmul(x, w2)
session.run([y1, y2], ...)
```
TensorFlow can potentially run ops in parallel (controlled via the option `inter_op_parallelism_threads`).
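For reference, a minimal sketch of how that option is set (TF1-style `ConfigProto` API; the thread counts are just placeholder values):

```python
import tensorflow as tf

config = tf.ConfigProto(
    inter_op_parallelism_threads=2,   # executor threads that can run independent ops in parallel
    intra_op_parallelism_threads=4)   # threads used inside a single op (e.g. Eigen CPU kernels)
session = tf.Session(config=config)
```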
My question: Will it actually do that in this case (`matmul`)? And, extending on that: does it do so for all kinds of GPU ops? I think to do that, it would need to create multiple CUDA streams, right? Does it do that automatically (and how)? Or will the ops be executed sequentially on the GPU?
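(One way I could imagine checking this empirically, sketched here under the assumption that the standard tracing API captures GPU kernel times: record a step with `RunMetadata` and inspect the timeline to see whether the two matmul kernels overlap.)

```python
from tensorflow.python.client import timeline

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
session.run([y1, y2], options=run_options, run_metadata=run_metadata)

# Dump a Chrome-trace file; open it in chrome://tracing and check
# whether the two matmul kernels overlap in time on the GPU.
tl = timeline.Timeline(run_metadata.step_stats)
with open('timeline.json', 'w') as f:
    f.write(tl.generate_chrome_trace_format())
```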
(Note that for this simple example, you could also rewrite the code by concatenating `w1` and `w2`, then doing a single `matmul`, and then splitting afterwards, as sketched below. But that is not my question.)
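For completeness, that rewrite would look roughly like this (assuming `w1` and `w2` have the same shape):

```python
w = tf.concat([w1, w2], axis=1)   # stack the weight matrices column-wise
y = tf.matmul(x, w)               # one big matmul instead of two smaller ones
y1, y2 = tf.split(y, 2, axis=1)   # split the result back into the two outputs
session.run([y1, y2])
```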
(Related is this question, whose answer basically says that TensorFlow always uses a single CUDA stream for all GPU ops, and thus that this will not run in parallel. I'm not sure whether that is up-to-date, though.)