Let's say I have the following line of code in TensorFlow (Python interface):

z = tf.matmul(W_1,x_1) + tf.matmul(W_2,x_2) + ... + tf.matmul(W_N, x_N) + b

All of the above N operations are independent, and the result is accumulated in z. Will TensorFlow, for example, launch N kernels independently and then accumulate the result, or will it process N operations in series?

I ask because this affects how much effort I need to expend vectorizing operations, at the expense of readability and convenience. What I am hoping is that TF launches all N GPU kernels asynchronously, accumulates the output in z, and returns the result.

Additionally, assuming TF does process the above statement in parallel, are there any limitations on this? For instance, if I were to accumulate z in a for loop (as sketched below), or over several lines with intermediate variables, would I lose this benefit?
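Concretely, the loop version I have in mind would look something like this (Ws and xs here are hypothetical lists holding the same tensors as above):

    # Accumulating z term by term in a Python loop
    z = b
    for W, x in zip(Ws, xs):
        z = z + tf.matmul(W, x)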

Jonathan

1 Answer

Yes, TensorFlow runs independent paths of computation within a single session.run call in parallel, controlled by the inter_op_parallelism_threads parameter of the session's ConfigProto. You can use tf.add_n for your sum. If you have multiple session.run calls, you need to parallelize things yourself, for example by launching them in separate Python threads.
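A minimal sketch of that, assuming the TF 1.x API (the shapes, and the names Ws, xs, and N, are made up for illustration):

    import numpy as np
    import tensorflow as tf

    N = 4
    # Made-up shapes purely for illustration
    Ws = [tf.constant(np.random.randn(3, 5).astype(np.float32)) for _ in range(N)]
    xs = [tf.constant(np.random.randn(5, 1).astype(np.float32)) for _ in range(N)]
    b = tf.constant(np.zeros((3, 1), dtype=np.float32))

    # Each matmul is an independent node in the graph; tf.add_n sums them all at once
    terms = [tf.matmul(W, x) for W, x in zip(Ws, xs)]
    z = tf.add_n(terms) + b

    # inter_op_parallelism_threads bounds how many independent ops the runtime
    # may execute concurrently within a single session.run call
    config = tf.ConfigProto(inter_op_parallelism_threads=N)
    with tf.Session(config=config) as sess:
        print(sess.run(z))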

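For the multiple-session.run case, a sketch of the threading approach (the two computations here are again made up; tf.Session objects are safe to use from multiple threads concurrently):

    import threading
    import tensorflow as tf

    # Two independent computations to evaluate concurrently
    a = tf.matmul(tf.ones((1000, 1000)), tf.ones((1000, 1000)))
    c = tf.matmul(tf.ones((1000, 1000)), tf.ones((1000, 1000)))

    sess = tf.Session()
    results = {}

    def run_op(name, op):
        # Each thread issues its own blocking session.run call
        results[name] = sess.run(op)

    threads = [threading.Thread(target=run_op, args=(name, op))
               for name, op in [("a", a), ("c", c)]]
    for t in threads:
        t.start()
    for t in threads:
        t.join()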
Yaroslav Bulatov