Let's say I have the following line of code in TensorFlow (Python interface):
z = tf.matmul(W_1, x_1) + tf.matmul(W_2, x_2) + ... + tf.matmul(W_N, x_N) + b
All N of these matmul operations are independent, and their results are accumulated in z. Will TensorFlow, for example, launch the N kernels concurrently and then accumulate the results, or will it process the N operations in series?
I ask because the answer affects how much effort I need to spend manually vectorizing these operations, at the cost of readability and convenience. What I am hoping is that TF launches all N GPU kernels asynchronously, accumulates their outputs into z, and returns the result.
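For reference, this is roughly the manual vectorization I would like to avoid (shapes and names are purely for illustration, and it assumes every W_i and every x_i has the same shape):

    import tensorflow as tf

    # Toy shapes, purely for illustration: N weight matrices of shape (m, k) and
    # N inputs of shape (k, n), so each product W_i x_i is (m, n).
    N, m, k, n = 4, 3, 5, 2
    Ws = [tf.ones((m, k)) for _ in range(N)]
    xs = [tf.ones((k, n)) for _ in range(N)]
    b = tf.zeros((m, n))

    # Stack into (N, m, k) and (N, k, n), do a single batched matmul,
    # then sum over the leading axis and add the bias.
    W_stack = tf.stack(Ws)                                      # (N, m, k)
    x_stack = tf.stack(xs)                                      # (N, k, n)
    z = tf.reduce_sum(tf.matmul(W_stack, x_stack), axis=0) + b  # (m, n)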
Additionally, assuming TF does process the above statement in parallel, are there any limitations on this? For instance, if I were to accumulate z in a for loop (or over several lines with intermediate variables), would I lose this benefit?
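Concretely, I mean an accumulation like the following, reusing the toy Ws, xs, and b from the snippet above:

    # The loop form I have in mind; each iteration reads the previous z,
    # so the additions form a chain of intermediate values.
    z = b
    for W_i, x_i in zip(Ws, xs):
        z = z + tf.matmul(W_i, x_i)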