As I understand it, TensorFlow invokes multiple operators in parallel as long as they are independent. (link)

The parallelism can be controlled by inter_op_parallelism_threads and intra_op_parallelism_threads if the operators are running on the CPU (link). However, these parameters do not affect GPU operators at all. How can I control the parallelism of GPU operators? (For example, how can I run operators serially even though they are independent?)
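For reference, here is a minimal sketch of how those CPU thread-pool settings are applied (assuming the TF 1.x Session API; the value 1 is only an illustration):

import tensorflow as tf

# Both pools limited to one thread: independent CPU ops run serially,
# and each op uses a single thread internally
config = tf.ConfigProto(inter_op_parallelism_threads=1,
                        intra_op_parallelism_threads=1)
sess = tf.Session(config=config)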

EDIT: For example, in the graph below, x and y are independent of each other, while z depends on both:

import tensorflow as tf

N = 1024  # example size

a = tf.random_normal([N, N])
b = tf.random_normal([N, N])
c = tf.random_normal([N, N])
d = tf.random_normal([N, N])

x = tf.matmul(a, b)  # independent of y
y = tf.matmul(c, d)  # independent of x
z = tf.matmul(x, y)  # depends on both
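
One possible way to force the independent matmuls above to run one after another (a sketch, assuming TF 1.x graph mode; I don't know whether this is the recommended mechanism for GPU) is an explicit control dependency:

x = tf.matmul(a, b)
# y has no data dependency on x, but this forces y to start only after x finishes
with tf.control_dependencies([x]):
    y = tf.matmul(c, d)
z = tf.matmul(x, y)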

1 Answer


Here's a way to profile execution that avoids common pitfalls:

import time
import tensorflow as tf

# Turn off graph-rewriting optimizations
config = tf.ConfigProto(graph_options=tf.GraphOptions(
    optimizer_options=tf.OptimizerOptions(opt_level=tf.OptimizerOptions.L0)))

# throw error if explicit device placement can't be satisfied
config.allow_soft_placement = False

N = 8192
with tf.device("/gpu:0"):
    input1 = tf.Variable(tf.random_normal([N,N]))
    input2 = tf.Variable(tf.random_normal([N,N]))
    result = tf.matmul(input1, input2)
    result_no_output = result.op # to avoid transferring data back to Python
sess = tf.Session(config=config)

# load values into GPU
sess.run(tf.global_variables_initializer())

# pre-warming
sess.run(result_no_output)

num_ops = N**3 + N**2*(N-1)  # N^3 muls, N^2 (N-1) adds
elapsed = []
for i in range(10):
    start = time.time()
    sess.run(result_no_output)
    elapsed.append(time.time() - start)

print("%d x %d matmul, %.2f elapsed, %.2f G ops/sec"%(N, N, min(elapsed), num_ops/min(elapsed)/10**9))

On a Titan X Pascal this shows 9.5 T ops/sec, which is close to the theoretical maximum of 11 T ops/sec:

8192 x 8192 matmul, 0.12 elapsed, 9527.10 G ops/sec
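
To check whether independent ops actually overlap on the GPU, a step-stats timeline can be dumped and inspected in chrome://tracing (a sketch, assuming TF 1.x; the file name timeline.json is arbitrary):

from tensorflow.python.client import timeline

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
sess.run(result_no_output, options=run_options, run_metadata=run_metadata)

# Write a Chrome-trace-format timeline of this step
trace = timeline.Timeline(step_stats=run_metadata.step_stats)
with open("timeline.json", "w") as f:
    f.write(trace.generate_chrome_trace_format())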