
I have two computers with the same GPU (GTX 1080) and the same OS and software installed by the same method. But when I run my TensorFlow program (an RNN model), the speeds are very different: one machine is about 1.5x faster than the other.

Here are the key specs of the two:

System A: Asus Z170-P, i7 6700T, 32 GB RAM, GTX 1080.
System B: Asus X99 E-WS, i7 5930K, 128 GB RAM, GTX 1080. (the problem machine)

Both are installed with (using the same method):

OS: Ubuntu 16.04
GPU driver version: 378.13
CUDA version: 8.0
cuDNN version: 5.1
TensorFlow: installed with pip install tensorflow-gpu==1.0.1
Python: Anaconda, Python 3.6
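
A quick way to confirm the two machines really report the same stack (a minimal sketch using standard TF 1.x APIs; run it on both boxes and diff the output):

import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.__version__)                       # should print 1.0.1 on both machines
for d in device_lib.list_local_devices():   # should list the CPU and one GTX 1080
    print(d.name, d.physical_device_desc)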

Sample code:

import tensorflow as tf
import numpy as np
from tqdm import trange

h, w = 3000, 2000
steps = 1000

# One large matmul per step; the input is fed from host memory each iteration.
x = tf.placeholder(dtype=tf.float32, shape=[h, w], name='x')
t = tf.constant(np.random.random(size=[w, w]), dtype=tf.float32)
m = tf.matmul(x, t)

x0 = np.random.random(size=[h, w])
sess = tf.Session()
for i in trange(steps):
    # Each run feeds x0 to the GPU and fetches the result back to the host.
    x0 = sess.run(m, feed_dict={x: x0})

System A runs at 75 iter/sec while system B only reaches 50 iter/sec (the iterations/second figure reported by tqdm); yes, the machine with the weaker specs is actually the faster one.

Key observations:

  1. System B incurs far more page faults while running the program.
  2. Monitoring Volatile GPU-Util in nvidia-smi, system A sits stably at about 40% while system B hovers around 30%.

Things I have tried on system B:

  1. Upgrade the BIOS to the latest version and reset it to default settings.
  2. Call Asus customer service for help.
  3. Swap the GPU card with system A.
  4. Change the PCI-e slot to make sure it is running at x16 Gen 3.
  5. Add LD_PRELOAD="/usr/lib/libtcmalloc.so" to the .bashrc file (see the check after this list).
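
One way to confirm the preload actually took effect inside the Python process (a quick sketch; /proc/self/maps lists every library mapped into the running process):

# run in a shell that has LD_PRELOAD set
with open('/proc/self/maps') as f:    # memory map of this Python process
    print('tcmalloc' in f.read())     # True if libtcmalloc was preloaded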

The main differences in the output of /usr/bin/time -v are:

# The first value is for system B and the second is for system A.
System time (seconds): 7.28  2.95
Percent of CPU this job got: 85%  106%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:22.41  0:14.89
Minor (reclaiming a frame) page faults: 684695  97853
Involuntary context switches: 164  91063
File system inputs: 0  24
File system outputs: 8  0
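
(Minor page faults are resolved from memory without disk I/O, so the gap points at host memory-allocation behavior rather than at storage or the GPU itself.)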

Can anybody point me in a direction for profiling/debugging this issue? Many thanks in advance!

1 Answer

There is a chance that you may not be using the GPU at all. To test this, use

sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))

to display which device each op is placed on.

If you are indeed running on the CPU, then you can add the following before your TensorFlow code

with tf.device('/gpu:0'):  # NEW LINE
    x = tf.placeholder(dtype=tf.float32, shape=[h, w], name='x')
    t = tf.constant(np.random.random(size=[w, w]), dtype=tf.float32)
    m = tf.matmul(x,t)

If this isn't the case, add a comment with your results and I'll follow up to see what else I can do.

According to some sources, tf.constant is a GPU memory hog. Try replacing

t = tf.constant(np.random.random(size=[w, w]), dtype=tf.float32)

with

t = tf.Variable(np.random.random(size=[w, w]), dtype=tf.float32)
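
Note that unlike tf.constant, a tf.Variable must be initialized before the first sess.run. A minimal sketch of the extra step (standard TF 1.x API):

sess = tf.Session()
sess.run(tf.global_variables_initializer())  # initializes t before the matmul loop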

Trying a model without CPU-GPU data traffic (the input is generated on the device and only a scalar is fetched):

import tensorflow as tf
import numpy as np
from tqdm import trange

h,w = 3000, 2000
steps = 1000

# Input is generated on the GPU each step, so no feed_dict traffic is needed.
x = tf.random_normal([h, w], dtype=tf.float32)
t = tf.constant(np.random.random(size=[w, w]), dtype=tf.float32)
m = tf.matmul(x, t)
s = tf.reduce_mean(m)  # reduce to a scalar so only one float crosses the bus

sess = tf.Session()
for i in range(steps):
    sess.run(s)
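
To compare iterations/second the same way on both machines, a plain-Python timing wrapper is enough (nothing TF-specific; it replaces the loop above):

import time

start = time.time()
for i in range(steps):
    sess.run(s)
print('%.1f iter/sec' % (steps / (time.time() - start)))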

Results of experimentation with Xer:

After much discussion and troubleshooting, it has become apparent that the two machines are indeed different. The Nvidia cards were swapped, which resulted in no change. The machines have two different CPUs: one with an integrated graphics processor and one without, and with different clock speeds. I suggested that on the machine whose i7 has onboard graphics, the OS's graphical windowing system be kept off the GTX 1080, to make sure the test was unused GPU vs unused GPU. The problem persisted.

The original problem as posted creates huge amounts of data traffic across the main bus from the CPU to the Nvidia GPU, as can be seen here:

     Tx Throughput               : 75000 KB/s
     Rx Throughput               : 151000 KB/s
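
(These throughput figures come from nvidia-smi; `nvidia-smi dmon -s t` prints per-second PCIe rx/tx columns, assuming the driver exposes those counters on this board.)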

We experimented with changing the size of the problem (W = 2000, W = 1000, and W = 200) and found that when W was small enough, the two machines performed nearly identically. W, though, controls not only the size of the matmul on the GPU but also the amount of traffic between the CPU and the GPU.
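
For a sense of scale (simple arithmetic, not a measurement): with h = 3000, each step feeds and fetches a 3000 x W float32 array, so W = 2000 means roughly 3000 * 2000 * 4 bytes, or about 24 MB, in each direction per iteration, while W = 200 cuts that to about 2.4 MB.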

Although we did not find a fix or an exact model, I believe that after much exploration with @Xer I can say with confidence that the two systems are not the same, and that their physical differences (bus + CPU) account for the performance difference.

  • Thank you so much for responding to my question! Unfortunately, the issue remains even with all ops manually placed on the GPU: less than 30% `gpu-volatile` and 50 iter/sec. – Xer May 10 '17 at 04:50
  • what did the log_device_placement report? – Anton Codes May 10 '17 at 04:52
  • it reports :`MatMul: (MatMul): /job:localhost/replica:0/task:0/gpu:0 I tensorflow/core/common_runtime/simple_placer.cc:841] MatMul: (MatMul)/job:localhost/replica:0/task:0/gpu:0 Const: (Const): /job:localhost/replica:0/task:0/gpu:0 I tensorflow/core/common_runtime/simple_placer.cc:841] Const: (Const)/job:localhost/replica:0/task:0/gpu:0 x: (Placeholder): /job:localhost/replica:0/task:0/gpu:0 I tensorflow/core/common_runtime/simple_placer.cc:841] x: (Placeholder)/job:localhost/replica:0/task:0/gpu:0` – Xer May 10 '17 at 04:53
  • I added another possibility: that it is tf.constant. Can you try that and see? Also, can you try decreasing the size of W by a factor of 10? – Anton Codes May 10 '17 at 04:55
  • I set W to 200, and the volatile is about 20%, speed is 561 iter/sec. Sorry, I don't quite get your meaning about tf.constant; it is under the `tf.device('/gpu:0')` block at the moment, should I move it out? – Xer May 10 '17 at 05:00
  • The comment about "t" is to try a search-and-replace of "tf.constant" with "tf.Variable", rerun, and see the performance. – Anton Codes May 10 '17 at 05:03
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/143827/discussion-between-wontonimo-and-xer). – Anton Codes May 10 '17 at 05:04
  • Yes, it's definitely the CPU-GPU traffic causing the problem. Running your no-traffic network code, sysB outperforms sysA. I would really like to know how I can possibly fix the bus+CPU on system B. – Xer May 10 '17 at 14:54
  • I think I've solved your original question. Please mark this as the correct answer :) . It was fun working through this with you. Hardware configuration would be another question. – Anton Codes May 10 '17 at 14:57
  • Absolutely! And I am going to try my luck on Super User :) – Xer May 10 '17 at 14:59