
I was running the shakespeare.py demo from prettytensor and wondered how using the CPU vs. the GPU affects the training runtime per batch. I therefore added the following timing code around the sess.run call in local_trainer.py:

tick = time.time()  # add `import time` at the top of the file if it is not already there
results = sess.run(ops, dict(zip(feed_vars, data)))
print('done in %.4f secs' % (time.time() - tick))

The sess.run call is on line 309, inside the run_model function.

I then forced the training to happen on the CPU by setting `export CUDA_VISIBLE_DEVICES=""`. I monitored GPU usage through `watch -n 1 nvidia-smi` as well as `watch -n 1 lsof /dev/nvidia*`, so I am sure the GPU was not touched. Surprisingly, running on the CPU was faster (~0.2 secs per batch) than on the GPU (~0.25 secs per batch).
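For what it's worth, device placement can also be checked from within tensorflow itself. A minimal standalone sketch (not the prettytensor trainer) using `log_device_placement` and `tf.device` would look roughly like this:

import tensorflow as tf

# Build a small graph pinned explicitly to the CPU (use '/gpu:0' to compare).
with tf.device('/cpu:0'):
    a = tf.random_normal([1000, 1000])
    b = tf.matmul(a, a)

# log_device_placement prints the device chosen for every op when the
# session starts, so misplaced ops show up immediately.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
print(sess.run(tf.reduce_sum(b)))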

When monitoring CPU usage through htop I observed that all 8 CPU threads were nicely used. Together with the communication overhead that using the GPU creates, this could be a possible explanation, I guess. Also, maybe the model is too small to actually benefit from the GPU's compute power, and/or my GPU is just too low-end.

My question is: have you ever observed similar behavior using prettytensor or tensorflow (maybe also cases where the GPU version was a bit faster, but not dramatically so)? Do these explanations make sense, or is this behavior too weird to be true? Are there other tools/tricks I could use to figure out what exactly is going on on the GPU when accessing it through prettytensor (or tensorflow, for that matter)? I am aware of the timeline feature from tensorflow as described here, but I find it a bit hard to decipher.
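For reference, this is roughly how I understand the timeline can be hooked into the sess.run call shown above (a sketch only; the resulting timeline.json can be loaded in chrome://tracing):

import tensorflow as tf
from tensorflow.python.client import timeline

# Trace a single batch by passing RunOptions/RunMetadata to the existing sess.run.
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
results = sess.run(ops, dict(zip(feed_vars, data)),
                   options=run_options, run_metadata=run_metadata)

# Dump a Chrome trace of the step for inspection in chrome://tracing.
tl = timeline.Timeline(run_metadata.step_stats)
with open('timeline.json', 'w') as f:
    f.write(tl.generate_chrome_trace_format())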

My GPU: NVIDIA GeForce GT 730 (2 GB), compute capability 3.5 (major: 3, minor: 5), memory clock rate 0.9015 GHz

My CPU: Intel(R) Core(TM) i7-4790K @ 4.00GHz, 4 cores (2 hyperthreads per core)

kafman

1 Answer


These observations make sense; I also had a model that ran slightly slower on the GPU than on the CPU. In that case the network was small, so the whole process was CPU-bound, and the CPU->GPU->CPU transfers slowed things down a little.

What you can try in your case is to run the model on the GPU and check whether GPU utilization (via nvidia-smi) is low while, at the same time, CPU utilization is high.
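A minimal sketch of such a check (assuming the installed nvidia-smi supports the --query-gpu interface, which drivers for older cards may not expose) is to poll utilization from a small script while training runs in another shell:

import subprocess
import time

# Poll GPU utilization and memory usage once per second while training runs
# elsewhere; compare the numbers before and after the training process starts.
while True:
    out = subprocess.check_output(
        ['nvidia-smi',
         '--query-gpu=utilization.gpu,utilization.memory,memory.used',
         '--format=csv,noheader'])
    print(out.decode().strip())
    time.sleep(1)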

sygi
  • CPU usage is not that high when running it on the GPU. `nvidia-smi` reports usage of 1.8 GB (of 2GB) but I guess this is because tensorflow allocates as much memory as possible, even if it is not directly used. Unfortunately `nvidia-smi` does not output per-process statistics on my system. – kafman Dec 15 '16 at 14:29
  • I thought of rather checking the `GPU-Util` tab in `nvidia-smi`. It is not per-process, but you can see if it changes significantly before and after start of your training process. – sygi Dec 15 '16 at 14:31
  • Oh ok, just checked that: my `nvidia-smi` does not output that either but I monitored it through `nvidia-settings -q GPUUtilization` and it showed > 90 % utilization. So I guess there's just too much overhead going into CPU/GPU communication. – kafman Dec 15 '16 at 14:44