I was running the demo `shakespeare.py` from prettytensor and wondered how using the CPU vs. the GPU affects the training runtime per batch. I thus added the following lines in `local_trainer.py`:
```python
# time one sess.run() call, i.e. one training batch
tick = time.time()
results = sess.run(ops, dict(zip(feed_vars, data)))
print('done in %.4f secs' % (time.time() - tick))
```
which is located on line 309 in the `run_model` function.
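A more robust way to time this would probably be to average over several batches and skip the first run, which includes one-off setup (memory allocation and, on the GPU, CUDA kernel loading). A rough sketch of what I mean, where `batch_iterator` is just a placeholder for however the trainer actually produces batches:

```python
import time

# Average the per-batch time over several steps and discard the first one,
# since the first sess.run() carries one-off setup costs.
num_batches = 50
times = []
for i in range(num_batches):
    data = next(batch_iterator)  # placeholder: however a batch is obtained
    tick = time.time()
    results = sess.run(ops, dict(zip(feed_vars, data)))
    elapsed = time.time() - tick
    if i > 0:  # skip the warm-up step
        times.append(elapsed)
print('mean time per batch: %.4f secs' % (sum(times) / len(times)))
```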
I then forced the training to happen on the CPU by setting `export CUDA_VISIBLE_DEVICES=""`. I monitored GPU usage through `watch -n 1 nvidia-smi` as well as `watch -n 1 lsof /dev/nvidia*`, so I'm sure the GPU was not touched. Surprisingly, running on the CPU was faster (~0.2 secs per batch) than on the GPU (~0.25 secs).
When monitoring CPU usage through `htop`, I observed that all 8 CPU threads were nicely used. Together with the communication overhead that using the GPU creates, this could be a possible explanation, I guess. Also, maybe the model is too small to actually benefit from the GPU's computing power, and/or my GPU is simply too low-end.
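One way to test the "all 8 threads help the CPU" theory would be to pin TensorFlow to a single thread and see whether the CPU still wins. As far as I know the thread pools are configurable via the session config (again assuming the trainer's session construction can be reached):

```python
import tensorflow as tf

# Restrict TensorFlow's thread pools so the CPU run is single-threaded,
# making the CPU-vs-GPU comparison less about core count.
config = tf.ConfigProto(intra_op_parallelism_threads=1,
                        inter_op_parallelism_threads=1)
sess = tf.Session(config=config)
```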
My question is: have you ever observed similar behavior with prettytensor or TensorFlow (maybe also cases where the GPU version was only slightly faster)? Do these explanations make sense, or is this behavior too weird to be true? Are there other tools or tricks I could use to figure out what exactly is going on on the GPU when accessing it through prettytensor (or TensorFlow, for that matter)? I am aware of the timeline feature from TensorFlow as described here, but I find its output a bit hard to decipher.
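For reference, this is roughly the kind of tracing I mean, wrapped around the `sess.run` call above; the resulting JSON can be opened in `chrome://tracing`:

```python
import tensorflow as tf
from tensorflow.python.client import timeline

# Collect a full trace for one training step and dump it as a Chrome trace.
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
results = sess.run(ops, dict(zip(feed_vars, data)),
                   options=run_options, run_metadata=run_metadata)
tl = timeline.Timeline(run_metadata.step_stats)
with open('timeline.json', 'w') as f:
    f.write(tl.generate_chrome_trace_format())
```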
My GPU: NVIDIA GeForce GT 730 (2 GB), major: 3, minor: 5, memoryClockRate (GHz): 0.9015
My CPU: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz, 4 cores (2 hyperthreads per core)