I have two questions about distributed TensorFlow. To understand its behavior, I wrote the code in the following gist: https://gist.github.com/wsjeon/12bded1e3c4f81c775622f72e74c007b
- The code works on some runs and fails on others. When it fails, I get the following error message:
    Traceback (most recent call last):
      File "main.py", line 39, in <module>
        _, step = sess.run([assign_op, global_step])
      File "/home/wsjeon/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 767, in run
        run_metadata_ptr)
      File "/home/wsjeon/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 965, in _run
        feed_dict_string, options, run_metadata)
      File "/home/wsjeon/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1015, in _do_run
        target_list, options, run_metadata)
      File "/home/wsjeon/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1035, in _do_call
        raise type(e)(node_def, op, message)
    tensorflow.python.framework.errors_impl.InvalidArgumentError: /job:worker/replica:0/task:0/gpu:0 unknown device.
      [[Node: local/add_S3 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:0/cpu:0", send_device="/job:worker/replica:0/task:0/gpu:0", send_device_incarnation=4764918041242746699, tensor_name="edge_7_local/add", tensor_type=DT_INT32, _device="/job:ps/replica:0/task:0/cpu:0"]()]]
I cannot figure out why this occurs.
- I used two workers to increment a global counter. However, I found that some of the numbers are duplicated. How can I fix this?
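
To make the second question concrete, here is a plain-Python sketch (not TensorFlow) of what I suspect is happening. The `Counter` class and its methods are hypothetical stand-ins for the shared `global_step` variable, and the interleaving in the first half is written out by hand. It is only my guess at the cause: if each worker increments the counter and then reads it in a separate step, another worker can increment in between, so both read the same value. Returning the new value from the increment itself avoids this in the sketch.

```python
import threading


class Counter(object):
    """Hypothetical stand-in for the shared global_step variable."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Update only; the caller has to read the value separately.
        with self._lock:
            self._value += 1

    def read(self):
        with self._lock:
            return self._value

    def increment_and_get(self):
        # Update and read under one lock, so each caller sees its own value.
        with self._lock:
            self._value += 1
            return self._value


# Suspected pattern: increment and read are separate steps.
# The interleaving is written out by hand to keep it deterministic.
racy = Counter()
racy.increment()                          # worker 0 increments
racy.increment()                          # worker 1 increments before worker 0 reads
racy_steps = [racy.read(), racy.read()]   # both workers read the same value


def run_worker(counter, num_steps, out):
    # Each worker records the value returned by its own increment.
    for _ in range(num_steps):
        out.append(counter.increment_and_get())


atomic = Counter()
results = [[], []]
threads = [
    threading.Thread(target=run_worker, args=(atomic, 1000, results[i]))
    for i in range(2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
atomic_steps = sorted(results[0] + results[1])

print(racy_steps)                                   # [2, 2]
print(len(set(atomic_steps)) == len(atomic_steps))  # True
```

If this is really the cause, the TensorFlow analogue might be to use the tensor returned by the update op as the step instead of fetching `global_step` alongside it in `sess.run`, but I have not verified that this removes the duplicates.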