I have a question about distributed TensorFlow. To understand its behavior, I wrote the code in the following gist (https://gist.github.com/wsjeon/12bded1e3c4f81c775622f72e74c007b). I have two questions.

  1. The code above sometimes works and sometimes fails. When it fails, I get the following error message:
Traceback (most recent call last):
  File "main.py", line 39, in <module>
    _, step = sess.run([assign_op, global_step])
  File "/home/wsjeon/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 767, in run
    run_metadata_ptr)
  File "/home/wsjeon/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 965, in _run
    feed_dict_string, options, run_metadata)
  File "/home/wsjeon/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1015, in _do_run
    target_list, options, run_metadata)
  File "/home/wsjeon/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1035, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: /job:worker/replica:0/task:0/gpu:0 unknown device.
     [[Node: local/add_S3 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:0/cpu:0", send_device="/job:worker/replica:0/task:0/gpu:0", send_device_incarnation=4764918041242746699, tensor_name="edge_7_local/add", tensor_type=DT_INT32, _device="/job:ps/replica:0/task:0/cpu:0"]()]]

I cannot figure out why this occurs.

  2. I used two "workers" to update a global counter. However, I found that some of the numbers are duplicated. How can I fix this?
