
I read the documentation on Distributed TensorFlow and have a question about between-graph replication. https://www.tensorflow.org/versions/master/how_tos/distributed/index.html

In my understanding, between-graph replication creates the same number of graphs as there are workers, and the graphs share tf.Variables placed on parameter servers. That is, each worker creates one session and one graph, and all graphs share the same tf.Variables.

However, I thought that two different sessions cannot share the same tf.Variable. Is that a misunderstanding? Below is a rough sketch of my current understanding (the hostnames, ports, and task_index are placeholders), using tf.train.replica_device_setter to pin variables to the parameter server:
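
```python
import tensorflow as tf

# Placeholder cluster: one parameter server and two workers.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# Each worker runs this script with its own task_index (e.g. from a flag).
task_index = 0
server = tf.train.Server(cluster, job_name="worker", task_index=task_index)

# replica_device_setter places variables on the "ps" job, so every
# worker's graph refers to the same variables on the parameter server.
with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    global_step = tf.Variable(0, name="global_step", trainable=False)
    # ... rest of the model ...

with tf.Session(server.target) as sess:
    sess.run(tf.global_variables_initializer())
    # ... training loop ...
```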

Shuhei Fujiwara
  • Variables created in distributed sessions (i.e., `Session("grpc://...")`) are special; unlike in direct sessions, those variables are persistent – Yaroslav Bulatov Oct 17 '16 at 18:08

1 Answer


For your last question:

"Can two different sessions share the same tf.Variable?"

  1. For distributed sessions (e.g. `Session("grpc://...")`), they can.
  2. For direct sessions, they can't.

In distributed training, variables are managed by the tf.train.Server, so they persist across sessions. Remember that the server is created before any session and lives longer than a tf.Session. Here is a minimal sketch of that behavior (the variable and its values are just for illustration), using an in-process server so everything runs in one process:
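
```python
import tensorflow as tf

# An in-process server stands in for a real cluster node.
server = tf.train.Server.create_local_server()

v = tf.Variable(10.0, name="v")
assign_op = v.assign(42.0)

# First distributed session: initialize and update the variable.
with tf.Session(server.target) as sess1:
    sess1.run(tf.global_variables_initializer())
    sess1.run(assign_op)

# Second distributed session against the same server: the value is still
# there, because the server, not the session, owns the variable's state.
with tf.Session(server.target) as sess2:
    print(sess2.run(v))  # 42.0

# With direct sessions (tf.Session() with no target), each session keeps
# its own copy, and v would be uninitialized in the second session.
```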

Changming Sun
  • Could you provide the code location of this feature? I'm confused about this part too. I notice in graph_mgr that sub-graphs within a session can use the same OpKernel, but how do different sessions (created by multiple workers) share the variable? – hakunami Sep 03 '19 at 12:32
  • @hakunami Good question. I wonder if this is related to my question [on `shared_name` of variables and other TF resources](https://stackoverflow.com/questions/61993100/shared-name-for-tensorflow-operations). – Albert May 29 '20 at 15:02