Tensorflow FailedPreconditionError, but all variables have been initialized

Question

EDIT: After trying several things, I have added the following to my code:

with tf.Session(graph=self.graph) as session:
    session.run(tf.initialize_all_variables())
    try:
        session.run(tf.assert_variables_initialized())
    except tf.errors.FailedPreconditionError:
        raise RuntimeError("Not all variables initialized!")

Now this occasionally fails: tf.assert_variables_initialized() raises FailedPreconditionError even though tf.initialize_all_variables() was executed immediately before it. Does anyone have any idea how this can happen?

Original question:

Background

I'm running a cross-validated (CV) hyperparameter search on a basic neural net built with Tensorflow, using GradientDescentOptimizer. At seemingly random moments I get a FailedPreconditionError for different Variables. For example (full stack trace at the end of the post):

FailedPreconditionError: Attempting to use uninitialized value Variable_5
     [[Node: Variable_5/read = Identity[T=DT_FLOAT, _class=["loc:@Variable_5"], _device="/job:localhost/replica:0/task:0/gpu:0"](Variable_5)]]

Some runs fail fairly quickly, others don't; one has been running for 15 hours now without problems. I'm running this in parallel on multiple GPUs: not the optimization itself, but each CV fold.

What I've checked

From this and this post I understand that this error occurs when attempting to use Variables that haven't been initialized using tf.initialize_all_variables(). However, I am 99% certain that I'm doing this (and if I weren't, I'd expect it to always fail); I'll post the code below.

The API doc says that

This exception is most commonly raised when running an operation that reads a tf.Variable before it has been initialized.

"Most commonly" suggests that it can also be raised in different scenarios. So, for now the main question:

Question: are there other scenarios under which this exception may be raised, and what are they?

Code

MLP class:

class MLP(object):
    def __init__(self, n_in, hidden_config, n_out, optimizer, f_transfer=tf.nn.tanh, f_loss=mean_squared_error,
                 f_out=tf.identity, seed=None, global_step=None, graph=None, dropout_keep_ratio=1):

        self.graph = tf.Graph() if graph is None else graph           
        # all variables defined below
        with self.graph.as_default():
            self.X = tf.placeholder(tf.float32, shape=(None, n_in))
            self.y = tf.placeholder(tf.float32, shape=(None, n_out))
            self._init_weights(n_in, hidden_config, n_out, seed)
            self._init_computations(f_transfer, f_loss, f_out)
            self._init_optimizer(optimizer, global_step)

    def fit_validate(self, X, y, val_X, val_y, val_f, iters=100, val_step=1):
        # [snip]
        with tf.Session(graph=self.graph) as session:
            tf.initialize_all_variables().run()  # <-- VAR INIT HERE
            for i in xrange(iters):
                # [snip: get minibatch here]
                _, l = session.run([self.optimizer, self.loss], feed_dict={self.X: X_batch, self.y: y_batch})
                # validate
                if i % val_step == 0:
                    val_yhat = self.validation_yhat.eval(feed_dict=val_feed_dict, session=session)

As you can see, tf.initialize_all_variables().run() is always called before anything else is done. The net is initialized as:

def estimator_getter(params):
    # [snip]
    graph = tf.Graph()
    with graph.as_default():
        global_step = tf.Variable(0, trainable=False)
        learning_rate = tf.train.exponential_decay(params.get('learning_rate',0.1), global_step, decay_steps, decay_rate)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    net = MLP(config_num_inputs[config_id], hidden, 1, optimizer, seed=params.get('seed',100), global_step=global_step, graph=graph, dropout_keep_ratio=dropout)

Full example stack trace:

FailedPreconditionError: Attempting to use uninitialized value Variable_5
     [[Node: Variable_5/read = Identity[T=DT_FLOAT, _class=["loc:@Variable_5"], _device="/job:localhost/replica:0/task:0/gpu:0"](Variable_5)]]
Caused by op u'Variable_5/read', defined at:
  File "tf_paramsearch.py", line 373, in <module>
    randomized_search_params(int(sys.argv[1]))
  File "tf_paramsearch.py", line 356, in randomized_search_params
    hypersearch.fit()
  File "/home/centos/ODQ/main/python/odq/cv.py", line 430, in fit
    return self._fit(sampled_params)
  File "/home/centos/ODQ/main/python/odq/cv.py", line 190, in _fit
    for train_key, test_key in self.cv)
  File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 766, in __call__
    n_jobs = self._initialize_pool()
  File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 537, in _initialize_pool
    self._pool = MemmapingPool(n_jobs, **poolargs)
  File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/site-packages/sklearn/externals/joblib/pool.py", line 580, in __init__
    super(MemmapingPool, self).__init__(**poolargs)
  File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/site-packages/sklearn/externals/joblib/pool.py", line 418, in __init__
    super(PicklingPool, self).__init__(**poolargs)
  File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/multiprocessing/pool.py", line 159, in __init__
    self._repopulate_pool()
  File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/multiprocessing/pool.py", line 223, in _repopulate_pool
    w.start()
  File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/multiprocessing/process.py", line 130, in start
    self._popen = Popen(self)
  File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/multiprocessing/forking.py", line 126, in __init__
    code = process_obj._bootstrap()
  File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/multiprocessing/pool.py", line 113, in worker
    result = (True, func(*args, **kwds))
  File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 130, in __call__
    return self.func(*args, **kwargs)
  File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 72, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "/home/centos/ODQ/main/python/odq/cv.py", line 131, in _fold_runner
    estimator = estimator_getter(parameters)
  File "tf_paramsearch.py", line 264, in estimator_getter
    net = MLP(config_num_inputs[config_id], hidden, 1, optimizer, seed=params.get('seed',100), global_step=global_step, graph=graph, dropout_keep_ratio=dropout)
  File "tf_paramsearch.py", line 86, in __init__
    self._init_weights(n_in, hidden_config, n_out, seed)
  File "tf_paramsearch.py", line 105, in _init_weights
    self.out_weights = tf.Variable(tf.truncated_normal([hidden_config[-1], n_out], stddev=stdev))
  File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/ops/variables.py", line 206, in __init__
    dtype=dtype)
  File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/ops/variables.py", line 275, in _init_from_args
    self._snapshot = array_ops.identity(self._variable, name="read")
  File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 523, in identity
    return _op_def_lib.apply_op("Identity", input=input, name=name)
  File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/ops/op_def_library.py", line 655, in apply_op
    op_def=op_def)
  File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2117, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1128, in __init__
    self._traceback = _extract_stack()

One potential issue I see is that you are mixing the default session and an explicit session. I.e., you call "initialize_all_variables().run()", which uses the default session, but later you specify the session explicitly. So perhaps you are running your initializer in the wrong session? I prefer to always have one default session with its associated default graph; that way you don't need "with" blocks and are less likely to use the wrong session/graph — Yaroslav Bulatov, Apr 21 '16 at 19:34
PS: I just ran your original snippets ("initialize_all_variables" followed by "assert_..") 10k times and didn't get any failures. — Yaroslav Bulatov, Apr 21 '16 at 20:03
Thanks, yeah, that's one of the things I tried: I changed that line to `session.run(tf.initialize_all_variables())`, to no avail. And yes, it doesn't always fail (and I assume my code has a problem somewhere, whereas yours probably doesn't); I have one session still running without problems. The only difference I can see is that the nets in that session have more input features than in the others; the rest of the code is exactly the same. — Matt, Apr 22 '16 at 07:47

score 5 · Answer 1 · answered Apr 22 '16 at 09:57

Ok, I've found the problem. There was a rare condition in my code that caused one of the hidden layers to be created with shape (0, N), i.e. with no inputs. In that case, Tensorflow apparently fails to initialize the variables belonging to that layer.

While this makes sense, it might be useful for Tensorflow to log a warning message in such cases (btw, I also tried to set Tensorflow logging to debug mode, but couldn't find how -- tf.logging.set_verbosity() didn't seem to have an effect).
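A cheap safeguard is to validate the layer sizes before any Variable is created. The plain-Python sketch below assumes the MLP builds one weight matrix per pair of consecutive sizes n_in -> hidden_config[0] -> ... -> hidden_config[-1] -> n_out, which is what the _init_weights(n_in, hidden_config, n_out, seed) call suggests but the post does not confirm. The helper names layer_weight_shapes and check_layer_shapes are hypothetical.

```python
def layer_weight_shapes(n_in, hidden_config, n_out):
    """Return the (rows, cols) shape of each weight matrix in a fully
    connected net n_in -> hidden_config[0] -> ... -> hidden_config[-1] -> n_out.
    (Assumed layout; hypothetical helper, not part of Tensorflow.)"""
    sizes = [n_in] + list(hidden_config) + [n_out]
    return [(sizes[i], sizes[i + 1]) for i in range(len(sizes) - 1)]


def check_layer_shapes(n_in, hidden_config, n_out):
    """Raise ValueError if any weight matrix would have a zero (or negative)
    dimension, e.g. the (0, N) shape described above. Returns the shapes."""
    shapes = layer_weight_shapes(n_in, hidden_config, n_out)
    for idx, (rows, cols) in enumerate(shapes):
        if rows <= 0 or cols <= 0:
            raise ValueError(
                "weight matrix %d would have shape (%d, %d); "
                "every layer needs at least one unit" % (idx, rows, cols))
    return shapes
```

Calling check_layer_shapes(n_in, hidden_config, n_out) at the top of MLP.__init__ should turn an intermittent FailedPreconditionError into an immediate ValueError that names the offending layer, which is much easier to trace back to the sampled hyperparameters.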

answered Apr 22 '16 at 09:57

Matt

hm, that should probably be fixed on the TF side... maybe file an issue on GitHub with a small repro? – Yaroslav Bulatov Apr 22 '16 at 13:52
Thanks! Sorry never got around to filing the Github issue – Matt Apr 30 '16 at 11:11

score 1 · Answer 2 · answered Apr 21 '16 at 19:36

BTW, for efficiency and fewer bugs, you could follow this pattern:

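import tensorflow as tf

tf.reset_default_graph()
a = tf.constant(1)
# ... add more operations to your graph here ...
b = tf.Variable(1)
compute_op = a + b  # example op that reads the Variable
init_op = tf.initialize_all_variables()
tf.get_default_graph().finalize()  # adding ops after this point raises an error

sess = tf.InteractiveSession()
sess.run(init_op)
sess.run(compute_op)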

Calling finalize() prevents you from modifying the graph between runs, which is slow in the current version. Also, because there is one session and one graph, you don't need with blocks.

answered Apr 21 '16 at 19:36

Yaroslav Bulatov

Thanks, I'll give it a try. – Matt Apr 22 '16 at 07:50
Found the problem, see my answer below. Thanks for the help – Matt Apr 22 '16 at 09:58

score 0 · Answer 3 · answered May 30 '18 at 10:13

For me the solution was

with sess.as_default():
    result = compute_fn([seed_input,1])

See the question "FailedPreconditionError: Attempting to use uninitialized in Tensorflow" for other options and my explanation.

Strangely, session.run() is not the same as running a function under sess.as_default(); I tried both.
