I get a loss error when training an AdaNet network following the tutorial notebook (Adanet_objective), with only the dataset switched. This is the error: ERROR:tensorflow:Model diverged with loss = NaN. Full details below:
INFO:tensorflow:Using config: {'_model_dir': './logs/uniform_average_ensemble_baseline', '_tf_random_seed': 42, '_save_summary_steps': 5000, '_save_checkpoints_steps': 5000, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
graph_options {
rewrite_options {
meta_optimizer_iterations: ONE
}
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Not using Distribute Coordinator.
INFO:tensorflow:Running training and evaluation locally (non-distributed).
INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps 5000 or save_checkpoints_secs None.
INFO:tensorflow:Using config: {'_model_dir': './logs/uniform_average_ensemble_baseline', '_tf_random_seed': 42, '_save_summary_steps': 5000, '_save_checkpoints_steps': 5000, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
graph_options {
rewrite_options {
meta_optimizer_iterations: ONE
}
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
WARNING:tensorflow:Estimator's model_fn (<function Estimator._create_model_fn.<locals>._adanet_model_fn at 0x7f3d42ba6c20>) includes params argument, but params are not passed to Estimator.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 0...
INFO:tensorflow:Saving checkpoints for 0 into ./logs/uniform_average_ensemble_baseline/model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 0...
ERROR:tensorflow:Model diverged with loss = NaN.
I clear the logs every time I try something, since I know stale checkpoints can also cause this problem.
My dataset is similar to the example one, and nothing in the code assumes a specific dataset size.
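For reference, this is roughly how I clear the model directory between runs (just deleting the model_dir from the config above, nothing fancier):

import os
import shutil

# Model directory from the RunConfig above; removing it so a new run does not
# restore weights from a previous (possibly diverged) checkpoint.
MODEL_DIR = "./logs/uniform_average_ensemble_baseline"
if os.path.exists(MODEL_DIR):
    shutil.rmtree(MODEL_DIR)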
Reference dataset example:
x_train[0], y_train[0]
array([ 1.23247, 0. , 8.14 , 0. , 0.538 , 6.142 ,
91.7 , 3.9769 , 4. , 307. , 21. , 396.9 ,
18.72 ]),
15.2
My dataset example:
x_train[0], y_train[0]
array([1977. , 0. , 1. , 0.225 ,
0.11111111, 0.10169492, 0.22072072, 0.48296441,
0.00934278, 0.25761378, 0.29399057, 0.04283397,
0.3241088 , 0.20679821, 0.09214841, 2.31192802,
48.97102657, 0.14316477, 0.17729479, 0.22970639,
0.33924853]),
0.09940174401130854
Obviously the values change since it's my dataset, but the types are the same, and I don't understand why training works on the reference dataset while with mine I get this loss error. My dataset is normalized here, but I also tried without normalizing it and got the same error.
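For context, here is a minimal sketch of the kind of normalization I apply, plus a NaN/inf check on the arrays shown above (the min-max scaling is just an illustration of what I mean by "normalized"; x_train and y_train are NumPy arrays as in the notebook):

import numpy as np

# Sanity check: NaN or inf anywhere in the features or targets would
# explain an immediate NaN loss on the first step.
assert np.isfinite(x_train).all()
assert np.isfinite(y_train).all()

# Per-column min-max scaling (the "normalized" version of the data; I also
# tried feeding the raw values and got the same divergence).
x_min = x_train.min(axis=0)
x_max = x_train.max(axis=0)
x_train = (x_train - x_min) / (x_max - x_min + 1e-8)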
I tried changing the learning_rate, setting it as low as 0.00001, but it didn't change anything.
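This is roughly where the learning rate is set in my copy of the notebook code (paraphrased; as far as I remember the notebook uses RMSProp for the subnetworks, and the optimizer is then passed to the subnetwork generator exactly as in Adanet_objective):

import tensorflow as tf

# Lowest value I tried; the model still diverges right after checkpoint 0.
LEARNING_RATE = 0.00001

# Optimizer for the subnetworks; only the learning rate is changed from the
# tutorial value.
optimizer = tf.train.RMSPropOptimizer(learning_rate=LEARNING_RATE)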
If anyone has a clue, that would be helpful, thanks. I already looked into this, but the things I found, such as clearing the logs or changing the learning_rate, didn't work.
As I said, the model works fine on the first dataset, and I don't see anything in the Adanet_objective code that is specific to that dataset.