I get a loss error when training an AdaNet network following the tutorial notebook (Adanet_objective), with only the dataset switched. This is the error: ERROR:tensorflow:Model diverged with loss = NaN. Full details below:
INFO:tensorflow:Using config: {'_model_dir': './logs/uniform_average_ensemble_baseline', '_tf_random_seed': 42, '_save_summary_steps': 5000, '_save_checkpoints_steps': 5000, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
graph_options {
rewrite_options {
meta_optimizer_iterations: ONE
}
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Not using Distribute Coordinator.
INFO:tensorflow:Running training and evaluation locally (non-distributed).
INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps 5000 or save_checkpoints_secs None.
INFO:tensorflow:Using config: {'_model_dir': './logs/uniform_average_ensemble_baseline', '_tf_random_seed': 42, '_save_summary_steps': 5000, '_save_checkpoints_steps': 5000, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
graph_options {
rewrite_options {
meta_optimizer_iterations: ONE
}
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
WARNING:tensorflow:Estimator's model_fn (<function Estimator._create_model_fn.<locals>._adanet_model_fn at 0x7f3d42ba6c20>) includes params argument, but params are not passed to Estimator.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 0...
INFO:tensorflow:Saving checkpoints for 0 into ./logs/uniform_average_ensemble_baseline/model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 0...
ERROR:tensorflow:Model diverged with loss = NaN.
I clear the logs every time I try something, since I know stale checkpoints can also cause this problem.
My dataset is similar to the example one, and nothing in the code assumes a specific dataset size.
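For reference, this is roughly how I clear the model directory between runs (just deleting the model_dir from the config above, nothing fancier):

import os
import shutil

# Model directory from the RunConfig above; removing it so a new run does not
# restore weights from a previous (possibly diverged) checkpoint.
MODEL_DIR = "./logs/uniform_average_ensemble_baseline"
if os.path.exists(MODEL_DIR):
    shutil.rmtree(MODEL_DIR)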
Reference dataset example:
x_train[0], y_train[0]
array([ 1.23247, 0. , 8.14 , 0. , 0.538 , 6.142 ,
91.7 , 3.9769 , 4. , 307. , 21. , 396.9 ,
18.72 ]),
15.2
My dataset example:
x_train[0], y_train[0]
array([1977. , 0. , 1. , 0.225 ,
0.11111111, 0.10169492, 0.22072072, 0.48296441,
0.00934278, 0.25761378, 0.29399057, 0.04283397,
0.3241088 , 0.20679821, 0.09214841, 2.31192802,
48.97102657, 0.14316477, 0.17729479, 0.22970639,
0.33924853]),
0.09940174401130854
Obviously the values change since it's my dataset, but the types are the same, and I don't understand why training works on the reference dataset while with mine I get this loss error. My dataset is normalized here, but I also tried without normalizing it and got the same error.
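For context, here is a minimal sketch of the kind of normalization I apply, plus a NaN/inf check on the arrays shown above (the min-max scaling is just an illustration of what I mean by "normalized"; x_train and y_train are NumPy arrays as in the notebook):

import numpy as np

# Sanity check: NaN or inf anywhere in the features or targets would
# explain an immediate NaN loss on the first step.
assert np.isfinite(x_train).all()
assert np.isfinite(y_train).all()

# Per-column min-max scaling (the "normalized" version of the data; I also
# tried feeding the raw values and got the same divergence).
x_min = x_train.min(axis=0)
x_max = x_train.max(axis=0)
x_train = (x_train - x_min) / (x_max - x_min + 1e-8)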
I tried changing the learning_rate, setting it as low as 0.00001, but it didn't change anything.
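This is roughly where the learning rate is set in my copy of the notebook code (paraphrased; as far as I remember the notebook uses RMSProp for the subnetworks, and the optimizer is then passed to the subnetwork generator exactly as in Adanet_objective):

import tensorflow as tf

# Lowest value I tried; the model still diverges right after checkpoint 0.
LEARNING_RATE = 0.00001

# Optimizer for the subnetworks; only the learning rate is changed from the
# tutorial value.
optimizer = tf.train.RMSPropOptimizer(learning_rate=LEARNING_RATE)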
If anyone has a clue, that would be helpful, thanks. I already looked into this, but the things I found, such as clearing the logs or changing the learning_rate, didn't work.
As I said, the model works fine on the first dataset, and I don't see anything in the Adanet_objective code that is specific to that dataset.