
I'm trying to run a training job for an object detection model on Google Cloud. It fails after logging the following from each ps-replica:

Check failed: DeviceNameUtils::ParseFullName(new_base, &parsed_name) 
{
 insertId:  "1am4lt7g2ytgyip"  
 jsonPayload: {
  created:  1532870862.316736   
  levelname:  "CRITICAL"   
  lineno:  27   
  message:  "Check failed: DeviceNameUtils::ParseFullName(new_base, &parsed_name) "   
  pathname:  "tensorflow/core/common_runtime/renamed_device.cc"   
 }
 labels: {
  compute.googleapis.com/resource_id:  "8188383009228980271"   
  compute.googleapis.com/resource_name:  "cmle-training-ps-1d73aafb3a-0-7bjnw"   
  compute.googleapis.com/zone:  "us-central1-a"   
  ml.googleapis.com/job_id:  "object_detection_07_29_2018_14_17_36"   
  ml.googleapis.com/job_id/log_area:  "root"   
  ml.googleapis.com/task_name:  "ps-replica-0"   
  ml.googleapis.com/trial_id:  ""   
 }
 logName:  "projects/object-detection-210310/logs/ps-replica-0"  
 receiveTimestamp:  "2018-07-29T13:27:48.515404065Z"  
 resource: {
  labels: {
   job_id:  "object_detection_07_29_2018_14_17_36"    
   project_id:  "object-detection-210310"    
   task_name:  "ps-replica-0"    
  }
  type:  "ml_job"   
 }
 severity:  "CRITICAL"  
 timestamp:  "2018-07-29T13:27:42.316735982Z"  
}

Followed by this:

ps-replica-1
Command '['python', '-m', u'object_detection.model_main', u'--model_dir=gs://aka_b1/train/', u'--pipeline_config_path=gs://aka_b1/data/ssd_mobilenet_v1_coco.config', '--job-dir', u'gs://aka_b1/train/']' returned non-zero exit status -6

{
 insertId:  "1d4klnfg3ihl2be"  
 jsonPayload: {
  created:  1532870863.971174   
  levelname:  "ERROR"   
  lineno:  879   
  message:  "Command '['python', '-m', u'object_detection.model_main', u'--model_dir=gs://aka_b1/train/', u'--pipeline_config_path=gs://aka_b1/data/ssd_mobilenet_v1_coco.config', '--job-dir', u'gs://aka_b1/train/']' returned non-zero exit status -6"   
  pathname:  "/runcloudml.py"   
 }
 labels: {
  compute.googleapis.com/resource_id:  "7345648913232166992"   
  compute.googleapis.com/resource_name:  "cmle-training-ps-1d73aafb3a-1-tjx4f"   
  compute.googleapis.com/zone:  "us-central1-a"   
  ml.googleapis.com/job_id:  "object_detection_07_29_2018_14_17_36"   
  ml.googleapis.com/job_id/log_area:  "root"   
  ml.googleapis.com/task_name:  "ps-replica-1"   
  ml.googleapis.com/trial_id:  ""   
 }
 logName:  "projects/object-detection-210310/logs/ps-replica-1"  
 receiveTimestamp:  "2018-07-29T13:27:47.591698250Z"  
 resource: {
  labels: {
   job_id:  "object_detection_07_29_2018_14_17_36"    
   project_id:  "object-detection-210310"    
   task_name:  "ps-replica-1"    
  }
  type:  "ml_job"   
 }
 severity:  "ERROR"  
 timestamp:  "2018-07-29T13:27:43.971174001Z"  
}

I tried reusing the tfrecords, config file, and checkpoint files from a successful training job I ran earlier, but the issue remains. The only difference is the bucket name, which I changed in the config file and in the training job submission command.

Please help.

  • Do you have a pointer to the code that you are running? – rhaertel80 Jul 30 '18 at 14:50
  • Hi, do you mean the command used to submit the training job? I used the same code given here: https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_on_cloud.md but used only 5 workers. Thank you very much. – Akalanka Weerasooriya Jul 30 '18 at 17:08
  • I got 404 error when trying to list your bucket gs://aka_b1/. Double check if the bucket name is correct? – yxshi Jul 30 '18 at 18:16
  • Hi, I deleted that bucket and started fresh. Now there is a bucket called ml_bucket_0 giving the same error. By the way, how are you listing it? I'm not sure that is possible because of the user authentication settings. Many thanks. – Akalanka Weerasooriya Jul 31 '18 at 01:28
  • Usually a non-existing bucket will return 404 error while it is 403 error if it is due to authentication problem. Your new bucket does return 403 now. – yxshi Jul 31 '18 at 17:20
  • The exit status -6 means that the training process received a `SIGABRT`, more explanation can be found in this [thread](https://stackoverflow.com/questions/3413166/when-does-a-process-get-sigabrt-signal-6/3413233). This may indicate that it got some internal error with TF library. Could you please tell if the issue is a deterministic one? – lwz1992 Aug 01 '18 at 00:14
  • Do you mind following up with us by emailing cloudml-feedback@google.com so we can help examine your logs? – rhaertel80 Aug 01 '18 at 14:36
  • @lwz1992 Thank you for the valuable information. I got the error about a dozen times over two days. Yes, I think it's a deterministic one. – Akalanka Weerasooriya Aug 02 '18 at 17:44

1 Answer


I think I found the problem. I had included TensorFlow in REQUIRED_PACKAGES in setup.py while trying to overcome a different issue I had faced earlier. After removing it, this error no longer appears. Many thanks to everybody.
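In case it helps anyone else, here is a rough sketch of what the trimmed-down setup.py looks like after the fix. The surrounding package list is only illustrative, not the exact one from my project; the key point is simply that tensorflow is no longer listed, since the Cloud ML Engine runtime image already provides it.

# setup.py (sketch) -- 'tensorflow' is intentionally NOT in REQUIRED_PACKAGES,
# because the Cloud ML Engine runtime already ships with TensorFlow and bundling
# it again is what triggered the crashes for me.
from setuptools import find_packages, setup

REQUIRED_PACKAGES = [
    # 'tensorflow',   # removed -- listing it here led to the SIGABRT (-6) failures
    'Pillow>=1.0',
    'Matplotlib>=2.1',
    'Cython>=0.28.1',
]

setup(
    name='object_detection',
    version='0.1',
    install_requires=REQUIRED_PACKAGES,
    packages=find_packages(),
    description='TensorFlow Object Detection API training package',
)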

  • In this case, would you consider accepting your own answer, in case someone else finds this useful? – rilla Oct 25 '18 at 12:22