The training session of a TensorFlow estimator, run with train_and_evaluate and a TrainSpec, occasionally gets killed.

I would like to resume the training session whenever the output "Killed" appears (generated by tf.logging.INFO), ideally by running the Python script again each time this happens. Is there a short way to accomplish this?

ubik
  • duplicate: https://stackoverflow.com/questions/11162406/open-and-write-data-to-text-file-using-bash-shell-scripting `python script.py > output.txt` – keithpjolley Sep 06 '18 at 14:10
  • Possible duplicate of [Open and write data to text file using bash/shell scripting](https://stackoverflow.com/questions/11162406/open-and-write-data-to-text-file-using-bash-shell-scripting) – Albin Paul Sep 06 '18 at 14:15
  • I don't see any duplication. – ubik Sep 06 '18 at 14:16

2 Answers

I don't have much experience with this, but as far as I know you can use a pipe in Linux, like this:

tail -f xxx.log | grep --line-buffered killed_information | while read msg ; do python train.py ; done

Note: replace killed_information with the actual error output of train.py, and xxx.log with the log file that train.py writes to.

r3dir3ct
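while true; do

    # -F: fixed string, -x: match the whole line, -q: no output.
    # "Killed" is capitalized as in the question; adjust it to your log.
    if grep -Fxq "Killed" logFile; then
        # Found: run your script again from here (script.py is a placeholder).
        # Clear or rotate logFile first, or the old match triggers a rerun on every check.
        python script.py
    fi

    # check every 5 minutes
    sleep 300

done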

(Code adapted from https://stackoverflow.com/a/4749368/10008499 )
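
Another option, sketched here under some assumptions, is to skip log parsing and let the shell restart the script whenever it exits with a non-zero status. A process terminated by SIGKILL (for example by the Linux out-of-memory killer) is reported by the shell with exit status 137, and the bare "Killed" line is usually printed by the shell itself, so it may never reach a log file. The names rerun_until_success, RETRY_DELAY and train.py are hypothetical and chosen only for illustration; the sketch also assumes the estimator resumes from the latest checkpoint in its model_dir when the script starts again.

```shell
# rerun_until_success MAX_TRIES CMD [ARGS...]
# Runs CMD repeatedly until it exits with status 0 or MAX_TRIES is reached.
# Returns 0 on success, otherwise the exit status of the last failed attempt.
rerun_until_success() {
    max_tries=$1
    shift
    attempt=1
    while true; do
        rc=0
        "$@" || rc=$?          # capture the exit status without aborting
        if [ "$rc" -eq 0 ]; then
            return 0
        fi
        # 137 = 128 + 9, i.e. the command was terminated by SIGKILL
        echo "attempt $attempt exited with status $rc" >&2
        if [ "$attempt" -ge "$max_tries" ]; then
            return "$rc"
        fi
        attempt=$((attempt + 1))
        sleep "${RETRY_DELAY:-5}"   # pause before restarting (seconds)
    done
}
```

It could then be called as `rerun_until_success 10 python train.py`. Capping the number of attempts keeps a script that fails for some other reason (a syntax error, say) from restarting forever.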

xxx374562