
I tried running a word-RNN model from GitHub on Google Cloud ML. After submitting the job, I am getting errors in the log file.

This is what I submitted for training:

gcloud ml-engine jobs submit training word_pred_7 \
    --package-path trainer \
    --module-name trainer.train \
    --runtime-version 1.0 \
    --job-dir $JOB_DIR \
    --region $REGION \
    -- \
    --data_dir gs://model-development/arpit/word-rnn-tensorflow-master/data/tinyshakespeare/real1.txt \
    --save_dir gs://model-development/arpit/word-rnn-tensorflow-master/save

This is what I get in the log file.

(screenshot of the error log omitted)

asked by Appu, edited by mcarton
  • I am running into this same issue. What did you change to solve it? I don't really understand your accepted answer. – AntsaR Jul 13 '19 at 10:56

4 Answers

4

Finally, after submitting 77 jobs to Cloud ML, I was able to run the job. The problem was not with the arguments passed when submitting the job; it was the IO errors raised for the .npy files, which have to be stored using file_io.FileIO and read as StringIO.

These IO errors are not mentioned anywhere in the documentation, and you should check for them whenever you see an error saying no such file or directory.
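A minimal sketch of the pattern this answer describes (the helper name `read_into_buffer` is hypothetical; the point is that `np.load` cannot open `gs://` paths directly, so the bytes must be read through a GCS-aware open such as `file_io.FileIO` and then wrapped in an in-memory buffer):

```python
import io

def read_into_buffer(path, open_fn=open):
    """Read a file's bytes into an in-memory buffer.

    On Cloud ML, pass file_io.FileIO as open_fn so that gs:// paths
    work; np.load() can then read the .npy bytes from the returned
    buffer instead of trying to hit the local filesystem.
    """
    with open_fn(path, 'rb') as f:
        return io.BytesIO(f.read())

# On Cloud ML the call would look roughly like:
#   from tensorflow.python.lib.io import file_io
#   import numpy as np
#   arr = np.load(read_into_buffer('gs://bucket/data.npy', file_io.FileIO))
```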

answered by Appu
  • Apologies for the frustrating attempts, but glad to hear you've found a resolution. All file operations done while running in the cloud must use file_io.FileIO. Related: http://stackoverflow.com/a/41637685/1399222 – rhaertel80 Apr 03 '17 at 16:34
  • follow-up question: were you seeing the same error message noted in the screenshot in your original question the whole time and this fixed it, or was the error message different? – rhaertel80 Apr 03 '17 at 16:36
  • I was seeing the same error all the time in the logs – Appu Apr 04 '17 at 06:08
  • We'll investigate why you're not seeing a more informative error message. – rhaertel80 Apr 04 '17 at 14:44
3

You will need to modify your train.py to accept a "--job-dir" command-line argument.

When you specify --job-dir in gcloud, the service passes it through to your program as an argument, so your argparser (or tf.flags, depending on which you're using) will need to be modified accordingly.
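A minimal argparse sketch of what this means (the other argument names mirror the question's flags and are otherwise illustrative):

```python
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument('--data_dir', type=str, default=None,
                        help='data directory containing the training text')
    parser.add_argument('--save_dir', type=str, default=None,
                        help='directory to store checkpointed models')
    # The service appends --job-dir to the user arguments, so the
    # parser must accept it even if the script never uses it.
    parser.add_argument('--job-dir', type=str, default=None,
                        help='GCS location passed through by Cloud ML')
    return parser.parse_args(argv)
```

Note that argparse normalizes the dash, so the value is read back as `args.job_dir`.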

answered by rhaertel80
  • Even after making the changes in train.py, I am still getting the same error. – Appu Mar 29 '17 at 07:35
  • def main(): parser = argparse.ArgumentParser() parser.add_argument('--data_dir', type=str, default=None, help='data directory containing real1.txt') parser.add_argument('--log_dir', type=str, default=None, help='directory containing tensorboard logs') parser.add_argument('--job_dir', type=str, default=None, help='directory to store checkpointed models') – Appu Mar 29 '17 at 07:38
  • Here is the github link to the code that I am trying to run. https://github.com/hunkim/word-rnn-tensorflow – Appu Mar 29 '17 at 07:41
  • Two notes: (1) I looked at your github code, and I don't see '--job_dir' anywhere. Assuming it was present at one time (2) it should be '--job-dir'. (note the dash vs. underscore difference). – rhaertel80 Apr 04 '17 at 14:53
  • While submitting a job on Google Cloud ML, I am getting an error where the main training Python file, i.e. task.py, is not able to import a function from a Python script in the util folder. Generally, we write: from util.xyz import abc; this is not getting called in the main task.py – Appu Apr 12 '17 at 09:39
  • @Appu I'm sorry I missed your last question. CloudML pip installs your code, so you'll need to alter your import statement accordingly, such as `from mypackage.util.xyz import abc`. Relative imports may work as well (untested) – rhaertel80 Aug 07 '17 at 14:03
3

I had the same issue, and it seems like Google Cloud passes that --job-dir through to your own script anyway (even if you place it before the -- on the gcloud command line).

The way I fixed it follows the official gcloud census example (lines 153 and 183):

parser.add_argument(
  '--job-dir',
  help='GCS location to write checkpoints and export models',
  required=True
)
args = parser.parse_args()
arguments = args.__dict__
job_dir = arguments.pop('job_dir')

train_model(**arguments)

Basically, this lets your Python main program accept the --job-dir parameter even if you are not using it.

answered by Fuyang Liu
0

In addition to adding --job-dir as an accepted argument, I think you should also move the flag after the --.

From the getting started guide:

Run the local train command using the --distributed option. Be sure to place the flag above the -- that separates the user arguments from the command-line arguments

where, in that case, --distributed was a command-line argument.

EDIT:

--job-dir IS NOT a user argument, so it is correct to place it before the --.

answered by EffePi
  • I ran locally without the distributed flag and it's running. But when I submit a job on Cloud ML, it again shows the same error – Appu Mar 31 '17 at 10:15
  • What is the command-line you are running now? The error might be caused by flags that are required by the job run in the cloud but not required locally – EffePi Mar 31 '17 at 10:35
  • on local client: gcloud ml-engine local train \ --module-name trainer.task \ --package-path trainer/ \ --job -- \ --data_dir /home/arpit_agrawal/char-rnn/char-rnn/data/tinyshakespeare/\ --save_dir /home/arpit_agrawal/char-rnn/char-rnn/ \ --log_dir /home/arpit_agrawal/char-rnn/char-rnn/ – Appu Mar 31 '17 at 10:58
  • job run submit : gcloud ml-engine jobs submit training final_6 \ --runtime-version 1.0 \ --job-dir $GCS_JOB_DIR \ --module-name trainer.task \ --package-path trainer/ \ --region us-central1 \ -- \--data_dir gs://model-development/arpit/char-rnn/data/tinyshakespeare/ \ --save_dir gs://model-development/arpit/char-rnn/save/ – Appu Mar 31 '17 at 11:01
  • $GCS_JOB_DIR is a path to Google Cloud Storage – Appu Mar 31 '17 at 11:02
  • The --job-dir flag is not an argument of gcloud ml-engine but an argument of task.py. Therefore you should move it after the -- which separates user arguments from command line arguments. In your command, the isolated -- is between "--region us-central1" and " \--data_dir " – EffePi Mar 31 '17 at 11:08
  • It gives this error if i move --job-dir after -- flag into task.py arguments: ERROR: (gcloud.ml-engine.jobs.submit.training) If local packages are provided, the `--staging-bucket` or `--job-dir` flag must be given. – Appu Mar 31 '17 at 11:12
  • I apologize, I got it wrong. --job-dir IS actually an argument for gcloud.ml-engine... – EffePi Mar 31 '17 at 12:18
  • codecs.open(input_file, "r", encoding=self.encoding) as f: File "/usr/lib/python2.7/codecs.py", line 878, in open file = __builtin__.open(filename, mode, buffering) IOError: [Errno 2] No such file or directory: '/home/arpit_agrawal/char-rnn/char-rnn/data/tinyshakespeare/input.txt' – Appu Mar 31 '17 at 12:46