
I'm new to using cluster computers to run experiments. I have a Python script that should be printing information regularly, but when my job exceeds its walltime I get no output at all, only the notification that the job has been killed.

I've tried flushing the output buffer regularly, to no avail, and was wondering if there is something more basic that I'm missing.
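For reference, the flushing I've tried looks something like this (a simplified sketch; do_work is a stand-in for the real computation):

    import sys
    import time

    def do_work(step):
        # stand-in for the real per-step computation
        return step * step

    for step in range(1000):
        print(f"step {step}: {do_work(step)}", flush=True)  # flush on every print
        sys.stdout.flush()  # explicit flush as well, for good measure
        time.sleep(1)

Even with this, nothing shows up in the output once the job is killed for exceeding its walltime.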

Thanks!

– squiggles
  • Similar questions come up often on this site, e.g. https://stackoverflow.com/q/20233650/1328439 or https://stackoverflow.com/q/46759079/1328439. Maybe it deserves an extensive write-up covering the different possible solutions. – Dima Chubarov Mar 05 '18 at 06:30

1 Answer


I'm guessing your output is being lost to a job cleanup script in the scheduler's epilogue. You may want to ask the admins about it. You may also want to try a different approach.

If you redirect your output to a file on a shared filesystem, you should be able to avoid the data loss. This assumes you have a shared filesystem to work with and aren't required to stage all of your data in and out.

If you reuse your submission script, you can avoid clobbering the output of other jobs by including the $PBS_JOBID environment variable in the output filename:

    python -u script.py > "$PBS_JOBID.out" 2>&1
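To make that concrete, here is a minimal submission script sketch. It assumes a PBS/Torque scheduler, that script.py sits in the directory you submit from, and that this directory is on the shared filesystem; the job name and resource requests are only illustrative.

    #!/bin/bash
    #PBS -N my_experiment
    #PBS -l walltime=01:00:00
    #PBS -l nodes=1:ppn=1

    # Run from the directory the job was submitted from, which
    # should live on the shared filesystem.
    cd "$PBS_O_WORKDIR"

    # -u disables Python's output buffering, so each line lands in
    # the file as soon as it is printed, even if the job is later
    # killed for exceeding its walltime.
    python -u script.py > "$PBS_JOBID.out" 2>&1

Because the file is written incrementally on the shared filesystem rather than spooled on the compute node and copied back at job completion, whatever was printed before the kill should survive.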

  • I'm on mobile, so check the qsub man page for a list of job environment variables.
– chuck