This is kind of a general best-practice question. I have a Python script that iterates over some arguments and calls another script with those arguments (it's basically a grid search for some simple deep learning models). This works fine on my local machine, but now I need the resources of my university's compute cluster, which uses SLURM.

I have some logic in the Python script that I think would be difficult to implement, and maybe out of place, in a shell script. I also can't just throw all the jobs at the cluster at once, because I want to skip certain parameter combinations depending on the outcome (loss) of others. I'd now like to submit the SLURM jobs directly from my Python script and still handle the more complex logic there. My question is what the best way to implement something like this is, and whether running a Python script on the login node would be bad manners. Should I use the subprocess module? Snakemake? Joblib? Or are there other, more elegant ways?
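For context, here is a rough sketch of what I have in mind with `subprocess` (the batch script name, the parameters and the loss-file naming scheme are just placeholders for illustration):

```python
import subprocess


def run_job(lr, batch_size):
    """Submit one training run and block until it finishes.

    "train.sbatch" and the loss-file convention are placeholders; the real
    batch script would receive whatever arguments the grid search needs.
    """
    # --wait makes sbatch block until the submitted job has completed.
    subprocess.run(
        ["sbatch", "--wait", "train.sbatch", str(lr), str(batch_size)],
        check=True,
    )
    # Assume the job writes its final loss to a file named after its parameters.
    with open(f"loss_lr{lr}_bs{batch_size}.txt") as f:
        return float(f.read())


for lr in [1e-2, 1e-3, 1e-4]:
    # Run a cheap configuration first and skip the rest if the loss is bad.
    if run_job(lr, batch_size=32) > 1.0:
        continue
    for batch_size in (64, 128):
        run_job(lr, batch_size)
```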
- You might also be interested in [slurmpy](https://github.com/brentp/slurmpy). – Manavalan Gajapathy May 09 '20 at 23:17
- I had a look at slurmpy; it seems to me that `s.run(..., depends_on=[job_id])` only lets the job start after another has finished, but I want to skip some jobs entirely depending on the result of others. – Dunrar May 15 '20 at 09:51
- `depends_on` can be controlled as needed, and it is `None` by default. If for some reason you want to wait for a job to finish, you may include [sbatch's `--wait`](https://stackoverflow.com/a/49509245/3998252) option. It's not clear to me whether you would like your jobs to be dependent, but slurmpy should be able to handle them either way. – Manavalan Gajapathy May 15 '20 at 13:38
- I tried that with `s.run(_cmd="sbatch --wait")`, but that did not work. How would I add a flag to the sbatch command with slurmpy? – Dunrar May 18 '20 at 10:47
- `s = Slurm("job-name", {"W": "", "partition": "my-partition"})`, where `W` is `--wait`, should work. That dictionary is where all Slurm resources and flags are defined. – Manavalan Gajapathy May 18 '20 at 14:45
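Putting those comments together, a minimal slurmpy sketch might look like the following (the partition name and the command are placeholders, and `s.run` returning the Slurm job id is assumed from the `depends_on=[job_id]` usage above):

```python
from slurmpy import Slurm

# The dictionary holds the Slurm flags: "W" (i.e. -W / --wait) makes sbatch
# block until the job finishes; the partition name is a placeholder.
s = Slurm("grid-search", {"W": "", "partition": "my-partition"})

# Submits the command as a batch job and returns the job id, which could be
# passed to depends_on=[...] when submitting dependent jobs.
job_id = s.run("python train.py --lr 0.01")
```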
1 Answer
Snakemake and Joblib are valid options; they will handle the communication with the Slurm cluster. Another possibility is FireWorks. This one is a bit more tedious to get running; it needs a MongoDB database and has a vocabulary that takes some getting used to, but in the end it can do very complex things. You can, for instance, create a workflow that submits jobs to multiple clusters, runs other jobs depending on the output of previous ones, and automatically re-submits the ones that failed, with other parameters if needed.
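As a rough illustration (not a drop-in solution), a minimal FireWorks workflow with two dependent jobs could look like this; the MongoDB connection settings and the commands are placeholders:

```python
from fireworks import Firework, LaunchPad, ScriptTask, Workflow

# LaunchPad is the connection to the MongoDB database backing FireWorks
# (localhost with default settings here; adjust to your setup).
launchpad = LaunchPad()

# Two placeholder training runs.
fw_coarse = Firework(ScriptTask.from_str("python train.py --lr 0.01"), name="coarse")
fw_fine = Firework(ScriptTask.from_str("python train.py --lr 0.001"), name="fine")

# The second argument maps each Firework to its children, so "fine" only
# runs after "coarse" has completed.
wf = Workflow([fw_coarse, fw_fine], {fw_coarse: [fw_fine]})
launchpad.add_wf(wf)

# The jobs are then pulled from the database and submitted to Slurm with
# FireWorks' queue launcher, e.g. `qlaunch rapidfire`.
```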

damienfrancois
- Okay, thank you very much. FireWorks sounds like a bit of overkill for my current project, but I'll have a look at it in the future. – Dunrar May 14 '20 at 11:44