Maybe the answer is obvious to many, but I am quite surprised I could not find a question on this topic, which is a major problem for me. I would greatly appreciate a hint!
When I submit a job to a cluster managed by Slurm and the queue manager cancels it (e.g. for insufficient resources or time), Snakemake does not seem to receive any signal and hangs forever. On the other hand, when the job fails, Snakemake fails too, as expected. Is this behavior normal/wanted? How can I make Snakemake fail also when a job gets cancelled? I had this problem with Snakemake version 3.13.3, and it remained after updating to 5.3.0.
For example, in this case I launch a simple pipeline with insufficient resources for the rule pluto.
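The Snakefile is essentially the following (sketched here from the log output below, so the shell commands are shown after substitution and the mem values are the ones reported per rule):

rule pippo:
    output: "pippo.txt"
    resources: mem=1000
    shell: "seq 1000000 | shuf > pippo.txt"

rule pluto:
    input: "pippo.txt"
    output: "pluto.txt"
    resources: mem=1
    shell: "sort pippo.txt > pluto.txt"

The submission, passing the per-rule memory request to sbatch: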
$ snakemake -j1 -p --cluster 'sbatch --mem {resources.mem}' pluto.txt
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cluster nodes: 1
Unlimited resources: mem
Job counts:
count jobs
1 pippo
1 pluto
2
[Tue Sep 25 16:04:21 2018]
rule pippo:
output: pippo.txt
jobid: 1
resources: mem=1000
seq 1000000 | shuf > pippo.txt
Submitted job 1 with external jobid 'Submitted batch job 4776582'.
[Tue Sep 25 16:04:31 2018]
Finished job 1.
1 of 2 steps (50%) done
[Tue Sep 25 16:04:31 2018]
rule pluto:
input: pippo.txt
output: pluto.txt
jobid: 0
resources: mem=1
sort pippo.txt > pluto.txt
Submitted job 0 with external jobid 'Submitted batch job 4776583'.
Here it hangs. And here is the corresponding job accounting:
$ sacct -S2018-09-25-16:04 -o jobid,JobName,state,ReqMem,MaxRSS,Start,End,Elapsed
JobID         JobName     State       ReqMem  MaxRSS  Start                End                  Elapsed
------------  ----------  ----------  ------  ------  -------------------  -------------------  --------
4776582       snakejob.+  COMPLETED   1000Mn          2018-09-25T16:04:22  2018-09-25T16:04:27  00:00:05
4776582.bat+  batch       COMPLETED   1000Mn  1156K   2018-09-25T16:04:22  2018-09-25T16:04:27  00:00:05
4776583       snakejob.+  CANCELLED+  1Mn             2018-09-25T16:04:32  2018-09-25T16:04:32  00:00:00
4776583.bat+  batch       CANCELLED   1Mn     1156K   2018-09-25T16:04:32  2018-09-25T16:04:32  00:00:00
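A possible workaround might be a --cluster-status script that polls sacct, along the lines of the sketch below (untested, and it assumes sbatch is invoked with --parsable so that Snakemake stores the plain numeric job id, which is not the case in the run above). Still, I would like to understand whether the hanging behavior is expected in the first place.

#!/usr/bin/env python3
# status.py -- hypothetical status script for: snakemake --cluster-status ./status.py
# Snakemake calls it with the external job id as its only argument and expects
# one of "running", "success" or "failed" on stdout.
import subprocess
import sys

jobid = sys.argv[1]

# Query Slurm accounting for the job state (assumes sacct is available on this node).
state = subprocess.check_output(
    ["sacct", "-j", jobid, "--format=State", "--noheader", "--parsable2"]
).decode().strip().split("\n")[0]

if state.startswith(("CANCELLED", "FAILED", "TIMEOUT", "NODE_FAIL", "OUT_OF_MEMORY")):
    print("failed")       # report cancelled/failed jobs back to Snakemake
elif state == "COMPLETED":
    print("success")
else:
    print("running")      # PENDING, RUNNING, COMPLETING, or no accounting record yet

It would then be invoked as:

$ snakemake -j1 -p --cluster 'sbatch --parsable --mem {resources.mem}' --cluster-status ./status.py pluto.txt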