Maybe the answer is obvious to many, but I am quite surprised I could not find a question on this topic, which is a major problem for me. I would greatly appreciate a hint!
When I submit a job to a cluster managed by Slurm and the queue manager cancels it (e.g. for insufficient resources or time), Snakemake does not seem to receive any signal and hangs forever. On the other hand, when the job fails, Snakemake fails too, as expected. Is this behavior normal/wanted? How can I make Snakemake fail also when a job gets cancelled? I had this problem with Snakemake version 3.13.3, and it remained after updating to 5.3.0.
For example, in this case I launch a simple pipeline with insufficient resources for the rule pluto.
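The Snakefile is essentially the following (sketched here from the log output below, so the shell commands are shown after substitution and the mem values are the ones reported per rule):

rule pippo:
    output: "pippo.txt"
    resources: mem=1000
    shell: "seq 1000000 | shuf > pippo.txt"

rule pluto:
    input: "pippo.txt"
    output: "pluto.txt"
    resources: mem=1
    shell: "sort pippo.txt > pluto.txt"

The submission, passing the per-rule memory request to sbatch: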
$ snakemake -j1 -p --cluster 'sbatch --mem {resources.mem}' pluto.txt
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cluster nodes: 1
Unlimited resources: mem
Job counts:
count jobs
1 pippo
1 pluto
2
[Tue Sep 25 16:04:21 2018]
rule pippo:
output: pippo.txt
jobid: 1
resources: mem=1000
seq 1000000 | shuf > pippo.txt
Submitted job 1 with external jobid 'Submitted batch job 4776582'.
[Tue Sep 25 16:04:31 2018]
Finished job 1.
1 of 2 steps (50%) done
[Tue Sep 25 16:04:31 2018]
rule pluto:
input: pippo.txt
output: pluto.txt
jobid: 0
resources: mem=1
sort pippo.txt > pluto.txt
Submitted job 0 with external jobid 'Submitted batch job 4776583'.
Here it hangs. And here is the corresponding job accounting:
$ sacct -S2018-09-25-16:04 -o jobid,JobName,state,ReqMem,MaxRSS,Start,End,Elapsed
JobID         JobName     State       ReqMem  MaxRSS  Start                End                  Elapsed
------------  ----------  ----------  ------  ------  -------------------  -------------------  --------
4776582       snakejob.+  COMPLETED   1000Mn          2018-09-25T16:04:22  2018-09-25T16:04:27  00:00:05
4776582.bat+  batch       COMPLETED   1000Mn  1156K   2018-09-25T16:04:22  2018-09-25T16:04:27  00:00:05
4776583       snakejob.+  CANCELLED+  1Mn             2018-09-25T16:04:32  2018-09-25T16:04:32  00:00:00
4776583.bat+  batch       CANCELLED   1Mn     1156K   2018-09-25T16:04:32  2018-09-25T16:04:32  00:00:00
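A possible workaround might be a --cluster-status script that polls sacct, along the lines of the sketch below (untested, and it assumes sbatch is invoked with --parsable so that Snakemake stores the plain numeric job id, which is not the case in the run above). Still, I would like to understand whether the hanging behavior is expected in the first place.

#!/usr/bin/env python3
# status.py -- hypothetical status script for: snakemake --cluster-status ./status.py
# Snakemake calls it with the external job id as its only argument and expects
# one of "running", "success" or "failed" on stdout.
import subprocess
import sys

jobid = sys.argv[1]

# Query Slurm accounting for the job state (assumes sacct is available on this node).
state = subprocess.check_output(
    ["sacct", "-j", jobid, "--format=State", "--noheader", "--parsable2"]
).decode().strip().split("\n")[0]

if state.startswith(("CANCELLED", "FAILED", "TIMEOUT", "NODE_FAIL", "OUT_OF_MEMORY")):
    print("failed")       # report cancelled/failed jobs back to Snakemake
elif state == "COMPLETED":
    print("success")
else:
    print("running")      # PENDING, RUNNING, COMPLETING, or no accounting record yet

It would then be invoked as:

$ snakemake -j1 -p --cluster 'sbatch --parsable --mem {resources.mem}' --cluster-status ./status.py pluto.txt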