
I'm putting together a snakemake slurm workflow and am having trouble with my working directory becoming cluttered with slurm output files. I would like my workflow to, at a minimum, direct these files to a 'slurm' directory inside my working directory. I currently have my workflow set up as follows:

config.yaml:

reads:
    1:
    2:
samples:
    15FL1-2: /datasets/work/AF_CROWN_RUST_WORK/2020-02-28_GWAS/data/15FL1-2
    15Fl1-4: /datasets/work/AF_CROWN_RUST_WORK/2020-02-28_GWAS/data/15Fl1-4

cluster.yaml:

localrules: all

__default__:
    time: 0:5:0
    mem: 1G
    output: _{rule}_{wildcards.sample}_%A.slurm

fastqc_raw:
    job_name: sm_fastqc_raw
    time: 0:10:0
    mem: 1G
    output: slurm/_{rule}_{wildcards.sample}_{wildcards.read}_%A.slurm

Snakefile:

configfile: "config.yaml"
workdir: config["work"]

rule all:
    input:
        expand("analysis/fastqc_raw/{sample}_R{read}_fastqc.html", sample=config["samples"], read=config["reads"])

rule clean:
    shell:
        "rm -rf analysis logs"

rule fastqc_raw:
    input:
        'data/{sample}_R{read}.fastq.gz'
    output:
        'analysis/fastqc_raw/{sample}_R{read}_fastqc.html'
    log:
        err = 'logs/fastqc_raw/{sample}_R{read}.err',
        out = 'logs/fastqc_raw/{sample}_R{read}.out'
    shell:
        """
        fastqc {input} --noextract --outdir 'analysis/fastqc_raw' 2> {log.err} > {log.out}
        """

I then call with:

snakemake --jobs 4 --cluster-config cluster.yaml --cluster "sbatch --mem={cluster.mem} --time={cluster.time} --job-name={cluster.job_name} --output={cluster.output}"

This does not work, because the slurm directory does not already exist and sbatch will not create it. I don't want to make this directory manually before every run, as that will not scale. Things I've tried, after reading every related question, are:

1) simply capturing all the output via the log within the rule, and setting cluster.output='/dev/null'. This doesn't work: the info in the slurm output isn't captured, as it's not output of the rule exactly; it's info about the job.

2) forcing the directory to be created by adding a dummy log:

    log:
        err = 'logs/fastqc_raw/{sample}_R{read}.err',
        out = 'logs/fastqc_raw/{sample}_R{read}.out',
        jobOut = 'slurm/out.err'

I think this doesn't work because sbatch tries to resolve the slurm folder at submission time, before the rule runs and the directory can be created.

3) allowing the files to be made in the working directory, and adding bash code at the end of the rule to move the files into a slurm directory. I believe this doesn't work because the move is attempted before the job has finished writing to the slurm output file.

Any further ideas or tricks?

Ensa

2 Answers


You should be able to suppress these outputs by calling sbatch with --output=/dev/null --error=/dev/null. Something like this:

snakemake ... --cluster "sbatch --output=/dev/null --error=/dev/null ..."

If you want the files to go to a directory of your choosing, you can of course change the call to reflect that:

snakemake ... --cluster "sbatch --output=/home/Ensa/slurmout/%j.out --error=/home/Ensa/slurmout/%j.out ..."
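
Note that sbatch will not create that output directory for you, so it has to exist before any job is submitted. A minimal sketch, reusing the example path above:

mkdir -p /home/Ensa/slurmout
snakemake ... --cluster "sbatch --output=/home/Ensa/slurmout/%j.out --error=/home/Ensa/slurmout/%j.out ..."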
Maarten-vd-Sande
  • yes, I can suppress the outputs with `--output=/dev/null` etc., but what I am looking for is a way to specify the output directory without having to hard-code it in the command-line snakemake call (I would also need to explicitly make this directory first, as sbatch won't make a directory that doesn't exist). This seriously limits the portability of the workflow. I really need a way of making the output directory relative to the work directory, so that users need only specify the work directory in the config file; that would make this a portable workflow. – Ensa May 18 '20 at 11:27
  • You can always put the logic in a [profile](https://snakemake.readthedocs.io/en/stable/executing/cli.html#profiles). – Maarten-vd-Sande May 18 '20 at 11:39
  • P.S. by hard-coding the slurm output in your rule's output or log you are making your workflow the opposite of portable – Maarten-vd-Sande May 18 '20 at 11:42
  • thanks, I'll look at profiles and see if I can get them to do what I need. I respectfully disagree that what I want makes it the opposite of portable; as long as you can specify the slurm output directory relative to the working directory, the user need only specify the working directory once, in some configuration file, and it is both portable and reproducible. – Ensa May 18 '20 at 12:19

So this is how I solved the issue (there's probably a better way, and if so, I hope someone will correct me). Personally I will go to great lengths to avoid hard-coding anything. I use a snakemake profile and an sbatch script.

First, I make a snakemake profile that contains a line like this:

cluster: "sbatch --output=slurm_out/slurm-%j.out --mem={resources.mem_mb} -c {resources.cpus} -J {rule}_{wildcards} --mail-type=FAIL --mail-user=me@me.edu"

You can see the --output parameter redirects the slurm output files to a subdirectory called slurm_out inside the current working directory. But AFAIK, slurm can't create that directory if it doesn't exist. So...

Next I make a small wrapper script whose only job is to create the subdirectory and then call sbatch on the script that submits the workflow. This wrapper looks like:

#!/bin/bash

mkdir -p ./slurm_out
sbatch snake_submit.sbatch

And finally, the snake_submit.sbatch looks like:

#!/bin/bash

ml snakemake

snakemake --profile <myprofile>

In this case both the wrapper and the sbatch script that it calls will have their slurm out files in the current working directory. I prefer it that way because it's easier for me to locate them. But I think you could easily redirect them by adding another #SBATCH --output parameter to the snake_submit.sbatch script (but not to the wrapper, then it's turtles all the way down, you know?), for example as sketched below.
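
A minimal sketch of that variant (the output path is an assumption, and the wrapper's mkdir -p must have run first so the directory exists):

#!/bin/bash
#SBATCH --output=slurm_out/submit-%j.out

ml snakemake

snakemake --profile <myprofile>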

I hope that makes sense.

rachelette