4

I am creating my first snakemake file, and I got to the point where I need to perform a simple string operation on the value of my output, so that my shell command works as expected:

rule sketch:
  input:
    'out/genomes.txt'
  output:
    'out/genomes.msh'
  shell:
    'mash sketch -l {input} -k 31 -s 100000 -o {output}'

I need to apply the split function to {output} so that only the name of the file up to the extension is used. I couldn't find anything in the docs or in related questions.

mgalardini

5 Answers

4

You could use the params field:

rule sketch:
  input:
    'out/genomes.txt'
  output:
    'out/genomes.msh'
  params:
    dir = 'out/genomes'
  shell:
    'mash sketch -l {input} -k 31 -s 100000 -o {params.dir}'
jma1991
4

Best is to use params:

rule sketch:
    input:
        'out/genomes.txt'
    output:
        'out/genomes.msh'
    params:
        prefix=lambda wildcards, output: os.path.splitext(output[0])[0]
    shell:
        'mash sketch -l {input} -k 31 -s 100000 -o {params.prefix}'

It is always preferable to use params instead of using the run directive, because the run directive cannot be combined with conda environments.
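As a minimal sketch of what the `prefix` lambda above evaluates to, the following plain-Python snippet applies `os.path.splitext` to the output path outside of Snakemake. The helper name `strip_extension` is hypothetical and used only for illustration.

```python
import os.path

def strip_extension(path):
    """Return the path without its final extension (hypothetical helper)."""
    return os.path.splitext(path)[0]

# Same expression the lambda applies to output[0]
prefix = strip_extension('out/genomes.msh')
print(prefix)
```

Note that `os.path.splitext` removes only the last extension, so a path such as `out/genomes.tar.gz` would become `out/genomes.tar`.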

Johannes Köster
3

Alternative solution using wildcards:

rule all:
  input: 'out/genomes.msh'

rule sketch:
  input:
    '{file}.txt'
  output:
    '{file}.msh'
  shell:
    'mash sketch -l {input} -k 31 -s 100000 -o {wildcards.file}'

Untested, but I think this should work.

The advantage over the params solution is that it generalizes better.
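To illustrate why `{wildcards.file}` ends up holding the path without its extension, here is a simplified model in plain Python of how a pattern like `'{file}.msh'` could be matched against the requested target. This is only a sketch: the function `match_wildcards` is hypothetical, and Snakemake's real matching logic is more involved.

```python
import re

def match_wildcards(pattern, target):
    """Simplified model: turn '{name}' placeholders into named regex groups."""
    regex = ''
    last = 0
    for m in re.finditer(r'\{(\w+)\}', pattern):
        # Escape the literal text between placeholders
        regex += re.escape(pattern[last:m.start()])
        # Each placeholder matches one or more characters
        regex += '(?P<{}>.+)'.format(m.group(1))
        last = m.end()
    regex += re.escape(pattern[last:]) + '$'
    found = re.match(regex, target)
    return found.groupdict() if found else None

# The target requested by "rule all" is matched against the output pattern
wildcards = match_wildcards('{file}.msh', 'out/genomes.msh')
print(wildcards)

# The same wildcard value is then substituted into the input pattern
input_path = '{file}.txt'.format(**wildcards)
print(input_path)
```

Under this model the wildcard `file` captures `out/genomes`, which is the prefix passed to `-o` in the shell command.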

Michael Schubert
2

Avoid duplicating text. Don't use params unless you convert your inputs/outputs to wildcards + extensions. Otherwise you're left with a rule that is hard to maintain.

input:
    "{pathDIR}/{genome}.txt"
output:
    "{pathDIR}/{genome}.msh"
params:
    dir='{pathDIR}/{genome}'

Otherwise, use Python's slice notation.

I couldn't seem to get slice notation to work in the params using the output wildcard. Here it is in the run directive.

from subprocess import call

rule sketch:
  input:
    'out/genomes.txt'
  output:
    'out/genomes.msh'
  run:
    callString="mash sketch -l " + str(input) + " -k 31 -s 100000 -o " + str(output)[:-4]
    print(callString)
    call(callString, shell=True)
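One caveat with the `[:-4]` slice is that it assumes the extension is exactly four characters long, including the dot. The sketch below compares it with `pathlib`'s `with_suffix('')` on two paths; the second path, `out/genomes.sketch`, is a made-up example used only to show the difference.

```python
from pathlib import PurePosixPath

results = {}
for path in ['out/genomes.msh', 'out/genomes.sketch']:
    sliced = path[:-4]                                 # fixed-length slice
    stripped = str(PurePosixPath(path).with_suffix(''))  # drops the last suffix
    results[path] = (sliced, stripped)
    print(path, sliced, stripped)
```

For `.msh` both approaches give the same prefix; for a longer extension the slice leaves part of the extension behind.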

Python underlies Snakemake. I prefer the "run" directive over the "shell" directive because I find it really unlocks a lot of that beautiful Python functionality. Accessing params and other values works slightly differently than with the "shell" directive.

E.g.

callString=config["mpileup_samtoolsProg"] + ' view -bh -F ' + str(config["bitFlag"]) + ' ' + str(input.inputBAM) + ' ' + wildcards.chrB2M[1:] 

A bit of a snippet of J.K. using the run directive.

All of the rules in my modules pretty much use the run directive

TBoyarski
2

You could remove the extension within the shell command

rule sketch:
  input:
    'out/genomes.txt'
  output:
    'out/genomes.msh'
  shell:
    'mash sketch -l {input} -k 31 -s 100000 -o $(echo "{output}" | sed -e "s/[.]msh$//")'
Jan Schreiber
  • This is so far the most elegant solution, although it should be possible in principle to perform very basic operations on input/output. Sad! – mgalardini Sep 27 '17 at 11:07