Snakemake expand+zip function unexpected behavior

Question

I am trying to use Snakemake to run the rnaQUAST tool on multiple inputs identified by two lists of paired keywords. I do not want every combination of these keywords, only the specific pairs. My understanding is that I need to pass zip to the expand() call in rule all, as below. However, Snakemake appears to fill the {sample} and {reference} wildcards in an unexpected way:

samples_rnaQUAST = ["TMW3250_15","TMW3250_20","TMW3256_15","TMW3256_20","TMW3261_15","TMW3261_20",
                    "TMW3673_15","TMW3673_20","TMW3285_15","TMW3285_20","TMW3275_15","TMW3275_20",
                    "TMW3681_15","TMW3681_20","TMW3287_15","TMW3287_20"]
references_rnaQUAST = ["German_ale","German_ale","German_ale","German_ale",
                       "English_ale","English_ale","American_ale","American_ale",
                       "Frohberg","Frohberg","Frohberg","Frohberg","Saaz",
                       "Saaz","Saaz","Saaz"]

rule all:
    input:
        expand("rnaquast/{sample}{reference}/short_report.txt", zip, sample=samples_rnaQUAST, reference=references_rnaQUAST)

rule rnaQUAST:
    input:
        transcriptome="trinity/{sample}/default_by_condition_trinity/Trinity.fasta",
        reference="genomes/{reference}_genome.fasta",
        gtf="genomes/AUGUSTUS_annotations/{reference}.gtf"
    output:
        report="rnaquast/{sample}{reference}/short_report.txt"
    threads: 16
    shell:"""
    /home/user/miniconda3/envs/rnaquast/share/rnaquast-1.5.1-0/rnaQUAST.py \
    --transcripts {input.transcriptome} \
    --reference {input.reference} \
    --gtf {input.gtf} \
    -t {threads} \
    -o rnaquast/{output.report}
    """

With snakemake 5.10.0, I get the following error, which shows how the {sample} and {reference} wildcards were filled:

Building DAG of jobs...
MissingInputException in line 65 of /home/user/analyses/Snakefile:
Missing input files for rule rnaQUAST:
genomes/e_genome.fasta
trinity/TMW3250_15German_al/default_by_condition_trinity/Trinity.fasta
genomes/AUGUSTUS_annotations/e.gtf

Why is Snakemake splitting the strings this way and assigning the wrong portions of them to the wildcards?

score 2 · Answer 1 · answered Jun 08 '22 at 03:55

The problem comes from this line:

report="rnaquast/{sample}{reference}/short_report.txt"

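Placing {sample} and {reference} side by side with no separator is ambiguous: Snakemake has no way of knowing where one wildcard ends and the next begins. By default each wildcard matches the regular expression .+ and the match is greedy, so the first wildcard takes as much of the string as it can and leaves a single character for the second. What should be split as {TMW3250_15}{German_ale} is therefore interpreted by Snakemake as {TMW3250_15German_al}{e}. There are at least two solutions: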

1. Introduce a clear separator between the wildcards to make parsing unambiguous, e.g. define the output as "rnaquast/{sample}+{reference}/short_report.txt" and use the same pattern in the expand() call in rule all. Any other symbol works instead of +, as long as it does not appear in the wildcard values.
2. Use wildcard_constraints. This is a bit trickier to handle in general, but in your case it would look like:

# put this before rule definitions
wildcard_constraints:
    sample="|".join(set(samples_rnaQUAST))
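To see why the original pattern splits the way it does, and why both fixes work, here is a minimal sketch in plain Python. It is not Snakemake's actual code: pattern_to_regex and split_wildcards are hypothetical helpers that only imitate the idea of turning each {wildcard} into a named regex group, assuming the default wildcard regex .+ and using a shortened sample list.

```python
import re

def pattern_to_regex(pattern, constraints=None):
    """Hypothetical helper: turn "{name}" placeholders into named regex groups.

    Simplified imitation of wildcard matching; each wildcard defaults to ".+".
    """
    constraints = constraints or {}
    parts, pos = [], 0
    for m in re.finditer(r"\{(\w+)\}", pattern):
        parts.append(re.escape(pattern[pos:m.start()]))  # literal text between wildcards
        name = m.group(1)
        wildcard_regex = constraints.get(name, ".+")
        parts.append("(?P<%s>%s)" % (name, wildcard_regex))
        pos = m.end()
    parts.append(re.escape(pattern[pos:]))
    return re.compile("".join(parts) + "$")

def split_wildcards(pattern, target, constraints=None):
    """Return the wildcard values that the pattern assigns to a target path."""
    match = pattern_to_regex(pattern, constraints).match(target)
    return match.groupdict() if match else None

# Shortened version of the lists from the question
samples = ["TMW3250_15", "TMW3250_20"]
references = ["German_ale", "German_ale"]

# Pairwise targets, the plain-Python equivalent of expand(..., zip, ...)
targets = ["rnaquast/%s+%s/short_report.txt" % (s, r) for s, r in zip(samples, references)]

# Original pattern: no separator, so the greedy first wildcard swallows almost everything
ambiguous = split_wildcards("rnaquast/{sample}{reference}/short_report.txt",
                            "rnaquast/TMW3250_15German_ale/short_report.txt")
print(ambiguous)    # {'sample': 'TMW3250_15German_al', 'reference': 'e'}

# Solution 1: a separator that never occurs in the wildcard values
separated = split_wildcards("rnaquast/{sample}+{reference}/short_report.txt", targets[0])
print(separated)    # {'sample': 'TMW3250_15', 'reference': 'German_ale'}

# Solution 2: restrict {sample} to the known sample names
sample_constraint = "|".join(re.escape(s) for s in samples)
constrained = split_wildcards("rnaquast/{sample}{reference}/short_report.txt",
                              "rnaquast/TMW3250_15German_ale/short_report.txt",
                              constraints={"sample": sample_constraint})
print(constrained)  # {'sample': 'TMW3250_15', 'reference': 'German_ale'}
```

The first result reproduces the split from the error message (sample TMW3250_15German_al, reference e). Wrapping each name in re.escape is an extra precaution I added here; it only matters if a sample name contains a regex metacharacter.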

I used your first solution and it worked -- I did not realize the issue with using the same symbol as is in the wildcard values to combine the wildcards in a string. Thank you very much! — Snail Shaman, Jun 10 '22 at 13:40
