2

I've been trying to manually create snakemake wildcards by importing a tab-delimited file that looks as follows:

dataset sample species frr

PRJNA493818_GSE120639_SRP162872 SRR7942395_GSM3406786_sAML_Control_1 Homo_sapiens 1 PRJNA493818_GSE120639_SRP162872 SRR7942395_GSM3406786_sAML_Control_1 Homo_sapiens 2 PRJNA362883_GSE93946_SRP097621 SRR5195524_GSM2465521_KrasT_45649_NoDox Mus_musculus 1 PRJNA362883_GSE93946_SRP097621 SRR5195524_GSM2465521_KrasT_45649_NoDox Mus_musculus 2

This is how my snakemake file looks like (minimal example):

import pandas as pd
import os

# --- Importing Configuration Files --- #
configfile: "/DATA/config/config.yaml"

table_cols = ['dataset','sample','species','frr']
table_samples = pd.read_table('/DATA/config/samples.tsv', header=0, sep='\t', names=table_cols)
DATASET = table_samples.dataset.values.tolist()
SAMPLE = table_samples['sample'].values.tolist()
SPECIES = table_samples.species.values.tolist()
FRR = table_samples.frr.values.tolist()
print(DATASET,SAMPLE,SPECIES,FRR)

rule all:
        input:
                expand(config["project_path"]+"results/{dataset}/rawQC/{sample}_{species}_RNA-Seq_{frr}_fastqc.html", zip, dataset=DATASET, sample=SAMPLE, species=SPECIES, frr=FRR)

## fastq files quality control
rule rawFastqc:
        input:
                rawread=config["project_path"]+"resources/raw_datasets/{dataset}/{sample}_{species}_RNA-Seq_{frr}.fastq.gz"
        output:
                zip=config["project_path"]+"results/{dataset}/rawQC/{sample}_{species}_RNA-Seq_{frr}_fastqc.zip",
                html=config["project_path"]+"results/{dataset}/rawQC/{sample}_{species}_RNA-Seq_{frr}_fastqc.html"
        threads:
                12
        params:
                path=config["project_path"]+"results/{dataset}/rawQC/"
        conda:
                "envs/bulkRNAseq.yaml"
        shell:
                """
                fastqc {input.rawread} --threads {threads} -o {params.path}
                """

When I run:

snakemake -s test --use-conda -n -p

This is the output:

['PRJNA493818_GSE120639_SRP162872', 'PRJNA493818_GSE120639_SRP162872', 'PRJNA362883_GSE93946_SRP097621', 'PRJNA362883_GSE93946_SRP097621'] ['SRR7942395_GSM3406786_sAML_Control_1', 'SRR7942395_GSM3406786_sAML_Control_1', 'SRR5195524_GSM2465521_KrasT_45649_NoDox', 'SRR5195524_GSM2465521_KrasT_45649_NoDox'] ['Homo_sapiens', 'Homo_sapiens', 'Mus_musculus', 'Mus_musculus'] [1, 2, 1, 2]
Building DAG of jobs...
Job counts:
    count   jobs
    1   all
    4   rawFastqc
    5

[Thu Aug 11 00:57:30 2022]
rule rawFastqc:
    input: /DATA/resources/raw_datasets/PRJNA362883_GSE93946_SRP097621/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_1.fastq.gz
    output: /DATA/results/PRJNA362883_GSE93946_SRP097621/rawQC/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_1_fastqc.zip, /DATA/results/PRJNA362883_GSE93946_SRP097621/rawQC/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_1_fastqc.html
    jobid: 3
    wildcards: dataset=PRJNA362883_GSE93946_SRP097621, sample=SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus, species=musculus, frr=1
    threads: 12


        fastqc /DATA/resources/raw_datasets/PRJNA362883_GSE93946_SRP097621/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_1.fastq.gz --threads 12 -o /DATA/results/PRJNA362883_GSE93946_SRP097621/rawQC/
        

[Thu Aug 11 00:57:30 2022]
rule rawFastqc:
    input: /DATA/resources/raw_datasets/PRJNA493818_GSE120639_SRP162872/SRR7942395_GSM3406786_sAML_Control_1_Homo_sapiens_RNA-Seq_1.fastq.gz
    output: /DATA/results/PRJNA493818_GSE120639_SRP162872/rawQC/SRR7942395_GSM3406786_sAML_Control_1_Homo_sapiens_RNA-Seq_1_fastqc.zip, /DATA/results/PRJNA493818_GSE120639_SRP162872/rawQC/SRR7942395_GSM3406786_sAML_Control_1_Homo_sapiens_RNA-Seq_1_fastqc.html
    jobid: 1
    wildcards: dataset=PRJNA493818_GSE120639_SRP162872, sample=SRR7942395_GSM3406786_sAML_Control_1_Homo, species=sapiens, frr=1
    threads: 12


        fastqc /DATA/resources/raw_datasets/PRJNA493818_GSE120639_SRP162872/SRR7942395_GSM3406786_sAML_Control_1_Homo_sapiens_RNA-Seq_1.fastq.gz --threads 12 -o /DATA/results/PRJNA493818_GSE120639_SRP162872/rawQC/
        

[Thu Aug 11 00:57:30 2022]
rule rawFastqc:
    input: /DATA/resources/raw_datasets/PRJNA362883_GSE93946_SRP097621/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_2.fastq.gz
    output: /DATA/results/PRJNA362883_GSE93946_SRP097621/rawQC/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_2_fastqc.zip, /DATA/results/PRJNA362883_GSE93946_SRP097621/rawQC/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_2_fastqc.html
    jobid: 4
    wildcards: dataset=PRJNA362883_GSE93946_SRP097621, sample=SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus, species=musculus, frr=2
    threads: 12


        fastqc /DATA/resources/raw_datasets/PRJNA362883_GSE93946_SRP097621/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_2.fastq.gz --threads 12 -o /DATA/results/PRJNA362883_GSE93946_SRP097621/rawQC/
        

[Thu Aug 11 00:57:30 2022]
rule rawFastqc:
    input: /DATA/resources/raw_datasets/PRJNA493818_GSE120639_SRP162872/SRR7942395_GSM3406786_sAML_Control_1_Homo_sapiens_RNA-Seq_2.fastq.gz
    output: /DATA/results/PRJNA493818_GSE120639_SRP162872/rawQC/SRR7942395_GSM3406786_sAML_Control_1_Homo_sapiens_RNA-Seq_2_fastqc.zip, /DATA/results/PRJNA493818_GSE120639_SRP162872/rawQC/SRR7942395_GSM3406786_sAML_Control_1_Homo_sapiens_RNA-Seq_2_fastqc.html
    jobid: 2
    wildcards: dataset=PRJNA493818_GSE120639_SRP162872, sample=SRR7942395_GSM3406786_sAML_Control_1_Homo, species=sapiens, frr=2
    threads: 12


        fastqc /DATA/resources/raw_datasets/PRJNA493818_GSE120639_SRP162872/SRR7942395_GSM3406786_sAML_Control_1_Homo_sapiens_RNA-Seq_2.fastq.gz --threads 12 -o /DATA/results/PRJNA493818_GSE120639_SRP162872/rawQC/
        

[Thu Aug 11 00:57:30 2022]
localrule all:
    input: /DATA/results/PRJNA493818_GSE120639_SRP162872/rawQC/SRR7942395_GSM3406786_sAML_Control_1_Homo_sapiens_RNA-Seq_1_fastqc.html, /DATA/results/PRJNA493818_GSE120639_SRP162872/rawQC/SRR7942395_GSM3406786_sAML_Control_1_Homo_sapiens_RNA-Seq_2_fastqc.html, /DATA/results/PRJNA362883_GSE93946_SRP097621/rawQC/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_1_fastqc.html, /DATA/results/PRJNA362883_GSE93946_SRP097621/rawQC/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_2_fastqc.html
    jobid: 0

Job counts:
    count   jobs
    1   all
    4   rawFastqc
    5
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.

It's clear that print(DATASET,SAMPLE,SPECIES,FRR) produces my desired wildcard values:

['PRJNA493818_GSE120639_SRP162872', 'PRJNA493818_GSE120639_SRP162872', 'PRJNA362883_GSE93946_SRP097621', 'PRJNA362883_GSE93946_SRP097621'] ['SRR7942395_GSM3406786_sAML_Control_1', 'SRR7942395_GSM3406786_sAML_Control_1', 'SRR5195524_GSM2465521_KrasT_45649_NoDox', 'SRR5195524_GSM2465521_KrasT_45649_NoDox'] ['Homo_sapiens', 'Homo_sapiens', 'Mus_musculus', 'Mus_musculus'] [1, 2, 1, 2]

However subsequently snakemake does not take these into account and produces the wrong wildcard values, despite the fact I'm not using glob_wildcards.

I'm clearly missing something, but I can't figure out what I'm doing wrong. I've also looked into the following post: Manually create snakemake wildcards .

Help would be much appreciated!

manvenk
  • 69
  • 6

3 Answers3

3

This is an attempt to explain why snakemake generates wildcard values that do not match the user's values. Consider this tiny example.

We want to create file a_b.A_B.txt and this Snakefile will do the job:

FOO = ['a_b']
BAR = ['A_B']

rule all:
    input:
        expand('{foo}.{bar}.txt', zip, foo=FOO, bar=BAR),

rule one:
    output:
        '{x}_{y}.txt',
    shell:
        r"""
        touch {output}
        """

Try it with snakemake -p -j 1 -n:

Building DAG of jobs...
...
[Thu Aug 11 09:43:30 2022]
rule one:
    output: a_b.A_B.txt
    jobid: 1
    reason: Missing output files: a_b.A_B.txt
    wildcards: x=a_b.A, y=B
    resources: tmpdir=/tmp


        touch a_b.A_B.txt
...

It works, but a couple of counterintuitive things happen:

  • Wildcard names in rule all, {foo} and {bar}, do not match those in rule one, {x} and {y}. Snakemake doesn't care about it, it just sees wildcards that can take any value.

  • Rule all asks for file {wildcard1}.{wildcard2} (note the dot .) but rule one has output {wildcard1}_{wildcard2} (note the underscore _). There seem to be a mismatch and the script should fail. Instead it works because...

  • Snakemake has freedom to find regex values that match inputs to outputs. In this case, assigning wildcards: x=a_b.A and y=B will match the requested a_b.A_B.txt. (In my opinion this is counterintuitive).

Rewriting this snakefile using constraints leads to less surprising behavior:

FOO = ['a_b']
BAR = ['A_B']

wildcard_constraints:
    x='|'.join([re.escape(x) for x in FOO]),
    y='|'.join([re.escape(x) for x in BAR]),

rule all:
    input:
        expand('{foo}.{bar}.txt', zip, foo=FOO, bar=BAR),

rule one:
    output:
        '{x}_{y}.txt',
    shell:
        r"""
        touch {output}
        """

Fails with:

snakemake -p -j 1 -n

Building DAG of jobs...
MissingInputException in line 8 of /home/dario/Downloads/Snakefile:
Missing input files for rule all:
    affected files:
        a_b.A_B.txt

This version works and uses the expected wildcard values:

FOO = ['a_b']
BAR = ['A_B']

wildcard_constraints:
    x='|'.join([re.escape(x) for x in FOO]),
    y='|'.join([re.escape(x) for x in BAR]),

rule all:
    input:
        expand('{foo}.{bar}.txt', zip, foo=FOO, bar=BAR),

rule one:
    output:
        '{x}.{y}.txt', # Use dot, not underscore
    shell:
        r"""
        touch {output}
        """
snakemake -p -j 1 -n
...
rule one:
    output: a_b.A_B.txt
    jobid: 1
    reason: Missing output files: a_b.A_B.txt
    wildcards: x=a_b, y=A_B
    resources: tmpdir=/tmp


        touch a_b.A_B.txt
...

Note that we get the expected wildcard values x=a_b and y=A_B. I would also use consistent naming of wildcards (foo, bar OR x, y) unless there is a good reason not to.

The command x = '|'.join([re.escape(x) for x in FOO]) tells snakemake that wildcard x can only take values that match the regular expression '|'.join([re.escape(x) for x in FOO]) (| it's pipe symbol, not I). So that value a_b.A will not match and the script will fail. escape is to ensure that special characters like . and * are not interpreted as regexes.


This is my understanding - I think it would good to have a dedicated documentation about it...

dariober
  • 8,240
  • 3
  • 30
  • 47
2

As mentioned here, the naming should be improved. If there is no way around this, one option is to apply wildcard_constraints:

# regex-safe approach as suggested by @dariober
from re import escape

UNIQ_SPECIES = [escape(x) for x in table_samples.species.unique()]

wildcard_constraints:
    species="|".join(UNIQ_SPECIES)
SultanOrazbayev
  • 14,900
  • 3
  • 16
  • 46
  • Thank you for the suggestion, however this doesn't solve the root of the issue. Why is snakemake ignoring my manually created wildcards and how do I make sure it doesn't ignore those? – manvenk Aug 11 '22 at 05:25
  • It does resolve the root issue, because the root issue is ambiguity in parsing "{x}_{y}" from "a_b_c". Should x be "a_b" or "a"? `snakemake` can't know without `wilcard_constraints`. – SultanOrazbayev Aug 11 '22 at 05:42
  • I appreciate renaming would solve this issue. The problem is that I've got hundreds of files with this sort naming. I'd first prefer to solve it using manually created wildcards. Regardless, I'm curious why snakemake ignores my manually created wildcards? – manvenk Aug 11 '22 at 05:50
  • That's a design issue: snakemake infers wildcards from outputs, so the fact that you have manually created wildcards is useful for `expand()`, but doesn't help at the rule level. – SultanOrazbayev Aug 11 '22 at 05:52
  • 1
    *I appreciate renaming would solve this issue*: You don't have to rename. In fact, I object to doing it. If samples have certain names, changing them is going to cause confusion later. Setting `wildcard_constraints` will fix it and I would use `species="|".join([re.escape(x) for x in UNIQ_SPECIES])`. I'm not sure about the exact mechanisms, but the issue is that snakemake tries to match filenames to wildcards as regular expressions, not as fixed strings. If you want fixed strings you have to constrain (and I would prefer this being the default). – dariober Aug 11 '22 at 06:21
  • I object to your objection. Using "_" to perform two roles (separating wildcards and separating words/keys within a wildcard) is a source of unnecessary ambiguity. It's usually cheap to correct, though I appreciate there are cases where that's not feasible... and then constraining wildcards is a way forward. – SultanOrazbayev Aug 11 '22 at 06:25
  • 1
    @SultanOrazbayev - We had this exchange before and it looks like we are going to agree to disagree! – dariober Aug 11 '22 at 06:35
  • Hehe, well, I'd be curious to know more. I might ask a separate question on this, but right now not quite sure how to formulate the right question. – SultanOrazbayev Aug 11 '22 at 06:43
  • I would agree with dariober. It makes me nervous changing hundreds of file names, it's a risky step in the process... Afterwards, it will be very difficult retracing mistakes. So do I understand that currently it's IMPOSSIBLE (?) to manually make wildcards and that snakemake simply overrides? I'm so surprised of this behavior without even using glob_wildcards. – manvenk Aug 11 '22 at 06:55
  • 1
    @manvenk It's fine to make wildcards manually as you do in your question and in my opinion this is better than `glob`'ing anything that looks like an input file. Just add an entry to `wildcard_contraints` to tell snakemake to use the exact values you created. So I would add a line `sample="|".join([re.escape(x) for x in SAMPLES])` and the same for `dataset` and `frr`. I do this all the time and it works for me. – dariober Aug 11 '22 at 07:09
  • @SultanOrazbayev and @dariober, both `UNIQ_SPECIES = [escape(x) for x in table_samples.species.unique()] wildcard_constraints: species="|".join(UNIQ_SPECIES)` and `sample="|".join([re.escape(x) for x in SAMPLES])` work for me, thanks! I looked up the meaning of the individual function, however I still don't understand what the `"I"` does in this function? – manvenk Aug 11 '22 at 08:58
  • 1
    It's a pipe symbol, which in regex means OR. "A|B" = A OR B – SultanOrazbayev Aug 11 '22 at 09:04
  • @manvenk, SultanOrazbayev see also my answer - any suggestion for improvement much welcome. – dariober Aug 11 '22 at 09:16
0

This is an alternative approach that leverages python side of the Snakefile. The idea here is to construct a set of rules that generate a specific output.

# --- Importing Configuration Files --- #
configfile: "/DATA/config/config.yaml"

table_cols = ['dataset','sample','species','frr']
table_samples = pd.read_table('/DATA/config/samples.tsv', header=0, sep='\t', names=table_cols)
DATASET = table_samples.dataset.values.tolist()
SAMPLE = table_samples['sample'].values.tolist()
SPECIES = table_samples.species.values.tolist()
FRR = table_samples.frr.values.tolist()
print(DATASET,SAMPLE,SPECIES,FRR)

rule all:
        input:
                expand(config["project_path"]+"results/{dataset}/rawQC/{sample}_{species}_RNA-Seq_{frr}_fastqc.html", zip, dataset=DATASET, sample=SAMPLE, species=SPECIES, frr=FRR)

## fastq files quality control
for dataset, sample, species, frrr in zip(DATASET,SAMPLE,SPECIES,FRR):
   rule:
      name: f'{dataset}+{sample}+{species}+{frrr}'  # look how easy it's to find another symbol to separate the wildcards
      output:
         # I am simplifying this as formatting nested code is tricky
         # but the key idea is that now you want to apply something like f-string formatting to essentially hard-code specific outputs for specific rules
         abc=f"{dataset}/rawQC/{sample}_{species}_RNA-Seq_{frr}_fastqc.zip"
      shell: # make sure to hard-code shell contents also with f-string formatting
         f"echo 'dataset is {dataset}'"

This approach is less desirable, however, since the number of rules grows fast (one rule for each desired output) and in general debugging these rules might get trickier. The purpose of the code above is to show that in principle it's possible to avoid using wildcard_constraints even when working with filenames that are difficult/ambiguous for snakemake to parse.

SultanOrazbayev
  • 14,900
  • 3
  • 16
  • 46