8

I recently decided to start with snakemake. I can't find anything that fits my needs neither on stack, nor on the snakemake doc. I feel like I don't understand something and I may need some explanations.

I am trying to make a simple snakemake workflow that take as input, a fastq file and a sequencing-summary file (that contains infos about the reads) and filter the reads within the fast into several file (low.fastq and high.fastq).

My input data and my Snakefile I'm trying to execute are stored like this :

.
├── data
│   ├── sequencing-summary-example.txt 
│   └── tiny-example.fastq 
├── Snakefile
└── split_fastq

And this is what I've tried so far :

*imports*
rule targets:
    input:
        "split_fastq/low.fastq",
        "split_fastq/high.fastq"

rule split_fastq:
    input:
        "data/{reads}.fastq",
        "data/{seqsum}.txt"
    output:
        "split_fastq/low.fastq",
        "split_fastq/high.fastq"
    run:
        * do the thing *

I expected to have a directory "split_fastq" filled with a "low" and a "high" fastq. But instead I got the error :

Building DAG of jobs...
WildcardError in line 10 of /work/sbsuser/test/roxane/alignement-ont/Snakefile:
Wildcards in input files cannot be determined from output files:
'reads'

Even though it seems to be a very popular error, I'm not sure if I don't understand how to use wildcards or if there is an other problem. Am I using the "input" and "output" correctly ?

halfer
  • 19,824
  • 17
  • 99
  • 186
Roxane
  • 111
  • 7
  • 2
    I answered a very similar question recently: https://stackoverflow.com/questions/57327210/snakemake-dynamically-derive-the-targets-from-input-files/57332114#57332114 – Maarten-vd-Sande Aug 27 '19 at 14:14
  • 1
    Yes, I have been reading this thread, but I was puzzled by the fact you had to write in the stone the name of your file into the workflow. I want to use this workflow for any fastq, I don't wanted to write a path or a file name, I would like something dynamic, a bit like an -f myFastq option for example. But maybe snakenamke isn't meant for that and is more "specific dataset oriented" ? Maybe thats what I am not understanding correctly ? – Roxane Aug 27 '19 at 14:20
  • 2
    There are (at least) two options. The first is that the output has the same wildcards as the input. Then you can run something like `snakemake split_fastq/low-example.fastq`. The other is that you use the globbing approach in the link, which then will find the relevant samples, and decides which output needs to be generated. – Maarten-vd-Sande Aug 27 '19 at 14:30
  • 1
    Alright, so to replace in my example, I would write on the output section split_fastq/{reads}-low.fastq","split_fastq/{reads}high.fastq" is that correct ? – Roxane Aug 27 '19 at 14:40
  • 1
    Kinda, you also need a wildcard for the summary.txt, and each output needs both wildcards (reads, and seqsum). Take a look again at the docs how wildcards work. I unfortunately do not have time to type out an answer. – Maarten-vd-Sande Aug 27 '19 at 14:43
  • 1
    No problem, thank for your enlightments, I'll keep digging in there. Thanks for your time ! – Roxane Aug 27 '19 at 14:58
  • Besides the good explanations given in Colin's answer, just in case this might help you understand how snakemake works, I have made a proposal to add some information to the official documentation, but haven't had time to improve it yet: https://bitbucket.org/blaiseli/snakemake/src/f11247997a378c48fe0f1dc4f921f0cb64e19a37/docs/snakefiles/understanding.rst – bli Aug 30 '19 at 08:42

1 Answers1

7

The problem is that you have the wildcard in the input, but not in the output. Wildcards are required in the output. Think about it this way, by putting the wildcard in the input, you're creating a rule that you are intending to be run individually on many different fastq files. But the output files for that rule will be exactly the same file for each of those different fastq files. They'll overwrite each other! You want to incorporate the wildcard into your output files so you get a unique file for each possible input, for example:

rule split_fastq:
    input:
        "data/{reads}.fastq",
        "data/{seqsum}.txt"
    output:
        "split_fastq/{reads}.low.fastq",
        "split_fastq/{reads}.high.fastq"
    run:
        * do the thing *

Now with tiny-example.fastq as your input, you'll get tiny-example.low.fastq and tiny-example.high.fastq as output. And if you add a second fastq file, you'll get different high and low output files for that one. But this rule still won't work because the "seqsum" wildcard is also not part of the output. What you'll probably want to do in this case is have the sequence-summary-example.txt incorporate the name of the fastq file, for example call it sequence-summary-tiny-example.txt. Now you can make your rule like this:

rule split_fastq:
    input:
        "data/{reads}.fastq",
        "data/sequence-summary-{reads}.txt"
    output:
        "split_fastq/{reads}.low.fastq",
        "split_fastq/{reads}.high.fastq"
    run:
        * do the thing *

And now if you then add an other-example.fastq and sequence-summary-other-example.txt, your snakemake pipeline should be able to create other-example.low.fastq and other-example.high.fastq.

Snakemake always thinks backwards from how we tend to think. We first think about the input, and then what output it creates. But Snakemake knows what file it needs to make, and it's trying to figure out what input it needs to make it. So in your original rule, it knew it needed to make low.fastq, and it saw that the split_fastq rule could make that, but then it didn't know what the wildcard "reads" in the input should be. Now, in the new rule, it knows it needs to make tiny-example.low.fastq and sees that split_fastq can create output files of the template {reads}.low.fastq, so it says "Hey, if I make reads = tiny-example, then I can use this rule!" and then it looks at the input and says "Ok, since for input I need {reads}.fastq and I know reads = tiny-example then that means for input I need tiny-example.fastq, and I have that!"

Colin
  • 10,447
  • 11
  • 46
  • 54
  • Thank you for your very detailed answer that helped me understanding. I still don't understand everything, because I feel like at some point I have to give him some files names written in the stone. I modified my script as you suggested : But it still give me this as an error : "Target rules may not contain wildcards. Please specify concrete files or a rule without wildcards." Is there a good way to build my script ? Or should I call snakemake an output file I expect him to create when calling him instead of just using "snakemake" ? – Roxane Sep 02 '19 at 08:56
  • When you run snakemake, it will try to run the first rule. What you had in you original post is correct, where you define a rule named "targets" or something similar with only input files that are your "set in stone" filenames. You can also give snakemake specific files when execute the command, such as "snakemake split_fastq/tiny-example.low.fastq". – Colin Sep 03 '19 at 17:13