How to gather files from subdirectories to run jobs in Snakemake?

Question

I am currently working on this project where iam struggling with this issue.

My current directory structure is

 /shared/dir1/file1.bam
 /shared/dir2/file2.bam
 /shared/dir3/file3.bam

I want to convert various .bam files to fastq in the results directory

  results/file1_1.fastq.gz 
  results/file1_2.fastq.gz
  results/file2_1.fastq.gz 
  results/file2_2.fastq.gz
  results/file3_1.fastq.gz 
  results/file3_2.fastq.gz

I have the following code:

END=["1","2"]
(dirs, files) = glob_wildcards("/shared/{dir}/{file}.bam")

rule all:
    input: expand( "/results/{sample}_{end}.fastq.gz",sample=files,  end=END)

rule bam_to_fq:
    input:  {dir}/{sample}.bam"
    output: left="/results/{sample}_1.fastq", right="/results/{sample}_2.fastq"
    shell: "/shared/packages/bam2fastq/bam2fastq --force -o /results/{sample}.fastq {input}"

This outputs the following error:

Wildcards in input files cannot be determined from output files:
'dir'

Any help would be appreciated

If your bam data is expected to always be in the `/shared/dir/file.bam`, then maybe having `` as wildcard could lead to "cleaner" code. — bli, Aug 23 '17 at 08:52

TBoyarski · Answer 1 · 2017-08-22T16:48:21.320

You're just missing an assignment for "dir" in your input directive of the rule bam_to_fq. In your code, you are trying to get Snakemake to determine "{dir}" from the output of the same rule, because you have it setup as a wildcard. Since it didn't exist, as a variable in your output directive, you received an error.

    input:  
         "{dir}/{sample}.bam"
    output: 
         left="/results/{sample}_1.fastq", 
         right="/results/{sample}_2.fastq",

Rule of thumb: input and output wildcards must match

rule all:
    input: 
         expand("/results/{sample}_{end}.fastq.gz", sample=files, end=END)

rule bam_to_fq:
    input:  
         expand("{dir}/{{sample}}.bam", dir=dirs)
    output: 
         left="/results/{sample}_1.fastq", 
         right="/results/{sample}_2.fastq"
    shell: 
         "/shared/packages/bam2fastq/bam2fastq --force -o /results/{sample}.fastq {input}

NOTES

the sample variable in the input directive now requires double {}, because that is how one identifies wildcards in an expand.
dir is no longer a wildcard, it is explicitly set to point to the list of directories determined by the glob_wildcard call and assigned to the variable "dirs" which I am assuming you make earlier in your script, since the assignment of one of the variables is successful already, in your rule all input "sample=files".
I like and recommend easily differentiable variable names. I'm not a huge fan of the usage of variable names "dir", and "dirs". This makes you prone to pedantic spelling errors. Consider changing it to "dirLIST" and "dir"... or anything really. I just fear one day someone will miss an 's' somewhere and it's going to be frustrating to debug. I'm personally guilty, an thus a slight hypocrite, as I do use "sample=samples" in my core Snakefile. It has caused me minor stress, thus why I make this recommendation. Also makes it easier for others to read your code as well.

EDIT 1; Adding to response as I had initially missed the requirement for key-value matching of the dir and sample

I recommend keeping separate the path and the sample name in different variables. Two approaches I can think of:

Keep using glob_wildcards to make a blanket search for all possible variables, and then use a python function to validate which path+file combinations are legit.
Drop the usage of glob_wildcards. Propagate the directory name as a wildcard variable, {dir}, throughout your rules. Just set it as a sub-directory of "results". Use pandas to pass known, key-value pairs listed in a file to the rule all. Initially I suggest generating the key-value pairs file manually, but eventually, it's generation could just be a rule upstream of others.

Generalizing bam_to_fq a little bit... utilizing an external config, something like....

from pandas import read_table

rule all:
    input: 
         expand("/results/{{sample[1][dir]}}/{sample[1][file]}_{end}.fastq.gz", sample=read_table(config["sampleFILE"], " ").iterrows(), end=['1','2'])

rule bam_to_fq:
    input:  
         "{dir}/{sample}.bam"
    output: 
         left="/results/{dir}/{sample}_1.fastq", 
         right="/results/{dir}/{sample}_2.fastq"
    shell: 
         "/shared/packages/bam2fastq/bam2fastq --force -o /results/{sample}.fastq {input}

sampleFILE

dir file
dir1 file1
dir2 file2
dir3 file3

Thanks unfortunately now I have the following error: `( Missing input files for rule bam_to_fq: /dir1/file1.bam, /dir2/file1.bam, /dir3/file1.bam.)` It seems that the dir was expanded but sample was not, I need the dir to be paired with the file from which it originated. — yevishere, Aug 21 '17 at 21:34
Okay, that requirement drastically complicates the answer. You should not be using glob_wildcards if this is the case. Glob_wildcards do not lend themselves well to the pairing of dir and sample names. It's much harder, and I suggest using pandas to accomplish it. An example using pandas to read key-value pairs from a file can be seen [here](https://stackoverflow.com/questions/45508579/snakemake-wildcards-or-expand-command/45513130#45513130). — TBoyarski, Aug 21 '17 at 21:43
Ok, thanks, that makes me feel better. =) I thought I was loosing my mind there for a second, I was already working on pandas solution as a backup. — yevishere, Aug 21 '17 at 21:50
Heh, thats the beauty of the underlying "GNU Make"-like system that the author built upon. Using a regex-type mechanism, it will attempt to see if the explicit string in the rule "all" can be regex'd using the greedy wildcards in the output of "bam_to_fq". This is why the author has also made available wildcard_constaints, something I heavily rely on to control the wildcards in a 100+ step pipeline. — TBoyarski, Aug 22 '17 at 18:57

How to gather files from subdirectories to run jobs in Snakemake?

1 Answers1