
I'm quite new to both bioinformatics and Snakemake, but I'm trying to put together an automated pipeline for Tn-seq data analysis.

I've written a script that reads in a .bedgraph file and outputs a separate file for each contig, since I'd like to analyze each contig separately. Each output file is named with the basename of the input file plus the contig name:

import csv
import re

input_handle = FILE
path = PATH

# First pass: collect the set of contig names (column 1 of the bedgraph).
with open(input_handle) as data:
    data_reader = csv.reader(data, delimiter='\t')
    contigs = {row[0] for row in data_reader}

# Second pass per contig: write matching rows to <basename>-<contig>.bedgraph.
for c in contigs:
    with open(input_handle) as data:
        data_reader = csv.reader(data, delimiter='\t')
        out_file = path + re.search(r".+/(.+)(?=\.bedgraph)", input_handle).group(1) + "-" + c + ".bedgraph"
        with open(out_file, 'w') as f_out:
            for row in data_reader:
                if row[0] == c:
                    f_out.write("\t".join(row) + "\n")

I'm struggling to figure out how to incorporate this into Snakemake appropriately. I'm also pretty unclear on how to then refer to that output from within my script.

EDIT: I had previously said I thought I should be using dynamic(), but after looking around more it appears that it is deprecated.

rule split_bed:
    input:
        "bam_coverage/{sample}.bedgraph"
    output:
        "split_beds/{sample}/?????"  # <---- unsure what goes here
    script:
        "scripts/split_bed.py"

The script it calls:

import csv

input_bedgraph = snakemake.input[0]

# Collect the set of contig names from column 1.
with open(input_bedgraph) as data:
    data_reader = csv.reader(data, delimiter='\t')
    contigs = {row[0] for row in data_reader}

for c in contigs:
    with open(input_bedgraph) as data:
        data_reader = csv.reader(data, delimiter='\t')
        out_file = snakemake.output[0]  # <---- this is the part I'm unsure about
        with open(out_file, 'w') as f_out:
            for row in data_reader:
                if row[0] == c:
                    f_out.write("\t".join(row) + "\n")

If anyone could point me in the right direction it would be much appreciated! The tutorial was great for getting started and I've got plenty of rules that work well before this, but I'm a bit lost at this point.

EDIT2:

I've confirmed that simply using the following, with an "all" rule expanding over the split_beds folder, works to make a single file with a single contig from each sample, so the script itself works fine... I just need to be able to write multiple outputs into different folders...

rule split_bed:
    input:
        "bam_coverage/{sample}.bedgraph"
    output:
        "split_beds/{sample}.bedgraph"
    script:
        "scripts/split_bed.py"

2 Answers


After getting some help from someone who knows what they're doing, the problem became a bit easier to solve.

I did end up using checkpoints (thanks @Luigi). I believe the problem was simplified by using an awk one-liner instead of a script, but the code below, along with a modified "all" rule, ended up getting me all of the files I needed.

checkpoint split_bed:
    input:
        "bedgraphs/{sample}.bedgraph"
    output:
        directory("split_beds/{sample}/")
    shell:
        "mkdir split_beds/{wildcards.sample}; awk \'{{print $0 > \"split_beds/{wildcards.sample}/{wildcards.sample}_\"$1\".bedgraph\"}}\' {input}"

def aggregate_input(wildcards):
    checkpoint_output = checkpoints.split_bed.get(**wildcards).output[0]
    return expand("split_beds/{sample}/{contig}.bedgraph",
                  sample=wildcards.sample,
                  contig=glob_wildcards(os.path.join(checkpoint_output, "{contig}.bedgraph")).contig)

rule aggregate:
    input:
        aggregate_input
    output:
        "split_beds/{sample}/{contig}"

And the "all" rule is:

rule all:
    input:
        expand("split_beds/{sample}/", sample=config["samples"])

Do you need to use the contigs downstream in your snakemake pipeline? If so, dynamic() is definitely the way to go. This related question is a very accessible intro to dynamic(). It'd potentially look something like this:

rule split_bed:
    input:
        "bam_coverage/{sample}.bedgraph"
    output:
        dynamic("split_beds/{sample}/{contig}.fasta")
    script:
        "scripts/split_bed.py"

rule process_contigs:
    input:
        dynamic("split_beds/{sample}/{contig}.fasta")
    ...

If my understanding is correct, snakemake understands this as "process_contigs needs some number of files with names matching the pattern 'split_beds/{sample}/{contig}.fasta', which will be produced by split_bed".

Alternatively, if you just want to splort out your split contigs and don't need those files processed individually within snakemake, a less elegant way is to just touch a file once you're done processing the contigs, then use that as the output for the snakemake rule (using pathlib).

Your sample processing code:

input_handle = FILE
path = PATH

...

from pathlib import Path

...

for c in contigs:

    ...

Path(snakemake.output[0]).touch()  # i.e. split_beds/status/<sample>_completed.txt for this rule

Your rule:

rule split_bed:
    input:
        "bam_coverage/{sample}.bedgraph"
    output:
        "split_beds/status/{sample}_completed.txt"
    script:
        "scripts/split_bed.py"

I would only really do the second one if the split contigs are a terminal point of your pipeline and you don't want to bother with making another rule to take the dynamic contig files as input.

Edit:

I've just remembered that checkpoints exist. I think these replaced dynamic()--checkpoints basically allow snakemake to re-evaluate the DAG at certain points of the workflow, which I think is what you need, since you'll be making a variable number of individual contig files for each sample. Thus, the DAG can't be finalized at the beginning of the run (I think this is what checkpoints are designed to solve). I haven't used them myself so I'm not going to mislead you by trying to pseudocode an example, but hopefully this points you in the right direction!

Luigi
  • Thanks for your suggestion! I am definitely going to be using the split contigs downstream. I have another script that I don't think will be such a problem (i.e. one input, one output), which will end up outputting a wig file with the number of reads at each TA site in the genome. Currently I'm getting: `Wildcards in input files cannot be determined from output files: sample`, and I'm unclear on whether my actual script will output something that actually works. Sorry if this is really dumb. I've been looking but can't find good documentation for dynamic(). – seandworkman Aug 10 '21 at 19:34
  • I've also now tried defining sample with: `dynamic("split_beds/{sample}/{contig}.fasta", sample=config['samples'])` and it returns `dynamic() got an unexpected keyword argument 'sample'` ... Dynamic appears to be deprecated at this point which is why I can't seem to find documentation. – seandworkman Aug 10 '21 at 19:41
  • No worries--snakemake's DAG structure is not super intuitive. It sounds from that error that it's grumpy about `{sample}`--does that only occur when you use `dynamic()`? If you're still building the workflow, do you have a branch rule at the end of processing/beginning of the DAG? (e.g. somewhere where you use `expand()`) – Luigi Aug 10 '21 at 19:43
  • I.e., the `bcftools_call` rule in this example: https://snakemake.readthedocs.io/en/stable/tutorial/basics.html#summary – Luigi Aug 10 '21 at 19:44
  • might also be a version issue if Snakemake is deprecating dynamic()--in the example in your second comment, I don't think the func is expecting the `sample` arg – Luigi Aug 10 '21 at 19:45
  • Right now I have `rule all: input: expand("split_beds/{sample}/{contig}.bedgraph", sample=config["samples"])` at the very top, but it's asking about configs if I try that. I previously had the pipeline run cutadapt, bowtie, samtools, and deeptools just fine - the "all" rule was defined to use expand on the .bedgraph files created by deeptools. – seandworkman Aug 10 '21 at 19:52
  • I think checkpoints are your solution: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#data-dependent-conditional-execution. I edited my answer with a bit more detail – Luigi Aug 10 '21 at 20:13
  • Thanks, I'll give this a go! I tried something like https://stackoverflow.com/questions/49574969/snakemake-unknown-output-input-files-after-splitting-by-chromosome but it just resulted in my fans revving up for a while and my terminal hanging. – seandworkman Aug 10 '21 at 20:27
  • yeah, I feel like a lot of my snakemake-ing is cobbling together things that probably could be elegantly done if I was more familiar with the Snakemake architecture. Best of luck! – Luigi Aug 10 '21 at 20:31