I'm quite new to both bioinformatics and Snakemake, but I'm trying to put together an automated pipeline for Tn-seq data analysis.
I've written a script that reads in a .bedgraph file and writes out a separate file for each contig, since I'd like to analyze each contig separately. Each output file is named with the basename of the input file plus the contig name (e.g. sampleA.bedgraph plus contig tig1 becomes sampleA-tig1.bedgraph):
import csv
import re

input_handle = FILE  # path to the input .bedgraph
path = PATH          # output directory, with trailing separator

# First pass: collect the set of contig names from the first column
with open(input_handle) as data:
    data_reader = csv.reader(data, delimiter='\t')
    contigs = {row[0] for row in data_reader}

# Second pass per contig: write the matching rows to <basename>-<contig>.bedgraph
for c in contigs:
    out_file = path + re.search(r".+/(.+)(?=\.bedgraph)", input_handle).group(1) + "-" + c + ".bedgraph"
    with open(input_handle) as data, open(out_file, 'w') as f_out:
        data_reader = csv.reader(data, delimiter='\t')
        for row in data_reader:
            if row[0] == c:
                f_out.write("\t".join(row) + "\n")
I'm struggling to figure out how to express this as a Snakemake rule appropriately. I'm also pretty unclear on how to then refer to that output from within my script.
EDIT: I had previously said I thought I should be using dynamic, but after looking around more it appears that it has been deprecated.
rule split_bed:
    input:
        "bam_coverage/{sample}.bedgraph"
    output:
        ----> "split_beds/{sample}/?????"
    script:
        "scripts/split_bed.py"
which calls:
import csv
import re

input_bedgraph = snakemake.input[0]

# First pass: collect the set of contig names from the first column
with open(input_bedgraph) as data:
    data_reader = csv.reader(data, delimiter='\t')
    contigs = {row[0] for row in data_reader}

# Second pass per contig: write the matching rows to a per-contig file
for c in contigs:
    ----> out_file = snakemake.output[0]
    with open(input_bedgraph) as data, open(out_file, 'w') as f_out:
        data_reader = csv.reader(data, delimiter='\t')
        for row in data_reader:
            if row[0] == c:
                f_out.write("\t".join(row) + "\n")
If anyone could point me in the right direction it would be much appreciated! The tutorial was great for getting started and I've got plenty of rules before this that work well, but I'm a bit lost at this point.
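From the docs it sounds like checkpoints are the intended replacement for dynamic, so my rough (and untested) understanding is that the rule would have to output a directory, with a downstream rule collecting whatever per-contig files end up inside it once the checkpoint has run. A minimal sketch of what I think that might look like is below; the analyze_contigs rule, its output path, and the {contig} filename pattern are just placeholders, and I'm not at all sure this is the right shape:

checkpoint split_bed:
    input:
        "bam_coverage/{sample}.bedgraph"
    output:
        directory("split_beds/{sample}")
    script:
        "scripts/split_bed.py"

# The script would then have to create snakemake.output[0] itself and write
# one <contig>.bedgraph file per contig into that directory.

def split_bed_files(wildcards):
    # Only evaluated after the checkpoint has finished, so we can glob
    # whatever per-contig files were actually written.
    out_dir = checkpoints.split_bed.get(sample=wildcards.sample).output[0]
    contigs = glob_wildcards(f"{out_dir}/{{contig}}.bedgraph").contig
    return expand(f"{out_dir}/{{contig}}.bedgraph", contig=contigs)

rule analyze_contigs:  # placeholder for whatever the downstream analysis is
    input:
        split_bed_files
    output:
        "analysis/{sample}.done"
    shell:
        "touch {output}"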
EDIT2:
I've confirmed that the following, together with an "all" rule that expands over the split_beds folder (sketched after the rule below), works to produce a single file per sample containing a single contig, so the script itself works fine... I just need to be able to write multiple outputs into per-sample folders...
rule split_bed:
    input:
        "bam_coverage/{sample}.bedgraph"
    output:
        "split_beds/{sample}.bedgraph"
    script:
        "scripts/split_bed.py"