For a Snakemake workflow, I need to manipulate tags in many BAM files, and would like to process these by piping them through a script (using the Snakemake script: directive). The specific way I'm doing this is with pysam stream processing.
import pysam

# read BAM from stdin, write BAM to stdout
infile = pysam.AlignmentFile("-", "rb")
outfile = pysam.AlignmentFile("-", "wb", template=infile)
for s in infile:
    # derive a read-group id from the first two fields of the query name
    (flowcell, lane) = s.query_name.split(':')[0:2]
    rg_id = ".".join([flowcell, lane])
    s.set_tag('RG', rg_id, 'Z')
    outfile.write(s)
This script works well standalone, but I haven't been able to figure out how to integrate it via the Snakemake script: directive. I'd prefer the streaming approach because it minimizes I/O and RAM usage.
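As a workaround, the streaming script can be driven from a shell: directive instead, with the pipes set up explicitly. A minimal sketch, assuming the script above is saved as scripts/fix_rg.py (a hypothetical path) and reads are piped through it:

```python
# Snakefile fragment (sketch; paths are assumptions)
rule fix_read_groups:
    input:
        "mapped/{sample}.bam"
    output:
        "fixed/{sample}.bam"
    shell:
        # stream the BAM through the stdin/stdout script
        "python scripts/fix_rg.py < {input} > {output}"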
Edit: I resorted to opening the BAM files directly (rather than streaming) to fix the RG tag.
import pysam

# parameters passed from snakemake
bam_file = snakemake.input[0]
fixed_bam_file = snakemake.output[0]

bamfile = pysam.AlignmentFile(bam_file, "rb")
fixed_bamfile = pysam.AlignmentFile(fixed_bam_file, "wb", template=bamfile)

# note: fetch() requires a BAM index; pass until_eof=True to iterate without one
for i, read in enumerate(bamfile.fetch()):
    (flowcell, lane) = read.query_name.split(':')[0:2]
    rg_id = ".".join([flowcell, lane])
    read.set_tag('RG', rg_id, 'Z')
    fixed_bamfile.write(read)
    if not (i % 100000):
        print("Updated the read group for {} reads to {}".format(i, rg_id))

bamfile.close()
fixed_bamfile.close()
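To make the tag derivation concrete, here is the same split-and-join logic applied to a hypothetical Illumina-style query name (the instrument/run values are made up for illustration):

```python
# Sketch of the read-group id derivation used above.
# "HWI-ST1276:71:..." is a hypothetical query name, not from the real data.
query_name = "HWI-ST1276:71:C1162ACXX:1:1101:1208:2458"

# first two colon-separated fields become the read-group id
flowcell, lane = query_name.split(":")[0:2]
rg_id = ".".join([flowcell, lane])
print(rg_id)  # HWI-ST1276.71
```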
EDIT: Snakemake's run: and shell: directives execute relative to the workdir: directory, while the script: directive resolves its path relative to the directory containing the Snakefile (keeping everything nice and tidy). Hence the problem of putting a stream processor under script:.
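With the direct-loading version, wiring it up through script: is straightforward, since the script reads its input and output paths from the snakemake object rather than from stdin/stdout. A sketch, assuming the script above is saved as scripts/fix_rg.py relative to the Snakefile (a hypothetical path):

```python
# Snakefile fragment (sketch; paths are assumptions)
rule fix_read_groups:
    input:
        "mapped/{sample}.bam"
    output:
        "fixed/{sample}.bam"
    script:
        # resolved relative to the Snakefile's directory
        "scripts/fix_rg.py"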