For a Snakemake workflow, I need to manipulate tags in many BAM files, and would like to process these by piping them through a script (using the Snakemake script: directive). The specific way I'm doing this is with pysam stream processing.
import pysam

# read BAM from stdin, write BAM to stdout
infile = pysam.AlignmentFile("-", "rb")
outfile = pysam.AlignmentFile("-", "wb", template=infile)
for s in infile:
    # derive a read-group id from the first two fields of the query name
    (flowcell, lane) = s.query_name.split(':')[0:2]
    rg_id = ".".join([flowcell, lane])
    s.set_tag('RG', rg_id, 'Z')
    outfile.write(s)
This script works well standalone, but I haven't been able to figure out how to integrate it via the Snakemake script: directive. I'd prefer the streaming approach because it minimizes I/O and RAM usage.
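As a workaround, the streaming script can be driven from a shell: directive instead, with the pipes set up explicitly. A minimal sketch, assuming the script above is saved as scripts/fix_rg.py (a hypothetical path) and reads are piped through it:

```python
# Snakefile fragment (sketch; paths are assumptions)
rule fix_read_groups:
    input:
        "mapped/{sample}.bam"
    output:
        "fixed/{sample}.bam"
    shell:
        # stream the BAM through the stdin/stdout script
        "python scripts/fix_rg.py < {input} > {output}"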
Edit: I resorted to opening the BAM files directly (rather than streaming) to fix the RG tag.
import pysam

# parameters passed from snakemake
bam_file = snakemake.input[0]
fixed_bam_file = snakemake.output[0]

bamfile = pysam.AlignmentFile(bam_file, "rb")
fixed_bamfile = pysam.AlignmentFile(fixed_bam_file, "wb", template=bamfile)

# note: fetch() requires a BAM index; pass until_eof=True to iterate without one
for i, read in enumerate(bamfile.fetch()):
    (flowcell, lane) = read.query_name.split(':')[0:2]
    rg_id = ".".join([flowcell, lane])
    read.set_tag('RG', rg_id, 'Z')
    fixed_bamfile.write(read)
    if not (i % 100000):
        print("Updated the read group for {} reads to {}".format(i, rg_id))

bamfile.close()
fixed_bamfile.close()
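To make the tag derivation concrete, here is the same split-and-join logic applied to a hypothetical Illumina-style query name (the instrument/run values are made up for illustration):

```python
# Sketch of the read-group id derivation used above.
# "HWI-ST1276:71:..." is a hypothetical query name, not from the real data.
query_name = "HWI-ST1276:71:C1162ACXX:1:1101:1208:2458"

# first two colon-separated fields become the read-group id
flowcell, lane = query_name.split(":")[0:2]
rg_id = ".".join([flowcell, lane])
print(rg_id)  # HWI-ST1276.71
```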
EDIT: Snakemake's run: and shell: directives execute relative to the workdir: directory, while the script: directive resolves its path relative to the directory containing the Snakefile (keeping everything nice and tidy). Hence the problem of putting a stream processor under script:.
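With the direct-loading version, wiring it up through script: is straightforward, since the script reads its input and output paths from the snakemake object rather than from stdin/stdout. A sketch, assuming the script above is saved as scripts/fix_rg.py relative to the Snakefile (a hypothetical path):

```python
# Snakefile fragment (sketch; paths are assumptions)
rule fix_read_groups:
    input:
        "mapped/{sample}.bam"
    output:
        "fixed/{sample}.bam"
    script:
        # resolved relative to the Snakefile's directory
        "scripts/fix_rg.py"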