
Let's say I have the following command running from the shell:

{ 
samtools view -HS header.sam;           # command1
samtools view input.bam 1:1-50000000;   # command2
} | samtools view -bS - > output.bam    # command3

For those of you who aren't familiar with samtools view (since this is Stack Overflow): what this essentially does is create a new BAM file with a new header. BAM files are typically large compressed files, so even a single pass through the file can be time consuming. One alternative approach would be to run command2 and then use samtools reheader to switch the header, but that passes through the large file twice. The command above passes through the BAM only once, which is good for larger BAM files (they can grow larger than 20GB even when compressed - WGS).

My question is: how do I implement commands of this type in Python using subprocess?

I have the following:

import subprocess

fh_bam = open('output.bam', 'w')
params_0 = [ "samtools", "view", "-HS", "header.sam" ]
params_1 = [ "samtools", "view", "input.bam", "1:1-50000000"]
params_2 = [ "samtools", "view", "-bS", "-" ]
sub_0 = subprocess.Popen(params_0, stderr=subprocess.PIPE, stdout=subprocess.PIPE)
sub_1 = subprocess.Popen(params_1, stderr=subprocess.PIPE, stdout=subprocess.PIPE)
### SOMEHOW APPEND sub_1.stdout to sub_0.stdout
sub_2 = subprocess.Popen(params_2, stdin=appended.stdout, stdout=fh_bam)

Any help is greatly appreciated. Thanks.

  • what is `fh_bam`? Why not just get the output from both and use it in the command for your last process? – Padraic Cunningham Feb 27 '15 at 21:57
  • The file handler of the output bam. Well, the first two commands are essentially reading a portion of a file and putting that into stdout. So, 'getting the output' is already available in the files. The only difference is the second command is actually grabbing a portion of the file. And that particular file is compressed, so appending files isn't as simple. – Marco Albuquerque Feb 27 '15 at 21:59
  • do you want the output from the second call or both? – Padraic Cunningham Feb 27 '15 at 22:01
  • I essentially want to merge the output of the first two commands (both of which convert the compressed file into text) but, instead of writing to disc, it directly pipes to the next command which compresses the file before writing. – Marco Albuquerque Feb 27 '15 at 22:04

4 Answers

4

If you already have the shell command as a string then you can just run it as is:

#!/usr/bin/env python
from subprocess import check_call

check_call(r"""
{ 
samtools view -HS header.sam;           # command1
samtools view input.bam 1:1-50000000;   # command2
} | samtools view -bS - > output.bam    # command3
""", shell=True)

To emulate the pipeline in Python:

#!/usr/bin/env python
from subprocess import Popen, PIPE

# start command3 to get stdin pipe, redirect output to the file
with open('output.bam', 'wb', 0) as output_file:
    command3 = Popen("samtools view -bS -".split(), 
                     stdin=PIPE, stdout=output_file)
# start command1 with its stdout redirected to command3 stdin
command1 = Popen('samtools view -HS header.sam'.split(),
                 stdout=command3.stdin)
rc_command1 = command1.wait() #NOTE: command3.stdin is not closed, no SIGPIPE or a write error if command3 dies
# start command2 after command1 finishes
command2 = Popen('samtools view input.bam 1:1-50000000'.split(),
                 stdout=command3.stdin)
command3.stdin.close() # inform command2 if command3 dies (SIGPIPE or a write error)
rc_command2 = command2.wait()
rc_command3 = command3.wait()
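For readers who want to try the join-then-pipe pattern without samtools installed, the same structure can be sketched with generic stand-in commands (`echo` for the two sources, `cat` for the sink; `joined.txt` is just a hypothetical output name):

```python
#!/usr/bin/env python
# Same pattern as above with generic commands: two `echo` processes feed
# one `cat` sink that writes to a file, header first, then body.
from subprocess import Popen, PIPE

with open('joined.txt', 'wb', 0) as output_file:
    sink = Popen(['cat'], stdin=PIPE, stdout=output_file)

first = Popen(['echo', 'header'], stdout=sink.stdin)
first.wait()                 # make sure the "header" is fully written first
second = Popen(['echo', 'body'], stdout=sink.stdin)
sink.stdin.close()           # close the parent's copy so the sink can see EOF
second.wait()
sink.wait()
```

Because each source's stdout is duplicated into the sink's stdin, the sink only sees EOF once both sources have exited and the parent has closed its own copy of the pipe end.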
jfs
1

(I can't comment sadly, but this 'answer' is a comment on cmidi's answer; if somebody can move it, it would be appreciated! -- PS: that answer has since been deleted...)

Marco explicitly said that the commands produce a lot of output, around 20GB. If you use communicate() it will wait for the process to terminate, which means the descriptor will need to hold that huge amount of data. In practice the OS will flush the data to disk in the meantime, unless your computer has more than 20GB of free RAM. So you end up writing the intermediate data to disk, which is what the original author wanted to avoid. +1 for sirlark's answer!
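The point is easy to verify on a small scale: communicate() slurps the child's entire stdout into a single in-memory bytes object before returning, so with 20GB of output the parent would hold (or swap) 20GB. A minimal sketch, using a Python child process as the output producer:

```python
# communicate() buffers the whole of the child's stdout in memory.
# Here the child emits 1 MB; with 20GB the same code would need 20GB of RAM.
import sys
from subprocess import Popen, PIPE

child = Popen([sys.executable, '-c',
               "import sys; sys.stdout.write('x' * 1000000)"],
              stdout=PIPE)
out, _ = child.communicate()
print(len(out))   # the entire output now lives in one bytes object
```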

Ariel
  • Yeah, good point, especially about the pipe blocking too. You got there first – sirlark Feb 27 '15 at 23:11
  • @Ariel: FYI, [sirlark's answer won't work](http://stackoverflow.com/questions/28774716/how-to-join-the-stdout-of-two-subprocesses-and-pipe-to-stdin-of-new-subprocess-i/28779982#comment45837128_28775359) – jfs Feb 28 '15 at 09:15
0

I assume that concatenating the outputs from the first two subprocesses in memory is not feasible due to the size of the files involved. I'd suggest wrapping the outputs of the first two subprocesses in a file-like object. It looks like you'll only need the read method, since popen will only ever read from its stdin file-like, never seek or write. The code below assumes that returning an empty string from read is sufficient to indicate the stream is at EOF:

class concat(object):
    def __init__(self, f1, f2):
        self.f1 = f1
        self.f2 = f2

    def read(self, *args):
        ret = self.f1.read(*args)
        if ret == '':
            ret = self.f2.read(*args)
        return ret

import subprocess

fh_bam = open('output.bam', 'w')
params_0 = [ "samtools", "view", "-HS", "header.sam" ]
params_1 = [ "samtools", "view", "input.bam", "1:1-50000000"]
params_2 = [ "samtools", "view", "-bS", "-" ]
sub_0 = subprocess.Popen(params_0, stderr=subprocess.PIPE, stdout=subprocess.PIPE)
sub_1 = subprocess.Popen(params_1, stderr=subprocess.PIPE, stdout=subprocess.PIPE)
### Somehow Append sub_1.stdout to sub_0.stdout
sub_2 = subprocess.Popen(params_2, stdin=concat(sub_0.stdout, sub_1.stdout), stdout=fh_bam)

To clarify, f1.read will block, and only returns '' when the pipe is closed/at EOF. concat.read only tries to read from f2 after this happens, so output from f1 and f2 won't be interleaved. There is of course a slight overhead of repeatedly reading the exhausted end of f1, which could be averted by setting a flag variable to indicate which file to read from. I doubt it would improve performance significantly, though.
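Although (as the comments below note) Popen won't actually accept this wrapper as stdin, the sequential-read logic itself can be exercised with in-memory text streams, which makes the EOF-sentinel behaviour easy to see:

```python
# Exercising the concat wrapper's read logic with io.StringIO stand-ins
# (Popen itself requires a real file descriptor, so this is only a demo
# of the sequential-read behaviour, not of piping into a subprocess).
import io

class concat(object):
    def __init__(self, f1, f2):
        self.f1 = f1
        self.f2 = f2

    def read(self, *args):
        ret = self.f1.read(*args)
        if ret == '':       # f1 exhausted: fall through to f2
            ret = self.f2.read(*args)
        return ret

stream = concat(io.StringIO('header\n'), io.StringIO('body\n'))
chunks = []
while True:
    chunk = stream.read(4)   # small reads, like a real consumer would issue
    if chunk == '':
        break
    chunks.append(chunk)
print(''.join(chunks))       # header first, then body, never interleaved
```

Note the '' sentinel works for text streams; for binary pipes the check would need to be against b''.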

sirlark
  • Could you explain what read(self, *args) is doing? I don't understand why you check if ret == ''. – Marco Albuquerque Feb 27 '15 at 22:45
  • There's no guarantee how popen will call the read method of the stdin pipe file object you give it. `read` can take a size argument limiting what is read, which is exactly what you would want, otherwise everything would get read into memory. But this means that the first process' stdout may not be completely read with one call to read. So we read from f1's stdout until it's empty, and only then move on to f2's stdout... – sirlark Feb 27 '15 at 22:50
  • downvote. Popen's stdin does *not* accept a file-like object. It needs a real `.fileno()` (a real file, pipe or a socket on some systems). There are also other issues. – jfs Feb 28 '15 at 09:10
  • Yeah, @J.F.Sebastian is right :( I whipped this up on my tablet assuming it would work, without testing. The docs say Popen accepts file-likes, but testing reveals the read/write methods are never called; a fileno method is required. Presumably os.read() and os.write() are being used instead. I've posted another answer using a pipe instead. – sirlark Feb 28 '15 at 12:09
  • @sirlark: `os.read()` is not used by `subprocess` module. The pipe is redirected using something like `os.dup2(, 0)`. The child may read from the input however it likes e.g., stdio's `getchar()`. – jfs Mar 02 '15 at 09:04
-1

While Popen accepts file-like objects, it actually uses the underlying file handles/descriptors, not the read and write methods of the file objects to communicate, as @J.F. Sebastian rightly points out. A better way to do this is to use a pipe (os.pipe()) which doesn't use the disk. This allows you to connect the output stream directly to the input stream of another process, which is exactly what you want. The problem is then just a matter of serialisation, to make sure the two source streams don't interleave.

import os
import subprocess

r, w = os.pipe()

fh_bam = open('output.bam', 'w')
params_0 = [ "samtools", "view", "-HS", "header.sam" ]
params_1 = [ "samtools", "view", "input.bam", "1:1-50000000"]
params_2 = [ "samtools", "view", "-bS", "-" ]
sub_sink = subprocess.Popen(params_2, stdin=r, stdout=fh_bam, bufsize=4096)
sub_src1 = subprocess.Popen(params_0, stderr=subprocess.PIPE, stdout=w, bufsize=4096)
sub_src1.communicate()
sub_src2 = subprocess.Popen(params_1, stderr=subprocess.PIPE, stdout=w, bufsize=4096)
sub_src2.communicate()

We open the sink (the reader of the pipe) first and then communicate with the source processes only to avoid potential blocking as mentioned by @Ariel. This also forces the first source process to complete and flush its output over the pipe, before the second source process gets a chance to write to the pipe, preventing interleaved/clobbered output. You can play with the bufsize value to tweak performance.

This is pretty much exactly what the shell command is doing.
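Taking jfs's critique in the comment below into account, a corrected sketch of the same pipe approach would close the parent's unused pipe ends after handing them to the children and use wait() rather than communicate(). Generic `echo`/`cat` commands stand in for the samtools calls, and `output.txt` is a hypothetical file name:

```python
# Pipe-based fan-in with the parent's pipe ends closed after spawning,
# so the sink sees EOF as soon as both sources have finished.
import os
import subprocess

r, w = os.pipe()

with open('output.txt', 'wb') as out:
    sink = subprocess.Popen(['cat'], stdin=r, stdout=out)
    os.close(r)                      # sink holds its own copy of the read end

    src1 = subprocess.Popen(['echo', 'header'], stdout=w)
    src1.wait()                      # serialise: header before body
    src2 = subprocess.Popen(['echo', 'body'], stdout=w)
    os.close(w)                      # parent's write end closed -> EOF is possible
    src2.wait()
    sink.wait()
```

Waiting on the first source before starting the second keeps the two outputs from interleaving, just as in the shell version.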

sirlark
  • the explanation is misleading. (1) `bufsize` has no effect on the performance in this case. In principle, (due to `stderr=PIPE`) `bufsize` may affect how `stderr` is read (though it does not in this case because `.communicate()` on posix does not use `stderr.read()` -- it uses `select` instead) . `bufsize` has no effect on stdin, stdout because they are not assigned to PIPE. (2) If you drop `stderr=PIPE` then `.communicate()` call is unnecessary `.wait()` can be used as in my answer. (3) You should close unused pipe ends in the parent (after you've passed them to the subprocesses) – jfs Feb 28 '15 at 12:36