7

I have a huge pipeline written in Python that works with very large .gz files (~14GB compressed), but I need a better way to send certain lines to an external program (formatdb from blast-legacy/2.2.26). I have a Perl script someone wrote for me a long time ago that does this extremely fast, but I need to do the same thing in Python, since the rest of the pipeline is written in Python and I have to keep it that way. The Perl script uses two file handles: one reads the output of zcat on the .gz file, and the other pipes the lines the software needs (2 of every 4) into formatdb as its input. It involves bioinformatics, but no domain experience is needed. The file is in FASTQ format and the software needs FASTA. Every 4 lines is a FASTQ record: take the 1st and 2nd lines, replace the leading '@' on the 1st line with '>', and that is the FASTA equivalent formatdb will use for each record.

The Perl script is as follows:

#!/usr/bin/perl 
my $SRG = $ARGV[0]; # reads.fastq.gz

open($fh, sprintf("zcat %s |", $SRG)) or die "Broken gunzip $!\n";

# -i: input -n: db name -p: program 
open ($fh2, "| formatdb -i stdin -n $SRG -p F") or die "no piping formatdb!, $!\n";

#Fastq => Fasta sub
my $localcounter = 0;
while (my $line = <$fh>){
        if ($. % 4==1){
                print $fh2 "\>" . substr($line, 1);
                $localcounter++;
        }
        elsif ($localcounter == 1){
                print $fh2 "$line";
                $localcounter = 0;
        }
        else{
        }
}
close $fh;
close $fh2;
exit;

It works really well. How could I do this same thing in Python? I like how Perl can use those file handles, but I'm not sure how to do that in Python without creating an actual file. All I can think of is to gzip.open the file and write the two lines of every record I need to a new file, then use that with formatdb, but it is way too slow. Any ideas? I need to work it into the Python pipeline, so I can't just rely on the Perl script, and I would also like to know how to do this in general. I assume I need to use some form of the subprocess module.

Here is my Python code, but again it is way too slow, and speed is the issue here (HUGE FILES):

#!/usr/bin/env python

import gzip
from Bio import SeqIO # can recognize fasta/fastq records
import subprocess as sp
import os,sys

filename = sys.argv[1] # reads.fastq.gz

tempFile = filename + ".temp.fasta"

outFile = open(tempFile, "w")

handle = gzip.open(filename, "r")
# parses each fastq record
# r.id and r.seq come from the 1st and 2nd lines of each record
for r in SeqIO.parse(handle, "fastq"):
    outFile.write(">" + str(r.id) + "\n")
    outFile.write(str(r.seq) + "\n")

handle.close()
outFile.close()

cmd = 'formatdb -i ' + str(tempFile) + ' -n ' + filename + ' -p F'
sp.call(cmd, shell=True)

cmd = 'rm ' + tempFile
sp.call(cmd, shell=True)
Julian Egger
  • You want every second line from the file written to a temp file? – Padraic Cunningham May 05 '15 at 21:14
  • Optimizing your new program is really a separate question from getting a file handle for a subprocess; you should probably create it as a separate question. But if you read the `subprocess` docs (or my answer), it should be pretty clear how you can pipe data in to `formatdb -i stdin` instead of writing it to a file first and then running `formatdb` on the file. – abarnert May 05 '15 at 21:21

2 Answers

12

First, there's a much better solution in both Perl and Python: just use a gzip library. In Python, there's one in the stdlib; in Perl, you can find one on CPAN. For example:

with gzip.open(path, 'rt', encoding='utf-8') as f:
    for line in f:
        do_stuff(line)

Much simpler, and more efficient, and more portable, than shelling out to zcat.


But if you really do want to launch a subprocess and control its pipes in Python, you can do it with the subprocess module. And, unlike Perl, Python can do this without having to stick a shell in the middle. There's even a nice section in the docs on Replacing Older Functions with the subprocess Module that gives you recipes.

So:

zcat = subprocess.Popen(['zcat', path], stdout=subprocess.PIPE)

Now, zcat.stdout is a file-like object, with the usual read methods and so on, wrapping the pipe to the zcat subprocess.

So, for example, to read a binary file 8K at a time in Python 3.x:

zcat = subprocess.Popen(['zcat', path], stdout=subprocess.PIPE)
for chunk in iter(functools.partial(zcat.stdout.read, 8192), b''):
    do_stuff(chunk)
zcat.wait()

(If you want to do this in Python 2.x, or read a text file one line at a time instead of a binary file 8K at a time, or whatever, the changes are the same as they'd be for any other file-handling code.)
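For instance, a line-at-a-time variant might look like this (a sketch only: path and do_stuff are the same placeholders as above, and it assumes the stream is plain ASCII, as FASTQ data normally is):

import subprocess

zcat = subprocess.Popen(['zcat', path], stdout=subprocess.PIPE)
for raw in zcat.stdout:  # iterating over a pipe yields bytes lines
    do_stuff(raw.decode('ascii').rstrip('\n'))
zcat.wait()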

abarnert
  • @PadraicCunningham [`partial`](https://docs.python.org/3/library/functools.html#functools.partial): `functools.partial(zcat.read, 8192)` is effectively the same as `lambda: zcat.read(8192)`, but possibly more efficient and definitely more introspectable. – abarnert May 05 '15 at 20:59
  • @abarnert, I thought it was; I just did not see any import. Do you need iter? – Padraic Cunningham May 05 '15 at 21:01
  • I am not really sure if I necessarily "want" to launch a subprocess, I just need it to go faster than how I am doing it with Python, although I thought subprocess might be an option. I know that Perl script above works really nice as I have used it plenty of times, so it must be doing something right. The gzip.open and parse seems too slow, but maybe it just needs tweaking? Still not sure how I can then take the two lines I need, make a slight alteration, and use it as the stdin for the software as if it was just its own file (without creating a file). – Julian Egger May 05 '15 at 21:05
  • @PadraicCunningham: Well, I didn't include the `import subprocess` either. This isn't meant to be a complete program, just enough to show the OP how to do what he wants. Meanwhile, of course you don't really _need_ `iter` (although you do need `.stdout.` in there, which I missed… fixed…), you could just do a `while True:`, `buf = zcat.stdout.read(8192)`, `if not buf: break`, etc., but why be more verbose? Especially since the point isn't the particular code, it's just that any file-reading code works on `zcat.stdout`, just as any perl file-reading code works on his `$fh`. – abarnert May 05 '15 at 21:16
  • @JulianEgger: You can pass a pipe into the stdin of another process, and then `p2.stdin.write`. Or you can just pass a file-like object as the stdin. Or you can [pipe two processes together without getting in the way](https://docs.python.org/3/library/subprocess.html#replacing-shell-pipeline). There's really nothing complicated here; see the sketch after these comments. – abarnert May 05 '15 at 21:17
  • @JulianEgger: And as for performance, we obviously can't help you optimize your code if we can't see your code, but in my experience, using the `gzip` module in Python is actually slightly _faster_ than using `zcat`, not slower. Of course if you do something silly like try to `read()` or `readlines()` the whole file into memory, that's going to give you 20 minutes of swap hell, which isn't going to do your performance any favors, but just don't do that. – abarnert May 05 '15 at 21:19
  • @abarnert, nvm, I just interpreted what you were doing incorrectly. – Padraic Cunningham May 05 '15 at 21:44
  • In Python 3.7, the `encoding` argument does not work for me. I apply it to the line object as `line.decode('utf-8')` instead. – aerijman Nov 06 '20 at 16:20
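Putting the comments above together for the question's use case, a minimal sketch of the whole no-temp-file pipeline might look like the following. It assumes formatdb really does accept -i stdin with the same flags as in the Perl script, uses Python 3 text-mode handles, and is an illustration of the approach rather than a tested drop-in replacement:

#!/usr/bin/env python
import gzip
import subprocess
import sys

filename = sys.argv[1]  # reads.fastq.gz

# Start formatdb reading FASTA records from its standard input, mirroring the
# Perl script's "| formatdb -i stdin -n $SRG -p F" pipe.
formatdb = subprocess.Popen(
    ['formatdb', '-i', 'stdin', '-n', filename, '-p', 'F'],
    stdin=subprocess.PIPE, universal_newlines=True)

with gzip.open(filename, 'rt') as f:
    for i, line in enumerate(f):
        if i % 4 == 0:      # FASTQ header '@id ...' becomes '>id ...'
            formatdb.stdin.write('>' + line[1:])
        elif i % 4 == 1:    # sequence line is copied through unchanged
            formatdb.stdin.write(line)
        # the '+' separator and quality lines (3rd and 4th per record) are skipped

formatdb.stdin.close()      # send EOF so formatdb can finish up
formatdb.wait()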
0

You can parse the whole file and load it as a list of lines using this function:

import gzip

def convert_gz_to_list_of_lines(filepath):
    """Parse a gz file and return its contents as a list of lines."""
    file_as_list = []
    with gzip.open(filepath, 'rt', encoding='utf-8') as f:
        try:
            for line in f:
                file_as_list.append(line)
        except EOFError:
            # truncated archive: keep whatever was read before the error
            pass
    return file_as_list
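For example, with the question's filename (note that this reads the entire decompressed file into memory, which may not be practical for files as large as the ones in the question):

lines = convert_gz_to_list_of_lines("reads.fastq.gz")
print(len(lines) // 4, "FASTQ records loaded")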
Fermata