I have a large pipeline written in Python that works with very large .gz files (~14 GB compressed), and I need a faster way to send certain lines to an external program (formatdb from blast-legacy/2.2.26). A Perl script someone wrote for me a long time ago does this extremely fast, but I need to do the same thing in Python, since the rest of the pipeline is written in Python and has to stay that way. The Perl script opens two file handles: one reading from a zcat of the .gz file, the other piping the lines the program needs (2 of every 4) into the external software as its input. The task involves bioinformatics, but no domain experience is needed: the file is in fastq format and the software needs it in fasta format. Every 4 lines is one fastq record; take the 1st and 3rd lines, prepend '>' to the 1st, and that is the fasta equivalent that formatdb will use for each record.
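To make that conversion rule concrete, here is a minimal illustration in Python (the read id and sequence are made up):

```python
# One fastq record is 4 lines; its fasta equivalent keeps lines 1 and 3,
# with the leading '@' of line 1 replaced by '>'.
fastq_record = ["@read1 some description\n",  # line 1: header
                "ACGTACGT\n",                 # line 2: sequence
                "+\n",                        # line 3: separator (dropped)
                "IIIIIIII\n"]                 # line 4: quality (dropped)

fasta_record = ">" + fastq_record[0][1:] + fastq_record[1]
print(fasta_record)
# >read1 some description
# ACGTACGT
```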
The Perl script is as follows:
#!/usr/bin/perl
my $SRG = $ARGV[0]; # reads.fastq.gz
open($fh, sprintf("zcat %s |", $SRG)) or die "Broken gunzip $!\n";
# -i: input -n: db name -p: program
open ($fh2, "| formatdb -i stdin -n $SRG -p F") or die "no piping formatdb!, $!\n";
#Fastq => Fasta sub
my $localcounter = 0;
while (my $line = <$fh>){
    if ($. % 4 == 1){
        print $fh2 ">" . substr($line, 1);
        $localcounter++;
    }
    elsif ($localcounter == 1){
        print $fh2 $line;
        $localcounter = 0;
    }
}
close $fh;
close $fh2;
exit;
It works really well. How could I do this same thing in Python? I like how Perl can use those file handles, but I'm not sure how to do that in Python without creating an actual file. All I can think of is to gzip.open the file, write the two lines of every record I need to a new file, and use that file with formatdb, but that is way too slow. Any ideas? I need to work this into the Python pipeline, so I can't just rely on the Perl script, and I would also like to know how to do this kind of thing in general. I assume I need some form of the subprocess module.
Here is my Python code, but again it is way too slow, and speed is the issue here (HUGE FILES):
#!/usr/bin/env python
import gzip
from Bio import SeqIO  # can recognize fasta/fastq records
import subprocess as sp
import os, sys

filename = sys.argv[1]  # reads.fastq.gz
tempFile = filename + ".temp.fasta"

# parse each fastq record; r.id and r.seq come from the
# 1st and 3rd lines of each record
with gzip.open(filename, "rt") as handle, open(tempFile, "w") as outFile:
    for r in SeqIO.parse(handle, "fastq"):
        outFile.write(">" + str(r.id) + "\n")
        outFile.write(str(r.seq) + "\n")

sp.call(["formatdb", "-i", tempFile, "-n", filename, "-p", "F"])
os.remove(tempFile)