I have a gzipped data file containing a million lines:
$ zcat million_lines.txt.gz | head
1
2
3
4
5
6
7
8
9
10
...
My Perl script which processes this file is as follows:
# read_million.pl
use strict;

my $file = "million_lines.txt.gz";
open MILLION, "gzip -cdfq $file |";
while ( <MILLION> ) {
    chomp $_;
    if ( $_ eq "1000000" ) {
        print "This is the millionth line: Perl\n";
        last;
    }
}
In Python:
# read_million.py
import gzip

filename = 'million_lines.txt.gz'
fh = gzip.open(filename)
for line in fh:
    line = line.strip()
    if line == '1000000':
        print "This is the millionth line: Python"
        break
For whatever reason, the Python script takes about 8x longer:
$ time perl read_million.pl ; time python read_million.py
This is the millionth line: Perl
real 0m0.329s
user 0m0.165s
sys 0m0.019s
This is the millionth line: Python
real 0m2.663s
user 0m2.154s
sys 0m0.074s
I tried profiling both scripts, but there really isn't much code to profile. The Python script spends most of its time on for line in fh; the Perl script spends most of its time in if ($_ eq "1000000").
Now, I know that Perl and Python have some expected differences. For instance, in Perl I open the filehandle via a subprocess running the UNIX gzip command, whereas in Python I use the gzip library.
What can I do to speed up the Python implementation of this script (even if I never reach the Perl performance)? Perhaps the gzip module in Python is slow (or perhaps I'm using it in a bad way); is there a better solution?
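One idea I have not benchmarked yet (so it is not in the table at the end of this post) is wrapping the gzip file object in io.BufferedReader, on the theory that line iteration would then go through BufferedReader's buffering instead of repeated GzipFile.readline() calls. A rough, untimed sketch; the filename and the loop are the same as above, only the wrapping is new:

# read_million_buffered.py  (untimed sketch)
import gzip
import io

filename = 'million_lines.txt.gz'
# Wrap the GzipFile in io.BufferedReader so that line iteration is
# handled by BufferedReader's buffering rather than GzipFile.readline().
fh = io.BufferedReader(gzip.open(filename))
for line in fh:
    line = line.strip()
    if line == '1000000':
        print "This is the millionth line: Python (buffered)"
        break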
EDIT #1
Here's what the read_million.py line-by-line profiling looks like:
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     2                                           @profile
     3                                           def main():
     4
     5         1            1      1.0      0.0      filename = 'million_lines.txt.gz'
     6         1          472    472.0      0.0      fh = gzip.open(filename)
     7   1000000      5507042      5.5     84.3      for line in fh:
     8   1000000       582653      0.6      8.9          line = line.strip()
     9   1000000       443565      0.4      6.8          if line == '1000000':
    10         1           25     25.0      0.0              print "This is the millionth line: Python"
    11         1            0      0.0      0.0              break
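For reference, the report above is line_profiler output (hence the @profile decorator on main). Assuming that package is installed and the loop is wrapped in a @profile-decorated main() as shown, an invocation along these lines reproduces this kind of report:

$ pip install line_profiler
$ kernprof -l -v read_million.py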
EDIT #2
I have now also tried the subprocess Python module, as suggested by @Kirk Strauser and others. It is faster:
Python "subproc" solution:
# read_million_subproc.py
import subprocess

filename = 'million_lines.txt.gz'
gzip = subprocess.Popen(['gzip', '-cdfq', filename], stdout=subprocess.PIPE)
for line in gzip.stdout:
    line = line.strip()
    if line == '1000000':
        print "This is the millionth line: Python"
        break
gzip.wait()
Here is a comparative table of all the things I've tried so far:
method                     average_running_time (s)
---------------------------------------------------
read_million.py            2.708
read_million_subproc.py    0.850
read_million.pl            0.393