10

I have a gzipped data file containing a million lines:

$ zcat million_lines.txt.gz | head
1
2
3
4
5
6
7
8
9
10
...

My Perl script which processes this file is as follows:

# read_million.pl
use strict; 

my $file = "million_lines.txt.gz" ;

open MILLION, "gzip -cdfq $file |";

while ( <MILLION> ) {
    chomp $_; 
    if ($_ eq "1000000" ) {
        print "This is the millionth line: Perl\n"; 
        last; 
    }
}

In Python:

# read_million.py
import gzip

filename = 'million_lines.txt.gz'

fh = gzip.open(filename)

for line in fh:
    line = line.strip()
    if line == '1000000':
        print "This is the millionth line: Python"
        break

For whatever reason, the Python script takes roughly 8x longer:

$ time perl read_million.pl ; time python read_million.py
This is the millionth line: Perl

real    0m0.329s
user    0m0.165s
sys     0m0.019s
This is the millionth line: Python

real    0m2.663s
user    0m2.154s
sys     0m0.074s

I tried profiling both scripts, but there really isn't much code to profile. The Python script spends most of its time on for line in fh; the Perl script spends most of its time in if($_ eq "1000000").

Now, I know that Perl and Python have some expected differences. For instance, in Perl, I open the filehandle via a subprocess running the UNIX gzip command; in Python, I use the gzip library.

What can I do to speed up the Python implementation of this script (even if I never reach the Perl performance)? Perhaps the gzip module in Python is slow (or perhaps I'm using it in a bad way); is there a better solution?

EDIT #1

Here's what the read_million.py line-by-line profiling looks like.

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     2                                           @profile
     3                                           def main():
     4
     5         1            1      1.0      0.0         filename = 'million_lines.txt.gz'
     6         1          472    472.0      0.0         fh = gzip.open(filename)
     7   1000000      5507042      5.5     84.3         for line in fh:
     8   1000000       582653      0.6      8.9                 line = line.strip()
     9   1000000       443565      0.4      6.8                 if line == '1000000':
    10         1           25     25.0      0.0                         print "This is the millionth line: Python"
    11         1            0      0.0      0.0                         break
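
This output comes from the line_profiler package (kernprof). A minimal sketch of how to reproduce it, assuming line_profiler is installed, run with kernprof -l -v read_million_profiled.py:

# read_million_profiled.py
import gzip

@profile                      # the @profile decorator is injected by kernprof -l
def main():
    filename = 'million_lines.txt.gz'
    fh = gzip.open(filename)
    for line in fh:
        line = line.strip()
        if line == '1000000':
            print "This is the millionth line: Python"
            break

main()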

EDIT #2:

I have now also tried the Python subprocess module, as suggested by @Kirk Strauser and others. It is faster:

Python "subproc" solution:

# read_million_subproc.py 
import subprocess

filename = 'million_lines.txt.gz'
gzip = subprocess.Popen(['gzip', '-cdfq', filename], stdout=subprocess.PIPE)
for line in gzip.stdout: 
    line = line.strip()
    if line == '1000000':
        print "This is the millionth line: Python"
        break
gzip.wait()

Here is a comparative table of all the things I've tried so far:

method                    average_running_time (s)
--------------------------------------------------
read_million.py           2.708
read_million_subproc.py   0.850
read_million.pl           0.393
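
The averages were taken over repeated runs; a hypothetical harness along these lines (not necessarily the exact method used) can collect them:

# time_all.py -- hypothetical timing harness; assumes all three scripts are in the current directory
import subprocess
import time

commands = [
    ['python', 'read_million.py'],
    ['python', 'read_million_subproc.py'],
    ['perl', 'read_million.pl'],
]

devnull = open('/dev/null', 'w')
runs = 5
for cmd in commands:
    total = 0.0
    for _ in range(runs):
        start = time.time()
        subprocess.call(cmd, stdout=devnull)   # discard script output; time the whole run
        total += time.time() - start
    print ' '.join(cmd), '->', round(total / runs, 3), 's'
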
asf107
    Did you try using a Perl gzip library, or an external gzip pipe in Python? – OrangeDog Apr 11 '16 at 21:40
  • IIRC, Python's `gzip` module is written in Python, so the performance is pretty bad. OrangeDog's suggestion of running `gzip` externally and piping the decompressed output to Python might speed things up. – user2357112 Apr 11 '16 at 21:44
  • 1
    Have a look at https://www.reddit.com/r/Python/comments/2olhrf/fast_gzip_in_python/ – Markus Apr 11 '16 at 21:47
  • 1
    wow. it is very counter-intuitive to me that "shelling out" to zcat is ideal... – asf107 Apr 11 '16 at 21:51
  • It shouldn't be. Now you have two IO-bound processes not waiting for the same IO: gzip is slurping in the file, and your Perl/Python program is reading and processing it. There's some communication overhead, but Unix is particularly good at shoving stuff through pipes rapidly. The loss there is more than made up by the extra concurrency. – Kirk Strauser Apr 11 '16 at 22:54
  • @asf107: I would always go for a library solution to get the best control over a process, and that is usually the most important criterion. But if a task is *large* and *simple*, and there exists a fast C-based utility that will do it, then invoke that utility. There is usually no need to also invoke a shell process, so *shelling out* to `gzip` isn't crazy like `qx/time/` – Borodin Apr 11 '16 at 23:06
  • Side-note: If you're just waiting for the millionth line, testing the actual contents of the line is rather silly. Just use `$.` in Perl to get the actual line number, or wrap the iterable in Python to do `for lineno, line in enumerate(fh):` so you get the actual line number directly. Or (possibly) better, `for line in itertools.islice(fh, 1000000):` which will terminate the loop after 1M lines without needing to perform a check and `break` in your code at all. – ShadowRanger Apr 12 '16 at 13:36
  • @ShadowRanger agreed. In reality, this is meant to emulate a task that parses and processes every line of a large, gzipped file... this simple example is meant to demonstrate my observation that python doesn't seem to process/parse these large files as well as perl. – asf107 Apr 12 '16 at 13:41
  • 1
    @asf107: The problem being that you weren't really using Perl in your first case. That said, Perl is _designed_ for text processing; if the task is nothing but consuming text and processing it, Perl is likely to win. Heck, in my experience, Perl is somewhat faster in general (though it loses the advantage when you use its hacked together OO features) because of mutable strings (Python is making copies when you `strip` or slice, Perl is mutating in place for `chomp`), and improved name lookups (Perl links up non-OO names at compile time; Python looks them up over and over at runtime). – ShadowRanger Apr 12 '16 at 13:47
  • @asf107: But "Perl is faster" is not much of an argument. If Python is easier to write, provides better standard library features, easier to maintain, etc. (not asserting that this is objectively true, but it's often argued), then Python still "wins"; programmer time usually costs more than CPU time after all. If "But X is faster" was an argument, no one would ever write code in any language higher level than C++. Python (excepting special numeric libraries and other extensions) is for development speed; run speed is secondary to ease of development. – ShadowRanger Apr 12 '16 at 13:49

5 Answers

7

Having tested a number of possibilities, it looks like the big culprits here are:

  1. Comparing apples to oranges: In your original test case, Perl wasn't doing the file I/O or decompression work, the gzip program was doing so (and it's written in C, so it runs pretty fast); in that version of the code, you're comparing parallel computation to serial computation.
  2. Interpreter startup time; on the vast majority of systems, Python takes substantially longer to begin running (I believe because more files are loaded at startup). Interpreter startup time on my machine is about half the total wall clock time, 30% of the user time, and most of the system time. The actual work done in Python is swamped by start up time, so your benchmark is as much about comparing startup time as comparing time required to do the work. Later addition: You can reduce the overhead from Python startup a bit further by invoking python with the -E switch (to disable checking of PYTHON* environment variables at startup) and the -S switch (to disable automatic import site, which avoids a lot of dynamic sys.path setup/manipulation involving disk I/O at the expense of cutting off access to any non-builtin libraries).
  3. Python's subprocess module is a bit higher level than Perl's open call, and is implemented in Python (on top of lower level primitives). The generalized subprocess code takes longer to load (exacerbating startup time issues) and adds overhead to the process launch itself.
  4. Python 2's subprocess defaults to unbuffered I/O, so you're performing more system calls unless you pass an explicit bufsize argument (4096 to 8192 seems to work fine).
  5. The line.strip() call involves more overhead than you might think; function and method calls are more expensive in Python than they really should be, and line.strip() does not mutate the str in place the way Perl's chomp does (because Python's str is immutable, while Perl strings are mutable). See the short illustration right after this list.
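
A quick illustration of point 5 (a minimal sketch with throwaway variable names):

line = '1000000\n'
stripped = line.strip()
print stripped is line    # False: strip() allocated a brand new str object
print repr(line)          # the original is untouched; a str cannot be mutated in place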

Here are a couple of versions of the code that bypass most of these problems. First, the optimized subprocess version:

#!/usr/bin/env python

import subprocess

# Launch with subprocess in list mode (no shell involved) and
# use a meaningful buffer size to minimize system calls
proc = subprocess.Popen(['gzip', '-cdfq', 'million_lines.txt.gz'], stdout=subprocess.PIPE, bufsize=4096)
# Iterate stdout directly
for line in proc.stdout:
    if line == '1000000\n':  # Avoid stripping
        print("This is the millionth line: Python")
        break
# Prevent deadlocks by terminating the child process instead of waiting for it to exit
proc.terminate()

Second, pure Python, mostly built-in (C level) API based code (which eliminates most extraneous startup overhead, and shows that Python's gzip module is not meaningfully slower than the gzip program), ridiculously micro-optimized at the expense of readability/maintainability/brevity/portability:

#!/usr/bin/env python

import os

rpipe, wpipe = os.pipe()

def reader():
    import gzip
    FILE = "million_lines.txt.gz"
    os.close(rpipe)
    with gzip.open(FILE) as inf, os.fdopen(wpipe, 'wb') as outf:
        buf = bytearray(16384)  # Reusable buffer to minimize allocator overhead
        while 1:
            cnt = inf.readinto(buf)
            if not cnt: break
            outf.write(buf[:cnt] if cnt != 16384 else buf)

pid = os.fork()
if not pid:
    try:
        reader()
    finally:
        os._exit(0)

try:
    os.close(wpipe)
    with os.fdopen(rpipe, 'rb') as f:
        for line in f:
            if line == b'1000000\n':
                print("This is the millionth line: Python")
                break
finally:
    os.kill(pid, 9)

On my local system, on the best of half a dozen runs, the subprocess code takes:

0.173s/0.157s/0.031s wall/user/sys time.

The primitives based Python code with no external utility programs gets that down to a best time of:

0.147s/0.103s/0.013s

(though that was an outlier; a good wall clock time was usually more like 0.165). Adding -E -S to the invocation shaves another 0.01-0.015s wall clock and user time by removing the overhead of setting up the import machinery to handle non-builtins; in other comments, you mention that your Python takes nearly 0.6 seconds to launch doing absolutely nothing (but otherwise seems to perform similarly to mine), which may indicate you've got quite a bit more in the way of non-default packages or environment customization going on, and -E -S may save you more.
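
For reference, the switches go directly on the interpreter invocation, e.g.:

$ time python -E -S read_million.py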

The Perl code, unmodified from what you gave me (aside from using 3+ arg open to remove string parsing and storing the pid returned from open to explicitly kill it before exiting) had a best time of:

0.183s/0.216s/0.005s

Regardless, we're talking about trivial differences (the timing jitter from run to run was around 0.025s for wall clock and user time, so Python's wins on wall clock time were mostly insignificant, though it did save on user time meaningfully). Python can win, as can Perl, but non-language related concerns are more important.

ShadowRanger
0

Were I a bettor, I'd wager that:

line = line.strip()

is the killer. It's doing a method lookup (that is, resolving line.strip), then calling it to create another object, then assigning the name line to the newly-created object.

Given that you know exactly what your data will look like, I'd see if changing your loop to this would make a difference:

for line in fh: 
    if line == '1000000\n':
        ...

I think I jumped the gun and answered too quickly. I believe you're right: Perl is "cheating" by running gzip in a separate process. Check out Asynchronously read stdout from subprocess.Popen for a way to do the same in Python. It might look like:

import subprocess

filename = 'million_lines.txt.gz'
gzip = subprocess.Popen(['gzip', '-cdfq', filename], stdout=subprocess.PIPE)
for line in iter(gzip.stdout.readline, ''): 
    line = line.strip()
    if line == '1000000':
        print "This is the millionth line: Python"
        break
gzip.wait()

And after you do, please report back. I'd like to see the results of this experiment!

Kirk Strauser
  • alternatively, outside of loop, `strip_without_lookup = str.strip`, and in loop: `line = strip_without_lookup(line)` – Łukasz Rogalski Apr 11 '16 at 21:39
  • I know, I thought so too! However, your proposed suggestion doesn't speed up the script significantly.... i'll post the python profiling info as an edit in my original post – asf107 Apr 11 '16 at 21:39
  • 1
    `for line in iter(gzip.stdout.readline, ''):` is a silly way to reinvent `for line in gzip.stdout:`... Also, I suspect you want to `terminate`/`kill` the process, not `wait` on it; you haven't consumed all of its `stdout` here, so it's going to block when the pipe fills, while you're blocking waiting for it to exit. – ShadowRanger Apr 12 '16 at 03:24
  • good point. i will make the suggested changes and update above – asf107 Apr 12 '16 at 13:15
  • @ShadowRanger Did you see J.F. Sebastian's rationale for using it? – Kirk Strauser Apr 12 '16 at 17:32
  • @KirkStrauser: I hadn't, but having read it, it looks like the main reason to use that approach is to avoid delays when reading from a pipe that is being populated intermittently; `readline` (on Python 2 only; Python 3 fixes the discrepancy) might allow you to get a line immediately where normal iteration might wait until the buffer fills before finding and returning the line. In this case, `gzip` is feeding the pipe faster than we can read and process it anyway, so that isn't a concern. – ShadowRanger Apr 12 '16 at 17:50
  • @ShadowRanger I agree that your conclusion is very likely (although the whole game changes if that file happens to live on a network drive). However, I don't think it's "silly" to process a pipe that way for safety reasons. It costs almost nothing and buys some nice reassurance. – Kirk Strauser Apr 12 '16 at 20:07
  • @KirkStrauser: Nothing changes in the network case; `gzip` is reading the file, not Python, so Python's caching behavior is mostly irrelevant (unless the network is so slow or compression so poor `gzip` can't feed Python fast enough). The silliness is in extra verbosity, obscurity (I know two-arg `iter`, but people who don't will do a double-take), and slowdown (on my local Py2, direct iteration takes around 75-85% less time to run; on Py3, about 40-60% less depending on binary vs. text mode; Py3 suffering only from repeated function call overhead, while Py2 pays the poor buffering costs too). – ShadowRanger Apr 13 '16 at 15:04
0

You made me curious...

The following Python script consistently outperforms the Perl solution on my machine: 3.2s vs. 3.6s for 10,000,000 lines (elapsed real time, as given by three runs of time).

import subprocess

filename = 'millions.txt.gz'
gzip = subprocess.Popen(
    ['gzip', '-cdfq', filename],
    bufsize = -1, stdout = subprocess.PIPE)

for line in gzip.stdout:
    if line[:-1] == '10000000':
        print "This is the 10 millionth line: Python"
        break

gzip.wait()

Interestingly, when looking at the time spent in user mode, the Perl solution is slightly better than the Python solution. This seems to indicate that the interprocess communication of the Python solution is more efficient than that of the Perl solution.

Markus
0

This one is faster than the Perl version, but it assumes the line ending is '\n':

import subprocess

filename = "million_lines.txt.gz"
gzip = subprocess.Popen(['gzip', '-cdfq', filename], stdout=subprocess.PIPE)
for line in gzip.stdout:
    if line == '1000000\n':
        print "This is the millionth line: Python"
        break
gzip.terminate()

Tests

$ time python Test.py 
This is the millionth line: Python

real    0m0.191s
user    0m0.264s
sys     0m0.016s

$ time perl Test.pl 
This is the millionth line: Perl

real    0m0.404s
user    0m0.488s
sys     0m0.008s
totoro
  • 1
    If you pass `universal_newlines=True` as an extra argument to `subprocess.Popen`, then line ending conversions are performed for you (so you can in fact guarantee that the line ending is `'\n'`). This also makes the code portable to Python 3 (where `Popen` returns `bytes`, not `str`, unless you pass `universal_newlines=True`). Alternatively, declare `needle = '1000000' + os.linesep` outside the loop, and test `line == needle` in the loop to match OS line ending expectations. – ShadowRanger Apr 12 '16 at 13:31
  • it is interesting that the above code *still* runs slower on my machine. – asf107 Apr 12 '16 at 13:59
  • 1
    @asf107: At a certain point, most of what you're timing is the overhead of launching the interpreter & performing the actual decompression, not the work done. On my machine, when launched repeatedly with a script that does nothing but print a blank line (and in Python's case, import `subprocess` without using it), both programs `hash`-ed, Perl is just much faster to start. `time` reports `time perl noop.pl` takes about 0.01s/0.002s/0.005s for real/user/sys. `time python noop.py` takes about 0.110s/0.035s/0.040s. The Python noop times are 50%, 25% & 90% of the `read_million` times. – ShadowRanger Apr 12 '16 at 16:11
  • 1
    @asf107: Cont. When the overhead of launching the interpreter at all is a significant fraction of the time spent doing the work, your benchmark is flawed; in a "real" program, startup overhead (as long as it's not noticeable at a human timescale) rarely matters; you'd need to compare the time to do the work alone, or do enough work to swamp launch overhead (and ideally do the work many times in a single session, taking the shortest time to minimize timing jitter). The work here is just too small to matter. – ShadowRanger Apr 12 '16 at 16:14
  • wow.... my average python start-up time is nearly 0.600s! Yikes.... thanks for pointing this out. – asf107 Apr 12 '16 at 18:29
  • @asf107: Given your interpreter seems to otherwise perform similarly to mine, this sounds like you may have some environment customizations or a ton of third party packages installed that add overhead to the launch process. This could easily slow Python startup and make your results non-reproducible on other systems. I added a note to [my answer](http://stackoverflow.com/a/36580221/364696) on this; you might try running your benchmarking code with `-E -S` to disable the extras; it won't make the benchmark _good_, but it will at least be "less bad". – ShadowRanger Apr 13 '16 at 15:32
0

It looks like the next() method of the gzip file object, as used by for line in, is very slow - presumably because it's cautiously reading the uncompressed stream looking for line breaks, perhaps in order to keep memory usage under control.

Of course, you're comparing apples to oranges and other people have already made better comparisons between Python forking gunzip and Perl forking gunzip. These presumably work well because they're dumping relatively large uncompressed strings to their stdout in a separate process.

A non-memory-safe (it loads the whole decompressed file at once) and potentially wasteful approach is:

import gzip

filename = 'million_lines.txt.gz'

fh = gzip.open(filename)

whole_file = fh.read()
for line in whole_file.splitlines():
    if line == "1000000":
        print "This is the millionth line: Python"
        break

This reads the entire uncompressed file into memory and then splits it into lines.

Results:

$ time python test201604121.py
This is the millionth line: Python

real    0m0.183s
user    0m0.133s
sys    0m0.046s


$ time perl test201604121.pl

This is the millionth line: Perl

real    0m0.192s
user    0m0.167s
sys    0m0.027s
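
For .gz files too large to decompress fully into memory (a concern raised in the comments), a chunked variant along these lines - a sketch only, not benchmarked here - keeps memory bounded while still avoiding per-line iteration overhead:

# read_million_chunked.py -- sketch; assumes Python 2 and '\n' line endings
import gzip

filename = 'million_lines.txt.gz'

fh = gzip.open(filename)
remainder = ''
found = False
while not found:
    chunk = fh.read(1 << 20)          # ~1 MiB of uncompressed data per read
    if not chunk:
        break
    lines = (remainder + chunk).split('\n')
    remainder = lines.pop()           # keep any trailing partial line for the next chunk
    for line in lines:
        if line == '1000000':
            print "This is the millionth line: Python"
            found = True
            break
fh.close()
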
Alastair McCormack
  • thanks! i figured that this could be sped up a bit by reading the entire file contents into memory; in reality, I usually work with extremely large .gz files where this isn't possible. – asf107 Apr 12 '16 at 16:58