3

How do I improve the performance of this simple piece of Python code? Isn't `re.search` the best way to find a matching line, given that it is almost 6x slower than Perl, or am I doing something wrong?

#!/usr/bin/env python

import re
import time
import sys

i = 0
j = 0
time1 = time.time()
base_register = r'DramBaseAddress\d+'
for line in open('rndcfg.cfg'):
    i += 1
    if re.search(base_register, line):
        j += 1
time2 = time.time()

print(i, j)
print(time2 - time1)
print(sys.version)

This code takes about 0.96 seconds to complete (average of 10 runs).
Output:

168197 2688
0.8597519397735596
3.3.2 (default, Sep 24 2013, 15:14:17)
[GCC 4.1.1]

The following Perl code does it in 0.15 seconds.

#!/usr/bin/env perl
use strict;
use warnings;

use Time::HiRes qw(time);

my $i=0;my $j=0;
my $time1=time;
open(my $fp, 'rndcfg.cfg');
while(<$fp>)
{
    $i++;
    if(/DramBaseAddress\d+/)
    {
        $j++;
    }
}
close($fp);
my $time2=time;

printf("%d,%d\n",$i,$j);
printf("%f\n",$time2-$time1);
printf("%s\n",$]);


Output:

168197,2688
0.135579
5.012001

EDIT: Corrected the regular expression, which worsened performance slightly.

Jean
    You should pre-compile the regular expression if you use it several times, or in this case use `in` instead: `if base_register in line:`. – Klaus D. Jun 01 '15 at 16:51
  • 1
    If you read the note to [`re.compile`](https://docs.python.org/3/library/re.html#re.compile) you'll see: "The compiled versions of the most recent patterns passed to re.compile() and the module-level matching functions are cached, so programs that use only a few regular expressions at a time needn’t worry about compiling regular expressions." – Matthias Jun 01 '15 at 16:58

3 Answers

5

Actually, regular expressions are less efficient than string methods in Python. From https://docs.python.org/2/howto/regex.html#use-string-methods:

Strings have several methods for performing operations with fixed strings and they’re usually much faster, because the implementation is a single small C loop that’s been optimized for the purpose, instead of the large, more generalized regular expression engine.

Replacing `re.search` with `str.find` will give you a better runtime; otherwise, the `in` operator that others suggested is optimized as well.
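A rough micro-benchmark of the three checks (the sample line here is mine, not from the question, and absolute timings will vary by machine):

```python
import re
import timeit

line = "DramBaseAddress12 = 0x1000"
pattern = re.compile(r"DramBaseAddress\d+")

# All three checks agree on this data, since a digit always follows the key.
assert bool(pattern.search(line)) == ("DramBaseAddress" in line) \
       == (line.find("DramBaseAddress") > -1)

# Time each check; the substring-based ones use the small optimized C loop.
for stmt in ('pattern.search(line)',
             '"DramBaseAddress" in line',
             'line.find("DramBaseAddress")'):
    print(stmt, timeit.timeit(stmt, globals=globals(), number=100_000))
```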

As for the speed difference between the Python and Perl versions, I'll just chalk it up to the inherent quality of each language: text processing - python vs perl performance

oxymor0n
  • How can I use `str.find` for a regex ? – Jean Jun 02 '15 at 20:32
  • oh `str.find` is not regex at all; it's a simple substring matching. I just noticed the `\d+` part of your regex, so this might not be totally applicable, but if you only want to count the lines with `"DramBaseAddress"` in them (that is, if the trailing digits don't make a difference), then you can replace `if(re.search(base_register, line)):` with either `if "DramBaseAddress" in line:` or `if line.find("DramBaseAddress") > -1:`. If the trailing digits are important, however, then your best bet is @Veedrac's answer below. – oxymor0n Jun 02 '15 at 21:23
1

In this case you are using a fixed string, not a regular expression.

For regular strings there are faster methods:

>>> timeit.timeit('re.search(regexp, "banana")', setup="import re; regexp = r'nan'")
1.2156920433044434
>>> timeit.timeit('"banana".index("nan")')
0.23752403259277344
>>> timeit.timeit('"banana".find("nan")')
0.2411658763885498

Now this kind of text processing is the sweet spot of Perl (aka Practical Extraction and Reporting Language) (aka Pathological Eclectic Rubbish Lister) and has been optimized extensively over the years. All that collective focus adds up.

Peter Tillemans
1

The overhead of calling re.compile, despite the caching, is massive. Use

import re

j = 0
is_wanted_line = re.compile(r"DramBaseAddress\d+").search

for i, line in enumerate(open('rndcfg.cfg')):
    if is_wanted_line(line):
        j += 1

instead.

Further, you can do

import re

j = 0
key = "DramBaseAddress"
is_wanted_line = re.compile(r"DramBaseAddress\d+").search

for i, line in enumerate(open('rndcfg.cfg')):
    if key in line and is_wanted_line(line):
        j += 1

to further reduce overhead.
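As a self-contained sketch of that two-step check (the sample lines here are hypothetical stand-ins for rndcfg.cfg):

```python
import re

# Hypothetical sample lines standing in for rndcfg.cfg.
lines = [
    "DramBaseAddress0 = 0x1000",
    "SomeOtherRegister = 0x2000",
    "DramBaseAddress1 = 0x3000",
]

key = "DramBaseAddress"
is_wanted_line = re.compile(r"DramBaseAddress\d+").search

# The cheap substring test rejects most lines before the regex engine runs.
j = sum(1 for line in lines if key in line and is_wanted_line(line))
print(j)  # 2
```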

You can also consider doing your own buffering:

import re

i = j = 0
key = b"DramBaseAddress"
is_wanted_line = re.compile(rb"DramBaseAddress\d+").search

with open("rndcfg.cfg", "rb") as file:
    rest = b""

    for chunk in iter(lambda: file.read(32768), b""):
        i += chunk.count(b"\n")
        chunk, _, rest = (rest + chunk).rpartition(b"\n")

        if key in chunk and is_wanted_line(chunk):
            j += 1

    if key in rest and is_wanted_line(rest):
        j += 1

which removes the line-splitting and encoding overhead. (This isn't quite the same as it doesn't account for multiple instances per chunk. Such behaviour is relatively simple to add, but may not strictly be needed in your case.)
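If per-chunk multiplicity does matter, one way to add it (sketched here with an in-memory stand-in for the file) is to count matches with `findall` instead of a boolean test:

```python
import io
import re

pattern = re.compile(rb"DramBaseAddress\d+")

# In-memory stand-in for rndcfg.cfg; a tiny chunk size exercises the
# boundary handling.
data = (b"DramBaseAddress0 = 1\nfoo = 2\n"
        b"DramBaseAddress1 = 3\nDramBaseAddress2 = 4\n")
file = io.BytesIO(data)

i = j = 0
rest = b""
for chunk in iter(lambda: file.read(16), b""):
    i += chunk.count(b"\n")
    # Keep the trailing partial line in `rest` for the next iteration.
    chunk, _, rest = (rest + chunk).rpartition(b"\n")
    j += len(pattern.findall(chunk))  # count every match, not one per chunk
j += len(pattern.findall(rest))
print(i, j)  # 4 3
```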

This is a bit heavyweight, but three times as fast as the Perl, and 8x if you remove `i += chunk.count(b"\n")`!

Veedrac
  • do you have any test that validates the claim "The overhead of calling re.compile, despite the caching, is massive."? I use a lot of `re.compile` lol – oxymor0n Jun 02 '15 at 16:09
  • 1
    @oxymor0n I timed it. It was taking almost 2/3 of the time. Note that for one-off uses or searches of long strings, the overhead is likely to be amortized much better. – Veedrac Jun 02 '15 at 17:34
  • interesting, thanks. I always assume the caching performance would be better than that :/ – oxymor0n Jun 02 '15 at 17:39