
I wrote a simple script in Python which is supposed to scan a file line by line and match a couple of different regular expressions to reformat the data. It works something like this:

with open(file) as f:
    for line in f:
        line = line.rstrip('\n')
        parseA(line, anOutPutFile) or parseB(line, anOutPutFile) or parseC(line, anOutPutFile) or parseD(line, anOutPutFile)

Each line can be one of the A, B, C, D types or none of them (most lines match A, the second most common is B, and so on). Here is an example parseX function:

def parseA(line, anOutPutFile):
    regex = '.*-' + bla + ' A ' + '.*\((\d+)\) (\d+) (\w+) (\d+)@(.+) .* (.*)' #(etc..)
    m = re.match(regex, line)
    if m:
        out = 'A' + ',' + m.group(1) + ',' + m.group(2) + ',' + ... #etc
        anOutPutFile.write(out)
        return True
    else:
        return False

I was hoping that the short-circuiting of the 'or' operator would help, but the script is still incredibly slow on large files (for example, files of around 1 GB). Is there anything obvious and simple in it that is very inefficient and that I could start amending? For example, re.compile (though the docs say that recently used regexes are cached, and I only have a handful)?

Thanks

BASED ON COMMENTS BELOW

I changed the code first to use join and then to use re.compile, and neither seems to have sped it up. It's running on a test file of 50,000 lines and taking about 93 seconds, give or take a second, which is also what it was taking before on this test file. I have anywhere from 8 to 12 groups in each regular expression and there are 5 of them. This is what I changed the code to:

regexA = re.compile('.*0' + bla + ' A ' + '.*\((\d+)\) (\d+) (\w+) (\d+)@(.+) .* (.*) .* .* foo=far fox=(.*) test .*')
regexB = re.compile(#similar)
regexC = re.compile('.*0' + bla + ' C ' + '.*\((\d+)\) (\d+) (\w+) (\d+)@(.+) foo=(\d+) foo2=(\d+) foo3=(\d+)@(.+) (\w+) .* (.*) .* .* foo4=val foo5=(.*) val2 .*')
regexD = re.compile(#similar)
regexE = re.compile(#similar)

#two of the regexes above are included in full to give an idea of what they look like
#and here is an example of one of the parse functions, for regexA

def parseA(line,anOutputFile):
    m = regexA.match(line)
    if m:
        out = ''.join(['A',',',m.group(1),',',m.group(2),',',#etc])
        anOutputFile.write(out)
        return True
    else:
        return False

Perhaps the join with the list is not what you meant? And compiling the 5 regexes once at the top level did not help either.

Palace Chan
  • have you tried compiling your regex first for parseX? See beginning of section 7.2.2 (http://docs.python.org/2/library/re.html) – prgao Sep 25 '13 at 20:23
  • I have not, because it said recently used ones are cached and I only have a handful, but I'm trying that anyway to make sure. – Palace Chan Sep 25 '13 at 20:24
  • It might be a lot faster to just, e.g., `mmap` the file and then `re.finditer` all the A matches, then all the B matches, and so on, instead of calling `re.match` 100M times. It would be a little trickier if you need to interleave the lines in the correct order, but still doable. – abarnert Sep 25 '13 at 20:27
  • Trying to optimize code without profiling it is usually pointless, but one thing leaps out at me - don't use direct string concatenation to reconstruct your modified output line. Use `''.join` or string interpolation instead. That's one of the better known performance quirks of Python - string concatenation builds an intermediary string for each `+`. – Peter DeGlopper Sep 25 '13 at 20:27
  • Yeah, I usually profile, but I figured some things that stand out like the above might make a big enough difference. Thanks, I'll try that as well – Palace Chan Sep 25 '13 at 20:28
  • @PeterDeGlopper: That _was_ one of the better-known performance quirks of the CPython implementation before 2.6. But most people are using 2.6 or later nowadays. And for a handful of short strings, it's hardly likely to matter. So, I think that advice is both out of date and a red herring. Of course `join` will usually be faster, and we won't actually know if it's a bottleneck until the OP profiles, but I wouldn't count on this. – abarnert Sep 25 '13 at 20:31
  • Another possibility: instead of 4 regexps, use a single regexp that's an alternation between the 4, and distinguish which one matched by grouping. (On top of possibly being faster on its own, it would also solve the ordering problem with `mmap`+`finditer`, if it's a problem.) A sketch combining the two ideas follows these comments. – abarnert Sep 25 '13 at 20:33
  • @abarnert - true, I don't have massively high hopes for that, but I am not confident that the OP's number of strings is low. That `#etc` comment speaks implied volumes to me. Profiling should turn up whether it matters at all - the python.org `test code` claims roughly a 20% speedup using `join`, which might be helpful here. Maybe. – Peter DeGlopper Sep 25 '13 at 20:46
  • https://wiki.python.org/moin/ConcatenationTestCode for the test code I'm referring to. You're right that I learned that back on earlier versions where it mattered a lot more. – Peter DeGlopper Sep 25 '13 at 20:52
  • @PeterDeGlopper: By "large number" we're talking many dozens. And if `#etc` stands for a slew of explicit `+ ', ' + m.group(147)` clauses all on one line without any copy-pasta bugs that make performance irrelevant because the code isn't even correct, I'll eat my hat. – abarnert Sep 25 '13 at 20:53
  • @PeterDeGlopper: Yeah, that's on 2.4, where string concatenation was quadratic rather than linear. Which is irrelevant if you're using CPython 2.6 or later. You're right that `join` is better, but if only because ', '.join([m.group(i) for i in range(214)]) is actually readable (and visibly correct)… – abarnert Sep 25 '13 at 20:54
  • I made the changes (join and re.compile) both separately and together and neither seems to have sped me up much. Edits are above. For reference, I am using python 2.6.4 – Palace Chan Sep 25 '13 at 20:59
  • Your regex itself may be slow. Using lots of greedy placeholders in sequence (e.g., `.* .* foo .*`) can cause considerable backtracking. You might try changing your quantifiers to nongreedy `.*?` and `.+?`. – BrenBarn Sep 25 '13 at 21:06
  • can you share at least a piece of the file for proper testing? – planestepper Sep 25 '13 at 21:13
  • Wow, BrenBarn, I think the change to the consecutive greedy quantifiers made a massive difference! It just ran in about 1 second! I find it hard to believe. – Palace Chan Sep 25 '13 at 21:14
  • Without knowing your data I can't say if this will work for you. If you need to use the greedy quantifiers, you might not need to use them on every line, or for every possible match. Sometimes you can speed things up by processing in stages. Then you can bypass the heavy lifting if it's not necessary. – James Robinson Sep 25 '13 at 21:31
  • I was able to get rid of the greedy quantifiers and be more specific about them (replacing, for example, .* with \d{3}:\d{2}-\d{7} or something like that); there is a sketch illustrating the difference just after these comments. Unfortunately these files are on a remote server and I can't seem to scp them locally – Palace Chan Sep 25 '13 at 21:33
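
To make BrenBarn's suggestion and the eventual fix concrete, here is a small self-contained sketch. The sample line, the "0bla" tag and both patterns are invented for illustration and only loosely mimic the shapes above, but they show why a pattern with several consecutive greedy .* groups is expensive on exactly the lines that do not match it (which, with five patterns, is most lines), and why pinning the fields down with more specific subpatterns fails fast instead:

import re
import timeit

# Hypothetical line: the tag, fields and trailing tokens are all made up.
line = "12:34:56 host-0bla A (42) 7 word 9@queue " + "x " * 10 + "end"

# Several greedy wildcards in a row, as in the original pattern.  Since
# "foo=far" is absent, the engine tries many ways of splitting the tail
# between the .* groups before it can give up (heavy backtracking).
greedy = re.compile(r'.*0bla A .*\((\d+)\) (\d+) (\w+) (\d+)@(.+) '
                    r'.* (.*) .* .* foo=far fox=(.*) test .*')

# The same fields pinned down with more specific subpatterns, similar to
# what ended up working: a non-matching line is rejected almost at once.
specific = re.compile(r'\S+ \S+ A \((\d+)\) (\d+) (\w+) (\d+)@(\S+) '
                      r'(?:\S+ )+foo=far fox=(\S+) test')

t_greedy = timeit.timeit(lambda: greedy.match(line), number=100)
t_specific = timeit.timeit(lambda: specific.match(line), number=100)
print("stacked .* : %f seconds" % t_greedy)
print("specific   : %f seconds" % t_specific)

The exact numbers depend on the machine and the line, but the first timing is typically orders of magnitude larger than the second.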
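
And here is a rough sketch of abarnert's idea of mmap-ing the file and letting a single alternation regex with named groups walk it in one pass with finditer (which also keeps the matches in their original line order). The file names, the "0tag" marker and the two alternatives' field layouts are placeholders, not the real formats; the pattern is built from bytes literals so it can be matched against the memory map:

import mmap
import re

# One combined pattern: each alternative gets its own named groups so we
# can tell which kind of line matched.  The layouts are placeholders.
combined = re.compile(
    br'(?m)^.*?0tag '
    br'(?:A \((?P<a1>\d+)\) (?P<a2>\d+)'
    br'|B \((?P<b1>\d+)\) (?P<b2>\w+))'
)

with open('input.log', 'rb') as f:
    with open('out.csv', 'wb') as out:
        buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            # finditer scans the whole mapped file front to back, so the
            # matches come out in line order without a per-line loop.
            for m in combined.finditer(buf):
                if m.group('a1') is not None:
                    out.write(b'A,' + m.group('a1') + b',' + m.group('a2') + b'\n')
                else:
                    out.write(b'B,' + m.group('b1') + b',' + m.group('b2') + b'\n')
        finally:
            buf.close()

Whether this beats the line-by-line loop depends on the data; what it mainly removes is the per-line Python overhead of calling several match functions on every line.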

2 Answers


Have you thought about moving the file writing out of the functions and simplifying them, since they all seem to do the same thing?

A parse function could take the line, a compiled regex and a label as arguments:

def parse(line, regex, label):
    m = regex.match(line)
    if m:
        # m.groups() is a tuple, so convert it before concatenating
        return ','.join([label] + list(m.groups()))

Then the file writing can happen in the main loop:

with open(file) as f:
    for line in f:
        line = line.rstrip('\n')
        outline = parse(line, regexA, 'A') or parse(line, regexB, 'B') or parse(line, regexC, 'C') or parse(line, regexD, 'D')
        if outline:
            anOutPutFile.write(outline + '\n')  # add back the newline stripped from the input line
Graeme Stuart

Having all your parseA, parseB calls or'd together without an if will still execute each one of them. Put an if at the beginning of the line and then just pass if necessary, to prevent executing the next parseX function when the previous one succeeds, or create a tree of ifs.
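
For reference, here is a minimal sketch of the structure described, reusing the names from the question. (Note that Python's or already short-circuits, so the chained calls in the original loop also stop at the first parse function that returns True; this rewrite mainly changes how the code reads.)

with open(file) as f:
    for line in f:
        line = line.rstrip('\n')
        if parseA(line, anOutPutFile):
            pass
        elif parseB(line, anOutPutFile):
            pass
        elif parseC(line, anOutPutFile):
            pass
        elif parseD(line, anOutPutFile):
            pass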

Ricardo Villamil