I wrote a simple script in Python that is supposed to scan a file line by line and match a few different regular expressions to reformat the data. It works something like this:
    with open(file) as f:
        for line in f:
            line = line.rstrip('\n')
            parseA(line, anOutPutFile) or parseB(line, anOutPutFile) or parseC(line, anOutPutFile) or parseD(line, anOutPutFile)
Each line can be one of the A, B, C, D lines or none (most of them match A, the second most common is B, etc.). Here is an example parseX function:
    def parseA(line, anOutPutFile):
        # raw strings keep the backslash escapes (\d, \w, \() intact
        regex = r'.*-' + bla + r' A ' + r'.*\((\d+)\) (\d+) (\w+) (\d+)@(.+) .* (.*)'  # (etc..)
        m = re.match(regex, line)
        if m:
            out = 'A' + ',' + m.group(1) + ',' + m.group(2) + ',' + ...  # etc
            anOutPutFile.write(out)
            return True
        else:
            return False
I was hoping that the short-circuiting of the 'or' operator would help, but the script is still incredibly slow on large files (for example, files of about 1 GB). Is there anything obvious and simple in it that is very inefficient and that I can start fixing? For example, would re.compile help (although the docs say that recently used regexes are cached, and I only have a handful)?
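To check the caching point in isolation, a minimal sketch along these lines could compare the two call styles. The pattern and the sample line are hypothetical and far simpler than the real ones, so the numbers it prints only indicate the relative cost of the cache lookup, not the cost of the real regexes.

```python
import re
import timeit

# Hypothetical pattern and line, much simpler than the real ones.
PATTERN = r'.* A .*\((\d+)\) (\d+) (\w+)'
LINE = 'host-x A job(12) 34 done'

compiled = re.compile(PATTERN)

def match_with_module(line):
    # re.match looks the pattern string up in the re module's internal cache.
    return re.match(PATTERN, line)

def match_with_compiled(line):
    # Reuses the compiled pattern object, so no cache lookup is needed.
    return compiled.match(line)

if __name__ == '__main__':
    for func in (match_with_module, match_with_compiled):
        seconds = timeit.timeit(lambda: func(LINE), number=100_000)
        print(func.__name__, round(seconds, 3))
```

If both timings come out close, precompiling alone is unlikely to explain the slowness, and the time is probably going into the matching itself.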
Thanks
BASED ON COMMENTS BELOW
I first changed the code to use join and then to use re.compile, and neither seems to have sped it up. It runs on a 50,000-line test file and takes about 93 seconds, give or take 1 second, which is the same as before. There are 5 regular expressions, each with anywhere from 8 to 12 groups. This is what I changed the code to:
    regexA = re.compile(r'.*0' + bla + r' A ' + r'.*\((\d+)\) (\d+) (\w+) (\d+)@(.+) .* (.*) .* .* foo=far fox=(.*) test .*')
    regexB = re.compile(...)  # similar
    regexC = re.compile(r'.*0' + bla + r' C ' + r'.*\((\d+)\) (\d+) (\w+) (\d+)@(.+) foo=(\d+) foo2=(\d+) foo3=(\d+)@(.+) (\w+) .* (.*) .* .* foo4=val foo5=(.*) val2 .*')
    regexD = re.compile(...)  # similar
    regexE = re.compile(...)  # similar
    # two of the regexes above are included in full to give an idea of what they look like
    # now this is an example of one of the parse funcs, for regexA
    def parseA(line, anOutputFile):
        m = regexA.match(line)
        if m:
            out = ''.join(['A', ',', m.group(1), ',', m.group(2), ',', ...])  # etc
            anOutputFile.write(out)
            return True
        else:
            return False
Perhaps the join with a list is not what you meant? Compiling the 5 regexes once at the top level did not help either.
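For anyone who wants to reproduce the measurement, here is a self-contained sketch of the same dispatch loop with a timer around it. The two patterns, the `make_parser` helper, and the sample lines are hypothetical stand-ins (the real regexes are much longer and depend on `bla`), and the newline after each record is an assumption.

```python
import io
import re
import time

# Hypothetical stand-ins for the real patterns, which are much longer.
REGEX_A = re.compile(r'.* A .*\((\d+)\) (\d+)')
REGEX_B = re.compile(r'.* B .*\((\d+)\) (\w+)')

def make_parser(tag, regex):
    """Build a parse function that writes 'tag,group1,group2,...' on a match."""
    def parse(line, out):
        m = regex.match(line)
        if not m:
            return False
        out.write(','.join((tag,) + m.groups()) + '\n')
        return True
    return parse

parse_a = make_parser('A', REGEX_A)
parse_b = make_parser('B', REGEX_B)

def process(lines, out):
    """Run the parsers in order; `or` stops at the first one that matches."""
    for line in lines:
        line = line.rstrip('\n')
        parse_a(line, out) or parse_b(line, out)

def timed_run(lines):
    """Return the generated output and the elapsed wall-clock seconds."""
    out = io.StringIO()
    start = time.perf_counter()
    process(lines, out)
    return out.getvalue(), time.perf_counter() - start

sample = ['x A job(1) 2\n', 'x B job(3) ok\n', 'no match here\n'] * 1000
output, seconds = timed_run(sample)
print(f'{len(sample)} lines in {seconds:.3f}s')
```

Swapping the real patterns in one at a time should show which of them dominates the 93 seconds, and whether lines that match nothing cost more than lines that match A.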