I have seen threads about Python not accepting regular expressions with more than 100 groups. I have a similar but (I think?) different issue.
I have a python script that I wrote for work that I use to look for "interesting" things in large (10-100MB) complex log files. I have a list of strings where each string is used as a regular expression in a call to re.search. Each regular expression is run one-by-one on each line of logging. When my list has 100 or less regular expression strings, everything runs peachy. But when I add one more to the list, the script runs so slowly that it is useless.
I wrote a short script that simulates the problem. Can someone please help me determine why this is happening and offer some suggestions regarding what I can do about it?
Here is some example output from the script below. The logs I run my script on are rarely less than 250,000 lines long and are often 1-2 million lines.
$ ./regexTest.py 100 10000
Running regex search
Done regex search
Time Elapsed: 0.734515 seconds
$ ./regexTest.py 101 10000
Running regex search
Done regex search
Time Elapsed: 44.428968 seconds
Thank you all for your time!
#!/usr/bin/python
#------------------------------------------------------------------------------
# File: regexTest.py
# Description: Simulate the problem where trying to use a list of more than
# 100 regular expression strings slows the script to uselessness
# Example:
# ./regexTest.py 100 10000
# ./regexTest.py 101 10000
#------------------------------------------------------------------------------
from random import randint
import re
import sys
import time
def main():
if (len(sys.argv) != 3):
print "Usage: %s <NUM REGEXES> <LOG FILE LENGTH>" % sys.argv[0]
sys.exit(1)
numRegexes = int(sys.argv[1])
logFileLength = int(sys.argv[2])
regexes = []
# generate random regular expressions comprised of between 5 and 20
# lowercase letters
for i in range(numRegexes):
randomLetters = [chr(randint(97,120)) for x in range(randint(5,20))]
baseRegex = "".join(randomLetters)
regex = baseRegex + str(i)
regexes.append(regex)
# simulate the contents of a log file
data = []
for i in range(logFileLength):
randomLetters = [chr(randint(97,120)) for x in range(randint(20,100))]
logLine = "".join(randomLetters)
data.append(logLine)
print "Running regex search"
startTime = time.time()
for line in data:
for regex in regexes:
ret = re.search(regex, line)
if (ret is not None):
print "Regex found"
print "Done regex search"
endTime = time.time()
print "Time Elapsed: %f seconds" % (endTime - startTime)
if (__name__ == "__main__"):
main()