
I have seen threads about Python not accepting regular expressions with more than 100 groups. I have a similar but (I think?) different issue.

I have a python script that I wrote for work that I use to look for "interesting" things in large (10-100MB) complex log files. I have a list of strings, each of which is used as a regular expression in a call to re.search. Each regular expression is run one-by-one on each line of logging. When my list has 100 or fewer regular expression strings, everything runs peachy. But as soon as I add one more to the list, the script runs so slowly that it is useless.

I wrote a short script that simulates the problem. Can someone please help me determine why this is happening and offer some suggestions regarding what I can do about it?

Here is some example output from the script below. The logs I run my script on are rarely less than 250,000 lines long and are often 1-2 million lines.

$ ./regexTest.py 100 10000
Running regex search
Done regex search
Time Elapsed: 0.734515 seconds

$ ./regexTest.py 101 10000
Running regex search
Done regex search
Time Elapsed: 44.428968 seconds

Thank you all for your time!


#!/usr/bin/python
#------------------------------------------------------------------------------
# File: regexTest.py
# Description: Simulate the problem where trying to use a list of more than
#              100 regular expression strings slows the script to uselessness
# Example:
#   ./regexTest.py 100 10000
#   ./regexTest.py 101 10000
#------------------------------------------------------------------------------

from random import randint
import re
import sys
import time

def main():
    if (len(sys.argv) != 3):
        print "Usage: %s <NUM REGEXES> <LOG FILE LENGTH>" % sys.argv[0]
        sys.exit(1)

    numRegexes    = int(sys.argv[1])
    logFileLength = int(sys.argv[2])

    regexes = []

    # generate random regular expressions made up of 5 to 20
    # random lowercase letters
    for i in range(numRegexes):
        randomLetters = [chr(randint(97,120)) for x in range(randint(5,20))]
        baseRegex     = "".join(randomLetters)
        regex         = baseRegex + str(i)

        regexes.append(regex)

    # simulate the contents of a log file
    data = []
    for i in range(logFileLength):
        randomLetters = [chr(randint(97,120)) for x in range(randint(20,100))]
        logLine       = "".join(randomLetters)

        data.append(logLine)

    print "Running regex search"
    startTime = time.time()

    for line in data:
        for regex in regexes:
            ret = re.search(regex, line)
            if (ret is not None):
                print "Regex found"

    print "Done regex search"
    endTime = time.time()

    print "Time Elapsed: %f seconds" % (endTime - startTime)

if (__name__ == "__main__"):
    main()
1 Answer

You are using regexes inefficiently: you should compile them. The re module keeps a cache of compiled patterns for callers who pass raw strings to functions like re.search; the size of that cache is available as re._MAXCACHE, and it is 100 on Python 2.7 and 512 on Python 3.4. Once the cache is full and yet another pattern needs to be compiled, all of the cached compiled patterns are thrown away, so with more than re._MAXCACHE distinct patterns in play, every re.search call ends up recompiling its pattern from scratch. If your program uses more than re._MAXCACHE regexes, you should precompile them with re.compile:

regexes = [ re.compile(i) for i in regexes ]
for line in data:
    for regex in regexes:
        ret = regex.search(line)
        if ret is not None:
            print("Regex found")
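To see the cache effect in isolation, here is a minimal timing sketch (the pattern strings and iteration counts are arbitrary, and re._MAXCACHE is an undocumented implementation detail, so treat this as illustrative only): it builds one more distinct pattern than the cache can hold, so cycling through them in order misses the cache on every search, then compares raw re.search against precompiled patterns.

```python
import re
import time

# One more distinct pattern than the cache holds, so cycling through
# them in order misses the cache (and recompiles) on every search.
patterns = ["pattern%d" % i for i in range(re._MAXCACHE + 1)]
line = "x" * 80

start = time.time()
for _ in range(50):
    for p in patterns:
        re.search(p, line)        # cache miss: recompiles the pattern
uncompiled = time.time() - start

compiled = [re.compile(p) for p in patterns]
start = time.time()
for _ in range(50):
    for c in compiled:
        c.search(line)            # reuses the compiled pattern object
precompiled = time.time() - start

print("uncompiled: %.3fs  precompiled: %.3fs" % (uncompiled, precompiled))
```

On a cache-evicting workload like this, the precompiled loop should be faster by an order of magnitude or more, because the searches themselves are cheap next to the repeated compilation.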

Furthermore, if you do not actually need to know which pattern matched, you can combine them all into a single regex by joining them with the '|' alternation operator:

megaregex = re.compile('|'.join(regexes))
for line in data:
    ret = megaregex.search(line)
    if ret is not None:
        print("Regex found")
  • The nature of my script is that each regex maps a convoluted log message into something human-readable. Thus, each regex is associated with a string. Otherwise, I would try to "OR" them all together. Thank you for the idea about compiling them. I wasn't aware that there was a cache limitation so I never understood why anyone would compile the regular expressions since it added extra seemingly-unnecessary work. – retsigam May 28 '14 at 17:09
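If each pattern must map back to its own human-readable description, as in the comment above, the alternatives can still be combined: wrap each one in its own capturing group and use match.lastindex to recover which alternative fired. A sketch with made-up patterns and descriptions (it assumes the individual patterns contain no capturing groups of their own; rewrite any internal groups as (?:...) if they do):

```python
import re

# Hypothetical mapping from raw patterns to human-readable descriptions.
descriptions = {
    r"ERR\d+ disk": "disk error",
    r"WARN\d+ net": "network warning",
    r"timeout=\d+ms": "timeout report",
}

patterns = list(descriptions)
# Wrap each alternative in a capturing group so we can tell which one matched.
combined = re.compile("|".join("(%s)" % p for p in patterns))

for line in ["2014-05-28 ERR042 disk failure", "all quiet", "timeout=500ms on host7"]:
    m = combined.search(line)
    if m is not None:
        # m.lastindex is the 1-based number of the group that matched.
        print(descriptions[patterns[m.lastindex - 1]])
```

This keeps the single-compilation win from the answer while preserving the per-pattern lookup: the first line prints "disk error", the second prints nothing, and the third prints "timeout report".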