Let's say I have a function which searches for multiple patterns in a string using regexes:
import re

def get_patterns(string):
    """
    Takes a string and returns found groups
    of numeric and alphabetic characters.
    """
    re_digits = re.compile(r"(\d+)")
    re_alpha = re.compile(r"(?i)([A-Z]+)")
    digits = re_digits.findall(string)
    alpha = re_alpha.findall(string)
    return digits, alpha
get_patterns("99 bottles of beer on the wall")
(['99'], ['bottles', 'of', 'beer', 'on', 'the', 'wall'])
Now suppose this function is going to be called hundreds of thousands of times, and that it's not such a trivial example. a) Does it matter that the regex compilation happens inside the function, i.e. is there an efficiency cost to calling re.compile on every call (or is the compiled pattern reused from a cache)? And b) if there is a cost, what would be a recommended approach for avoiding that overhead?
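For what it's worth, I believe CPython's re module keeps an internal cache of compiled patterns, so a repeated re.compile should be much cheaper than a full recompile; the remaining per-call cost would be the cache lookup plus the function-call overhead. A quick way to observe this (it relies on a CPython implementation detail, not a documented guarantee):

import re

# CPython caches compiled patterns keyed on (pattern type, pattern,
# flags), so the second compile() below returns the cached object.
a = re.compile(r"(\d+)")
b = re.compile(r"(\d+)")
print(a is b)  # True on CPython: the second call hit the cache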
One method would be to pass the function a list of compiled regex objects:
re_digits = re.compile(r"(\d+)")
re_alpha = re.compile(r"(?i)([A-Z]+)")

def get_patterns(string, regexes):
    re_digits, re_alpha = regexes
    return re_digits.findall(string), re_alpha.findall(string)
but I dislike how such an approach dissociates the regexes from the function that depends on them.
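Another option I've considered, just as a sketch, is a factory function that compiles the patterns once and returns a closure over them, which at least keeps the regexes next to the code that uses them:

import re

def make_get_patterns():
    # Compile once, when the factory runs; the inner function keeps
    # the compiled objects alive as closure variables.
    re_digits = re.compile(r"(\d+)")
    re_alpha = re.compile(r"(?i)([A-Z]+)")

    def get_patterns(string):
        return re_digits.findall(string), re_alpha.findall(string)

    return get_patterns

get_patterns_closure = make_get_patterns()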
UPDATE: As per Jens' recommendation I've run a quick timing check, and compiling the regexes in the function's default arguments is indeed quite a bit (~30%) faster:
def get_patterns_defaults(string,
                          re_digits=re.compile(r"(\d+)"),
                          re_alpha=re.compile(r"(?i)([A-Z]+)")):
    """
    Takes a string and returns found groups
    of numeric and alphabetic characters.
    """
    digits = re_digits.findall(string)
    alpha = re_alpha.findall(string)
    return digits, alpha
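The behaviour is unchanged, and the usual caveat about mutable default arguments shouldn't bite here, since the compiled pattern objects are never mutated:

>>> get_patterns_defaults("99 bottles of beer on the wall")
(['99'], ['bottles', 'of', 'beer', 'on', 'the', 'wall'])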
from timeit import Timer

test_string = "99 bottles of beer on the wall"
t = Timer(lambda: get_patterns(test_string))
t2 = Timer(lambda: get_patterns_defaults(test_string))
print(t.timeit(number=100000))   # compiled in function body
print(t2.timeit(number=100000))  # compiled in default args
0.629958152771
0.474529981613
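If I understand correctly, the win comes from the defaults being evaluated once, at definition time, and then being read as fast local variables on each call, while the original version pays for a global lookup of re plus the re.compile call (even when it only hits re's cache) on every invocation. Disassembling both functions makes the difference visible (exact output varies by Python version):

import dis

# The in-body version loads the global `re` and calls compile() on
# every invocation; the defaults version only does LOAD_FAST on the
# already-compiled pattern objects.
dis.dis(get_patterns)
dis.dis(get_patterns_defaults)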