3

Let's say I have a function which searches for multiple patterns in a string using regexes:

import re
def get_patterns(string):
    """
    Takes a string and returns found groups
    of numeric and alphabetic characters.

    """
    re_digits = re.compile("(\d+)")
    re_alpha = re.compile("(?i)([A-Z]+)")
    digits = re_digits.findall(string)
    alpha = re_alpha.findall(string)
    return digits, alpha

get_patterns("99 bottles of beer on the wall")
(['99'], ['bottles', 'of', 'beer', 'on', 'the', 'wall'])

Now suppose this function is going to be called hundreds of thousands of times, and that it's not such a trivial example. Does it a) matter whether the regex compilation is being done within the function, i.e. is there an efficiency cost to calling the compile operation at each function call (or is it reused from cache?) and b) if there is, what would be a recommended approach for avoiding that overhead?

One method would be to pass the function a list of compiled regex objects:

re_digits = re.compile("(\d+)")
re_alpha = re.compile("(?i)([A-Z]+)")
def get_patterns(string, [re_digits, re_alpha]):

but I dislike how such an approach dissociates the regexes from the dependent function.

UPDATE: As per Jens' recommendation I've run a quick timing check and doing the compiling within the function's default arguments is indeed quite a bit (~30%) faster:

def get_patterns_defaults(string, 
                          re_digits=re.compile("(\d+)"), 
                          re_alpha=re.compile("(?i)([A-Z]+)")
                          ):
    """
    Takes a string and returns found groups
    of numeric and alphabetic characters.

    """
    digits = re_digits.findall(string)
    alpha = re_alpha.findall(string)
    return digits, alpha

from timeit import Timer
test_string = "99 bottles of beer on the wall"
t = Timer(lambda: get_patterns(test_string))
t2 = Timer(lambda: get_patterns_defaults(test_string))
print t.timeit(number=100000)  # compiled in function body
print t2.timeit(number=100000)  # compiled in args
0.629958152771
0.474529981613
glarue
  • 530
  • 7
  • 20
  • Did you mean to return a tuple `(digits, alpha)`? – Jens Jun 06 '15 at 23:49
  • @Jens would the desired return type affect the approach? I just gave a toy example, but if that matters I'd be curious to know! – glarue Jun 06 '15 at 23:57
  • a) off course, you must avoid to recompile several times your regex. b) I suggest you an other approach with a single pattern: `(?=[^\W_])(?:(\d+)|([A-Za-z]+))` and you use the group numbers (advantage: the string is parsed only once.) – Casimir et Hippolyte Jun 07 '15 at 00:04
  • 1
    @glarue To return multiple values from a function call, you have to stuff them into a single instance, a tuple, a list, a dictionary. See [here](http://stackoverflow.com/questions/354883/how-do-you-return-multiple-values-in-python). My gut says that a tuple is the most compact performance-wise. Having said that, your return statement implicitly creates a tuple anyway. – Jens Jun 07 '15 at 00:05
  • possible duplicate of [Compiling a regex inside a function that's called multiple times](http://stackoverflow.com/questions/3427329/compiling-a-regex-inside-a-function-thats-called-multiple-times) – glarue Jul 09 '15 at 14:30

3 Answers3

5

One solution is to use default arguments, so they are compiled only once:

import re
def get_patterns(string, re_digits=re.compile("(\d+)"), re_alpha=re.compile("(?i)([A-Z]+)")):
    """
    Takes a string and returns found groups
    of numeric and alphabetic characters.

    """
    digits = re_digits.findall(string)
    alpha = re_alpha.findall(string)
    return digits, alpha

Now you can call it:

get_patterns(string)
nicolas
  • 3,120
  • 2
  • 15
  • 17
  • 1
    That's an interesting way of having static values within a function. Thanks for the interesting trick. – bjfletcher Jun 06 '15 at 23:55
  • @seb great, I think this covers the bases as far as keeping the patterns associated with the function using them and avoiding globals. I also can't believe I didn't think of it. Thank you! – glarue Jun 06 '15 at 23:58
  • I'd think that the default values are being re-compiled every time the function itself has to be re-compiled, correct? – Jens Jun 07 '15 at 00:01
  • @Jens based on the timeit info I'm assuming they only need to be compiled once at the initial runtime compile, and are thereafter reused. – glarue Jun 07 '15 at 00:19
  • Default arguments are evaluated once, when the python interpreter reaches the line with `def`. This happens only once. The output is an object that is attached to the function object. – nicolas Jun 07 '15 at 00:20
  • You can even directly bind the method to a more descriptive name, for instance `find_all_digits=re.compile("(\d+)").findall`. – Aristide May 19 '20 at 18:18
1

You can use Python's timeit (or here and here) functionality to measure execution time.

If you want to avoid repeatedly compiling these regexes, try initializing them as globals:

import re

_re_digits = re.compile("(\d+)")
_re_alpha = re.compile("(?i)([A-Z]+)")

def get_patterns(string): 
    digits = _re_digits.findall(string)
    alpha = _re_alpha.findall(string)
    return (digits, alpha)
Community
  • 1
  • 1
Jens
  • 8,423
  • 9
  • 58
  • 78
1

Fun fact: you can set and get attributes on functions in Python, just like any other object. So another solution that avoids globals and only compiles the regexes once would be something like this:

def get_patterns(string):
    f = get_patterns
    return f.digits.findall(string), f.alpha.findall(string)

get_patterns.digits = re.compile("(\d+)")
get_patterns.alpha = re.compile("(?i)([A-Z]+)")

Another solution would be to use a closure:

def make_finder(*regexps):
    return lambda s: tuple(r.findall(s) for r in regexps)

get_patterns = make_finder(re.compile("(\d+)"), re.compile("(?i)([A-Z]+)"))
jangler
  • 949
  • 6
  • 7