
I have a function that returns True if a string matches at least one regular expression in a list and False otherwise. The function is called often enough that performance is an issue.

When running it through cProfile, the function is spending about 65% of its time doing matches and 35% of its time iterating over the list.

I would think there would be a way to use `map()` or something, but I can't think of a way to have it stop iterating after it finds a match.

Is there a way to make the function faster while still having it return upon finding the first match?

def matches_pattern(str, patterns):
    for pattern in patterns:
        if pattern.match(str):
            return True
    return False
puffenstuff

4 Answers


The first thing that comes to mind is pushing the loop to the C side by using a generator expression:

def matches_pattern(s, patterns):
    return any(p.match(s) for p in patterns)

You probably don't even need a separate function for that.
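Inline, with hypothetical patterns, that would look something like this (note that `any` short-circuits, so iteration stops at the first matching pattern, just like the original loop):

```python
import re

# Hypothetical pre-compiled pattern list; any compiled patterns work the same way.
compiled = [re.compile(r"ab.*"), re.compile(r"\d+")]

if any(p.match("abc123") for p in compiled):
    print("matched")  # "ab.*" matches, so this prints
```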

Another thing worth trying is to build a single, composite regex using the `|` alternation operator, so that the engine has a chance to optimize it for you. You can also create the regex dynamically from a list of string patterns, if necessary:

def matches_pattern(s, patterns):
    return re.match('|'.join('(?:%s)' % p for p in patterns), s)

Of course you need to have your regexes in string form for that to work. Just profile both of these and check which one is faster :)

You might also want to have a look at a general tip for debugging regular expressions in Python. This can also help you find opportunities to optimize.
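One concrete tool along those lines (my suggestion, not necessarily what the linked tip covers): compiling with the `re.DEBUG` flag prints the engine's parse tree, which shows how a combined alternation is actually interpreted:

```python
import re

# re.DEBUG dumps the compiled parse tree to stdout; the BRANCH nodes
# show how the alternation is structured.
re.compile("(?:abc)|(?:a.c)", re.DEBUG)
```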

UPDATE: I was curious and wrote a little benchmark:

import timeit

setup = """
import re
patterns = [".*abc", "123.*", "ab.*", "foo.*bar", "11010.*", "1[^o]*"]*10
strings = ["asdabc", "123awd2", "abasdae23", "fooasdabar", "111", "11010100101", "xxxx", "eeeeee", "dddddddddddddd", "ffffff"]*10
compiled_patterns = list(map(re.compile, patterns))

def matches_pattern(str, patterns):
    for pattern in patterns:
        if pattern.match(str):
            return True
    return False

def test0():
    for s in strings:
        matches_pattern(s, compiled_patterns)

def test1():
    for s in strings:
        any(p.match(s) for p in compiled_patterns)

def test2():
    for s in strings:
        re.match('|'.join('(?:%s)' % p for p in patterns), s)

def test3():
    r = re.compile('|'.join('(?:%s)' % p for p in patterns))
    for s in strings:
        r.match(s)
"""

print(timeit.timeit("test0()", setup=setup, number=1000))
print(timeit.timeit("test1()", setup=setup, number=1000))
print(timeit.timeit("test2()", setup=setup, number=1000))
print(timeit.timeit("test3()", setup=setup, number=1000))

The output on my machine:

1.4120500087738037
1.662621021270752
4.729579925537109
0.1489570140838623

So `any` doesn't seem to be faster than your original approach. Building up the regex dynamically also isn't really fast. But if you can manage to build the regex upfront and use it several times, this gives much better performance. You can also adapt this benchmark to test some other options :)

Niklas B.
  • `'|'.join(patterns)` is not safe. Regex is a complex language. You want to wrap each term in `(?:...)` first to be safe. – Karl Knechtel May 14 '12 at 21:44
  • @Karl: Very true, thanks for pointing it out. It's still not completely safe if there are back-references involved, though, I think. That has to be considered on a case-by-case basis. – Niklas B. May 14 '12 at 21:45
  • Thanks a bunch. I will try out building the regex up front. Also I can't believe I forgot about `any`. – puffenstuff May 15 '12 at 01:08
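A small illustration of the back-reference caveat from the comments (hypothetical patterns): once patterns are joined, capture groups are renumbered across the whole combined regex, so a `\1` in a later pattern ends up pointing at an earlier pattern's group:

```python
import re

p1 = r"(a)\1"  # matches "aa"
p2 = r"(b)\1"  # matches "bb" on its own

combined = "|".join("(?:%s)" % p for p in (p1, p2))
# combined == r"(?:(a)\1)|(?:(b)\1)" -- the \1 in the second branch now
# refers to the (a) group, which never participates in that branch.

print(re.match(p2, "bb") is not None)        # True
print(re.match(combined, "bb") is not None)  # False
```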

The fastest way to do this is to combine all the regexes into one with `|` between them, then make a single match call. You'll also want to compile it once to make sure you're avoiding repeated regex compilation.

For example:

def matches_pattern(s, pats):
    pat = "|".join("(%s)" % p for p in pats)
    return bool(re.match(pat, s))

This assumes `pats` contains pattern strings, not compiled patterns. If you really only have compiled regexes, use the `.pattern` attribute:

def matches_pattern(s, pats):
    pat = "|".join("(%s)" % p.pattern for p in pats)
    return bool(re.match(pat, s))
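If the pattern list is fixed, the compilation can be hoisted out entirely. A sketch with hypothetical patterns, using non-capturing `(?:...)` groups to be safe:

```python
import re

# Hypothetical fixed pattern list, combined and compiled once at import time.
PATTERNS = [r"foo.*", r"\d+bar"]
_COMBINED = re.compile("|".join("(?:%s)" % p for p in PATTERNS))

def matches_pattern(s):
    return _COMBINED.match(s) is not None

print(matches_pattern("123bar"))  # True
print(matches_pattern("baz"))     # False
```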
Ned Batchelder
  • You might be able to further optimize them by analyzing them for overlaps and reducing the number of total patterns. – Silas Ray May 14 '12 at 21:09

Adding to the excellent answers above: make sure you compare the result of `re.match` with `None` (using `is`) rather than converting it with `bool`:

>>> timeit('None is None')
0.03676295280456543
>>> timeit('bool(None)')
0.1125330924987793
>>> timeit('re.match("a","abc") is None', 'import re')
1.0200879573822021
>>> timeit('bool(re.match("a","abc"))', 'import re')
1.134294033050537
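Applied to the OP's function, that micro-optimization would look like this (a sketch, following the answer's suggestion):

```python
import re

def matches_pattern(s, patterns):
    # Comparing against None with `is` avoids the (slightly slower)
    # bool() conversion of the match result.
    for pattern in patterns:
        if pattern.match(s) is not None:
            return True
    return False

pats = [re.compile(r"ab.*"), re.compile(r"\d+")]
print(matches_pattern("abc", pats))  # True
print(matches_pattern("zzz", pats))  # False
```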
Anuj Gupta

It's not exactly what the OP asked, but this worked well for me as an alternative to long iterative matching.

Here is some example data and code:

import random
import time

mylonglist = [''.join(random.choice("ABCDE") for i in range(50)) for j in range(3000)]

# check uniqueness
print("uniqueness:")
print(len(mylonglist) == len(set(mylonglist)))

# subsample 1000
subsamp = [mylonglist[x] for x in random.sample(range(3000), 1000)]
# join into one long string for matching
string = " ".join(subsamp)

# test function 1
def by_string_match(string, mylonglist):
    counter = 0
    t1 = time.time()
    for i in mylonglist:
        if i in string:
            counter += 1
    t2 = time.time()
    print("It took {} seconds to find {} items".format(t2 - t1, counter))

# test function 2
def by_iterative_match(subsamp, mylonglist):
    counter = 0
    t1 = time.time()
    for i in mylonglist:
        if any(i in s for s in subsamp):
            counter += 1
    t2 = time.time()
    print("It took {} seconds to find {} items".format(t2 - t1, counter))

# test 1:
print("string match:")
by_string_match(string, mylonglist)
# test 2:
print("iterative match:")
by_iterative_match(subsamp, mylonglist)
ssanch