Consider this Python code:
import timeit
import re

def one():
    any(s in mystring for s in ('foo', 'bar', 'hello'))

r = re.compile('(foo|bar|hello)')

def two():
    r.search(mystring)

mystring = "hello" * 1000
print([timeit.timeit(k, number=10000) for k in (one, two)])
mystring = "goodbye" * 1000
print([timeit.timeit(k, number=10000) for k in (one, two)])
Basically, I'm benchmarking two ways to check whether any of several substrings occurs in a large string.
What I get here (Python 3.2.3) is this output:
[0.36678314208984375, 0.03450202941894531]
[0.6672089099884033, 3.7519450187683105]
In the first case, the regular expression easily defeats the any
expression: the regex finds the substring immediately (it sits at the very start of the string), while the any
expression has to scan the whole string twice, for 'foo' and then 'bar', before it gets to the substring that actually matches.
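To make that short-circuiting concrete, here's a minimal sketch of what one() does on the "hello" string; the probe helper is purely illustrative and not part of the benchmark:

```python
# `any` over a generator stops at the first hit, but each `in` test is a
# full, independent scan of the string.
mystring = "hello" * 1000

attempted = []

def probe(s):
    # Hypothetical helper: record which substrings get tried.
    attempted.append(s)
    return s in mystring

found = any(probe(s) for s in ('foo', 'bar', 'hello'))
# 'foo' and 'bar' each force a full scan; 'hello' matches and stops the loop.
```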
But what's going on in the second example? In the case where no substring is present, the regular expression is surprisingly slow! This surprises me, since theoretically the regex only has to traverse the string once, while the any
expression has to traverse it three times. What's wrong here? Is there a problem with my regex, or are Python regexes simply slow in this case?
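For what it's worth, both approaches do agree on the result in both cases; the question is only about how much work the regex engine does before giving up. A minimal sanity check, with my guess about the cost in a comment:

```python
import re

r = re.compile('(foo|bar|hello)')

hit = "hello" * 1000
miss = "goodbye" * 1000

# Both methods agree on hit and miss. On the miss string, my suspicion is
# that the engine has to try the whole alternation at each of the ~7000
# starting positions before it can report failure.
assert r.search(hit) is not None
assert r.search(miss) is None
assert any(s in hit for s in ('foo', 'bar', 'hello'))
assert not any(s in miss for s in ('foo', 'bar', 'hello'))
```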