Short answer
If you call compiled_pattern.search(text)
directly, it won't call _compile
at all, it will be faster than re.search(pattern, text)
and much faster than re.search(compiled_pattern, text)
.
This performance difference is due to KeyError
s in cache and slow hash calculations for compiled patterns.
re
functions and SRE_Pattern
methods
Any time a re
function with a pattern
as 1st argument (e.g. re.search(pattern, string)
or re.findall(pattern, string)
) is called, Python tries to compile the pattern
first with _compile
and then calls the corresponding method on the compiled pattern. For example:
def search(pattern, string, flags=0):
"""Scan through string looking for a match to the pattern, returning
a match object, or None if no match was found."""
return _compile(pattern, flags).search(string)
Note that pattern
could be either a string or an already compiled pattern (an SRE_Pattern
instance).
_compile
Here's a compact version of _compile
. I simply removed debug and flags check:
_cache = {}
_pattern_type = type(sre_compile.compile("", 0))
_MAXCACHE = 512
def _compile(pattern, flags):
try:
p, loc = _cache[type(pattern), pattern, flags]
if loc is None or loc == _locale.setlocale(_locale.LC_CTYPE):
return p
except KeyError:
pass
if isinstance(pattern, _pattern_type):
return pattern
if not sre_compile.isstring(pattern):
raise TypeError("first argument must be string or compiled pattern")
p = sre_compile.compile(pattern, flags)
if len(_cache) >= _MAXCACHE:
_cache.clear()
loc = None
_cache[type(pattern), pattern, flags] = p, loc
return p
_compile
with String pattern
When _compile
is called with a string pattern, the compiled pattern is saved in _cache
dict. Next time the same function is called (e.g. during the many timeit
runs), _compile
simply checks in _cache
if this string has already been seen and returns the corresponding compiled pattern.
Using ipdb
debugger in Spyder, it's easy to dive into re.py
during execution.
import re
pattern = 'sed'
text = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod' \
'tempor incididunt ut labore et dolore magna aliqua.'
compiled_pattern = re.compile(pattern)
re.search(pattern, text)
re.search(pattern, text)
With a breakpoint at the second re.search(pattern, text)
, it can be seen that the pair:
{(<class 'str'>, 'sed', 0): (re.compile('sed'), None)}
is saved in _cache
. The compiled pattern is returned directly.
_compile
with compiled pattern
slow hash
What happens if _compile
is called with an already compiled pattern?
First, _compile
checks if the pattern is in _cache
. To do so, it needs to calculate its hash. This calculation is much slower for a compiled pattern than for a string:
In [1]: import re
In [2]: pattern = "(?:a(?:b(?:b\\é|sorbed)|ccessing|gar|l(?:armists|ternation)|ngels|pparelled|u(?:daciousness's|gust|t(?:horitarianism's|obiographi
...: es)))|b(?:aden|e(?:nevolently|velled)|lackheads|ooze(?:'s|s))|c(?:a(?:esura|sts)|entenarians|h(?:eeriness's|lorination)|laudius|o(?:n(?:form
...: ist|vertor)|uriers)|reeks)|d(?:aze's|er(?:elicts|matologists)|i(?:nette|s(?:ciplinary|dain's))|u(?:chess's|shanbe))|e(?:lectrifying|x(?:ampl
...: ing|perts))|farmhands|g(?:r(?:eased|over)|uyed)|h(?:eft|oneycomb|u(?:g's|skies))|i(?:mperturbably|nterpreting)|j(?:a(?:guars|nitors)|odhpurs
...: 's)|kindnesses|m(?:itterrand's|onopoly's|umbled)|n(?:aivet\\é's|udity's)|p(?:a(?:n(?:els|icky|tomimed)|tios)|erpetuating|ointer|resentation|
...: yrite)|r(?:agtime|e(?:gret|stless))|s(?:aturated|c(?:apulae|urvy's|ylla's)|inne(?:rs|d)|m(?:irch's|udge's)|o(?:lecism's|utheast)|p(?:inals|o
...: onerism's)|tevedore|ung|weetest)|t(?:ailpipe's|easpoon|h(?:ermionic|ighbone)|i(?:biae|entsin)|osca's)|u(?:n(?:accented|earned)|pstaging)|v(?
...: :alerie's|onda)|w(?:hirl|ildfowl's|olfram)|zimmerman's)"
In [3]: compiled_pattern = re.compile(pattern)
In [4]: % timeit hash(pattern)
126 ns ± 0.358 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
In [5]: % timeit hash(compiled_pattern)
7.67 µs ± 21 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
hash(compiled_pattern)
is 60 times slower than hash(pattern)
here.
KeyError
When a pattern
is unknown, _cache[type(pattern), pattern, flags]
fails with a KeyError
.
The KeyError
gets handled and ignored. Only then does _compile
check if the pattern is already compiled. If it is, it gets returned, without being written in cache.
It means that the next time _compile
is called with the same compiled pattern, it will calculate the useless, slow hash again, but will still fail with a KeyError
.
Error handling is expensive, and I suppose that's the main reason why re.search(compiled_pattern, text)
is slower than re.search(pattern, text)
.
This weird behaviour might be a choice to speed up calls with string patterns, but it might have been a good idea to write a warning if _compile
is called with an already compiled pattern.