
My problem is about parsing log files and removing variable parts from each line in order to group them. For instance:

s = re.sub(r'(?i)User [_0-9A-z]+ is ', r"User .. is ", s)
s = re.sub(r'(?i)Message rejected because : (.*?) \(.+\)', r'Message rejected because : \1 (...)', s)

I have about 120+ matching rules like the above.

I found no performance issues when applying 100 different regexes in succession, but a huge slowdown occurs when applying 101.

The exact same behavior happens when I replace my rules with:

for a in range(100):
    s = re.sub(r'(?i)caught here'+str(a)+':.+', r'( ... )', s)

It got 20 times slower when using range(101) instead.

# range(100)
% ./dashlog.py file.bz2
== Took  2.1 seconds.  ==

# range(101)
% ./dashlog.py file.bz2
== Took  47.6 seconds.  ==

Why is this happening, and is there any known workaround?

(Happens on Python 2.6.6/2.7.2 on Linux/Windows.)

Wiil

1 Answer


Python keeps an internal cache for compiled regular expressions. Whenever you use one of the top-level functions that takes a regular expression, Python first compiles that expression, and the result of that compilation is cached.

Guess how many items the cache can hold?

>>> import re
>>> re._MAXCACHE
100

The moment you exceed the cache size, Python 2 clears all cached expressions and starts with a clean cache. Python 3 increased the limit to 512 but still does a full clear.
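
In outline, the caching logic in Python 2.7's re.py boils down to something like this (a simplified sketch, not the literal stdlib code; the real _compile() also keys the cache on the pattern's type and caches replacement templates separately):

import sre_compile  # the module re delegates actual compilation to

_cache = {}
_MAXCACHE = 100

def _compile(pattern, flags):
    # Return a previously compiled pattern if we still have it.
    key = (pattern, flags)
    try:
        return _cache[key]
    except KeyError:
        pass
    compiled = sre_compile.compile(pattern, flags)
    if len(_cache) >= _MAXCACHE:
        _cache.clear()  # pattern number 101 wipes out all 100 cached entries
    _cache[key] = compiled
    return compiled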

The work-around is for you to cache the compilation yourself:

compiled_expression = re.compile(r'(?i)User [_0-9A-z]+ is ')

s = compiled_expression.sub(r"User .. is ", s)
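
For a rule set the size of yours, you could for instance compile every pattern up front and loop over the compiled objects; the sketch below reuses the two rules from your question, and normalize() is just an illustrative name:

import re

# Compile every rule exactly once, at module load time.
RULES = [
    (re.compile(r'(?i)User [_0-9A-z]+ is '), r'User .. is '),
    (re.compile(r'(?i)Message rejected because : (.*?) \(.+\)'),
     r'Message rejected because : \1 (...)'),
]

def normalize(line):
    # The size of re's internal cache no longer matters here.
    for pattern, replacement in RULES:
        line = pattern.sub(replacement, line)
    return line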

You could use functools.partial() to bundle the sub() call together with the replacement expression:

from functools import partial

compiled_expression = re.compile(r'(?i)User [_0-9A-z]+ is ')
ready_to_use_sub = partial(compiled_expression.sub, r"User .. is ")

then later on call ready_to_use_sub(s) to apply the compiled regular expression pattern together with its specific replacement string.
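
For example, on a made-up log line:

>>> ready_to_use_sub('User bob_42 is logged in')
'User .. is logged in'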

Martijn Pieters
  • Sweet! Especially since (the probably too dirty) `re._MAXCACHE = 200` is working flawlessly. I will go for my own re compilation cache in the end (and it seems I should have looked at the re.py code straight away). Thanks a lot. – Wiil Jun 27 '13 at 17:27
  • 1
    Yeah, the cache implementation in `re` is rather primitive, and setting `re._MAXCACHE` should be safe for now, if not the best idea from a portability perspective. Attempts to improve the caching story have been made, but the latest attempt only led to slowdowns, see [Why are uncompiled, repeatedly used regexes so much slower in Python 3?](http://stackoverflow.com/q/14756790) – Martijn Pieters Jun 27 '13 at 17:31