
My problem is about parsing log files and removing variable parts from each line in order to group them. For instance:

s = re.sub(r'(?i)User [_0-9A-z]+ is ', r"User .. is ", s)
s = re.sub(r'(?i)Message rejected because : (.*?) \(.+\)', r'Message rejected because : \1 (...)', s)

I have about 120+ matching rules like the above.

I found no performance issues when applying 100 different regexes in succession, but a huge slowdown occurs when applying 101.

The exact same behavior happens when I replace my rules with:

for a in range(100):
    s = re.sub(r'(?i)caught here'+str(a)+':.+', r'( ... )', s)

It got 20 times slower when using range(101) instead.

# range(100)
% ./dashlog.py file.bz2
== Took  2.1 seconds.  ==

# range(101)
% ./dashlog.py file.bz2
== Took  47.6 seconds.  ==

Why is this happening, and is there any known workaround?

(Happens on Python 2.6.6/2.7.2 on Linux/Windows.)

Wiil

1 Answer


Python keeps an internal cache for compiled regular expressions. Whenever you use one of the top-level functions that takes a regular expression, Python first compiles that expression, and the result of that compilation is cached.

Guess how many items the cache can hold?

>>> import re
>>> re._MAXCACHE
100

The moment you exceed the cache size, Python 2 clears all cached expressions and starts with a clean cache. Python 3 increased the limit to 512 but still does a full clear.
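
In outline, the caching logic in Python 2.7's re.py boils down to something like this (a simplified sketch, not the literal stdlib code; the real _compile() also keys the cache on the pattern's type and caches replacement templates separately):

import sre_compile  # the module re delegates actual compilation to

_cache = {}
_MAXCACHE = 100

def _compile(pattern, flags):
    # Return a previously compiled pattern if we still have it.
    key = (pattern, flags)
    try:
        return _cache[key]
    except KeyError:
        pass
    compiled = sre_compile.compile(pattern, flags)
    if len(_cache) >= _MAXCACHE:
        _cache.clear()  # pattern number 101 wipes out all 100 cached entries
    _cache[key] = compiled
    return compiled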

The work-around is for you to cache the compilation yourself:

compiled_expression = re.compile(r'(?i)User [_0-9A-z]+ is ')

s = compiled_expression.sub(r"User .. is ", s)
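
For a rule set the size of yours, you could for instance compile every pattern up front and loop over the compiled objects; the sketch below reuses the two rules from your question, and normalize() is just an illustrative name:

import re

# Compile every rule exactly once, at module load time.
RULES = [
    (re.compile(r'(?i)User [_0-9A-z]+ is '), r'User .. is '),
    (re.compile(r'(?i)Message rejected because : (.*?) \(.+\)'),
     r'Message rejected because : \1 (...)'),
]

def normalize(line):
    # The size of re's internal cache no longer matters here.
    for pattern, replacement in RULES:
        line = pattern.sub(replacement, line)
    return line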

You could use functools.partial() to bundle the sub() call together with the replacement expression:

from functools import partial

compiled_expression = re.compile(r'(?i)User [_0-9A-z]+ is ')
ready_to_use_sub = partial(compiled_expression.sub, r"User .. is ")

then later on call ready_to_use_sub(s) to apply the compiled regular expression pattern together with its specific replacement string.
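
For example, on a made-up log line:

>>> ready_to_use_sub('User bob_42 is logged in')
'User .. is logged in'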

Martijn Pieters
  • Sweet! Especially since (the probably too dirty) `re._MAXCACHE = 200` is working flawlessly. I will go for my own re compilation cache in the end (and it seems I should have looked at the re.py code straight away). Thanks a lot. – Wiil Jun 27 '13 at 17:27
  • 1
    Yeah, the cache implementation in `re` is rather primitive, and setting `re._MAXCACHE` should be safe for now, if not the best idea from a portability perspective. Attempts to improve the caching story have been made, but the latest attempt only led to slowdowns, see [Why are uncompiled, repeatedly used regexes so much slower in Python 3?](http://stackoverflow.com/q/14756790) – Martijn Pieters Jun 27 '13 at 17:31