My problem is about parsing log files and removing the variable parts from each line so the lines can be grouped. For instance:
s = re.sub(r'(?i)User [_0-9A-z]+ is ', r"User .. is ", s)
s = re.sub(r'(?i)Message rejected because : (.*?) \(.+\)', r'Message rejected because : \1 (...)', s)
I have about 120+ matching rules like the above.
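For context, the rules are applied one after another to every log line, roughly like the sketch below (the RULES list and the normalize function are just an illustration of the structure, not my actual code):

import re

# Illustrative only: the real script holds ~120 (pattern, replacement) pairs.
RULES = [
    (r'(?i)User [_0-9A-z]+ is ', r'User .. is '),
    (r'(?i)Message rejected because : (.*?) \(.+\)',
     r'Message rejected because : \1 (...)'),
    # ... about 120 more rules
]

def normalize(line):
    # Apply every substitution rule to a single log line.
    for pattern, replacement in RULES:
        line = re.sub(pattern, replacement, line)
    return line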
I found no performance issues when applying 100 different regexes in succession, but a huge slowdown occurs as soon as I apply 101 of them.
The exact same behavior happens when I replace my rules with:
for a in range(100):
    s = re.sub(r'(?i)caught here' + str(a) + ':.+', r'( ... )', s)
Using range(101) instead makes it roughly 20 times slower:
# range(100)
% ./dashlog.py file.bz2
== Took 2.1 seconds. ==
# range(101)
% ./dashlog.py file.bz2
== Took 47.6 seconds. ==
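Here is a self-contained sketch of the benchmark that shows the same cliff for me on Python 2.6/2.7 (the sample line and the 10000-iteration loop are made up stand-ins for reading file.bz2 line by line):

import re
import time

N = 101  # switch to 100 and the slowdown disappears
line = 'caught here42: some variable payload'

start = time.time()
for _ in range(10000):  # stand-in for the lines of file.bz2
    s = line
    for a in range(N):
        s = re.sub(r'(?i)caught here' + str(a) + ':.+', r'( ... )', s)
print('== Took %.1f seconds. ==' % (time.time() - start))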
Why is this happening? And is there any known workaround?
(Happens on Python 2.6.6/2.7.2 on Linux/Windows.)