Python Regex takes so long in some cases

Question

I compiled the following pattern

import re

pattern = re.compile(
    r"""
    (?P<date>.*?)
    \s*
    (?P<thread_id>\w+)
    \s*PACKET\s*
    (?P<identifier>\w+)
    \s*
    (?P<proto>\w+)
    \s*
    (?P<indicator>\w+)
    \s*
    (?P<ip>\d+\.\d+\.\d+\.\d+)
    \s*
    (?P<xid>\w+)
    \s*
    (?P<q_r>.*?)
    \s*\[
    (?P<flag_hex>[0-9]*)
    \s*
    (?P<flag_char_code>.*?)
    \s*
    (?P<status>\w+)
    \]\s*
    (?P<record>\w+)
    \s*
    \.(?P<domain>.*)\.
    """, re.VERBOSE
    )

to work with this string

2/1/2014 9:34:29 PM 05EC PACKET 00000000025E97A0 UDP Snd 10.10.10.10 ebbe R Q [8381 DR NXDOMAIN] A (1)9(1)a(3)c-0(11)19-330ff801(7)e0400b1(4)15e0(4)1ca7(4)2f4a(3)210(1)0(26)841f75qnhp97z6jknf946qwfm5(4)avts(6)domain(3)com(0)

And it works successfully:

In [4]: pattern.findall(re.sub(r'\(\d+\)', '.', x))
Out[4]: 
[('2/1/2014 9:34:29 PM',
  '05EC',
  '00000000025E97A0',
  'UDP',
  'Snd',
  '10.10.10.10',
  'ebbe',
  'R Q',
  '8381',
  'DR',
  'NXDOMAIN',
  'A',
  '9.a.c-0.19-330ff801.e0400b1.15e0.1ca7.2f4a.210.0.841f75qnhp97z6jknf946qwfm5.avts.domain.com')]

The issue is that matching takes a very long time in some cases. Any idea how to improve the pattern so that it runs faster?

Define "long"? Can you give us an example of an input string that the regex takes too long to match? As a side note, I probably wouldn't use regexes to solve that problem: it looks like a good use case for the good old `split()` method of string objects. – Max Noel, Mar 05 '14 at 18:21
Which method are you using? If you are not using `re.match()`, the first thing I'd do is add a `^` start-of-string/line anchor at the beginning. Also, @JDB is right about the `\s*(?.*?)\s*` construct possibly _running away_. – ridgerunner, Mar 05 '14 at 18:47
It's good to split everything that can be split, and then apply a regex only to the parts that can't. The logic is less complicated that way, and in many cases it also runs faster. – akaRem, Mar 05 '14 at 19:38
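
The split-based approach suggested in the comments could look roughly like the sketch below. It is only an illustration under assumptions taken from the single sample line: fields are separated by whitespace, the literal word PACKET is surrounded by spaces, and the flags sit in exactly one pair of square brackets. The function name `parse_packet_line` is made up for this example.

```python
import re

sample = (
    "2/1/2014 9:34:29 PM 05EC PACKET 00000000025E97A0 UDP Snd 10.10.10.10 "
    "ebbe R Q [8381 DR NXDOMAIN] A (1)9(1)a(3)c-0(11)19-330ff801(7)e0400b1"
    "(4)15e0(4)1ca7(4)2f4a(3)210(1)0(26)841f75qnhp97z6jknf946qwfm5(4)avts"
    "(6)domain(3)com(0)"
)


def parse_packet_line(line):
    """Split one PACKET log line into named fields (hypothetical helper)."""
    # Everything before " PACKET " is the timestamp followed by the thread id.
    head, sep, tail = line.partition(" PACKET ")
    if not sep:
        return None  # not a PACKET line
    date, thread_id = head.rsplit(None, 1)

    # The bracketed flags block cuts the remainder into three pieces.
    before, _, rest = tail.partition("[")
    flags, _, after = rest.partition("]")

    # Star-unpacking raises ValueError if a line has too few fields.
    identifier, proto, indicator, ip, xid, *q_r = before.split()
    flag_hex, *flag_chars, status = flags.split()
    record, raw_domain = after.split(None, 1)

    # Turn the length-prefixed labels, e.g. "(3)com(0)", into a dotted name.
    domain = re.sub(r"\(\d+\)", ".", raw_domain).strip().strip(".")

    return {
        "date": date,
        "thread_id": thread_id,
        "identifier": identifier,
        "proto": proto,
        "indicator": indicator,
        "ip": ip,
        "xid": xid,
        "q_r": " ".join(q_r),
        "flag_hex": flag_hex,
        "flag_char_code": " ".join(flag_chars),
        "status": status,
        "record": record,
        "domain": domain,
    }


fields = parse_packet_line(sample)
print(fields["date"], fields["q_r"], fields["domain"])
```

Because every step is a plain string operation, there is nothing to backtrack over; the only regex left is the simple `\(\d+\)` substitution the question already uses.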

score 6 · Accepted Answer · edited May 23 '17 at 10:31

Yep, you've got yourself a case of catastrophic backtracking, also known as an "evil regex", here:

\s*
(?P<q_r>.*?)
\s*

Here:

\s*
(?P<flag_char_code>.*?)
\s*

And here:

\s*
\.(?P<domain>.*)\.

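Replacing .* with \S* should do the trick for fields that contain no whitespace. A field such as q_r, which is "R Q" in your sample, needs a class that allows spaces but stops at the delimiter, for example [^\[]*.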

For more information about what an evil regex is and why it's evil, check out this question:
How can I recognize an evil regex?
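
A quick check of the substitution on the middle part of the sample line, as a sketch rather than a definitive fix. The names `TEMPLATE` and `build` are invented for this example, and the class `[^\[]*?` for q_r is an assumption based on the one sample line, where q_r is "R Q" and therefore contains a space that `\S*` cannot cross.

```python
import re

# Middle part of the sample line: ip, xid, q_r and the bracketed flags.
fragment = "10.10.10.10 ebbe R Q [8381 DR NXDOMAIN]"

# Same structure as the question's pattern, with the two wildcard
# fields left as placeholders so different subpatterns can be tried.
TEMPLATE = r"""
    (?P<ip>\d+\.\d+\.\d+\.\d+)
    \s*
    (?P<xid>\w+)
    \s*
    (?P<q_r>{q_r})
    \s*\[
    (?P<flag_hex>[0-9]*)
    \s*
    (?P<flag_char_code>{flag_char_code})
    \s*
    (?P<status>\w+)
    \]
"""


def build(q_r, flag_char_code):
    """Compile the template with the given subpatterns (hypothetical helper)."""
    source = TEMPLATE.format(q_r=q_r, flag_char_code=flag_char_code)
    return re.compile(source, re.VERBOSE)


# \S* everywhere: q_r can no longer span the space inside "R Q".
literal = build(r"\S*", r"\S*")
# q_r may contain spaces but must stop before the opening bracket.
adjusted = build(r"[^\[]*?", r"\S*")

print(literal.match(fragment))               # None
print(adjusted.match(fragment).group("q_r"))  # R Q
```

Using `match()` here anchors the attempt at the start of the string, so a failing line is rejected after one attempt instead of being retried from every position.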

answered Mar 05 '14 at 18:26

JDB

Casimir et Hippolyte · Answer 2 · 2014-03-05T19:05:09.440

1

You can improve your pattern with:

(?P<domain>\w+(?:[-.]\w+)*)
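(?P<date>\d{1,2}/\d{1,2}/\d{4}[ ]\d{1,2}:\d{1,2}:\d{1,2}[ ][AP]M)
(?P<q_r>[^[]*)

You need a more explicit subpattern for flag_char_code too. The goal is to describe the content of each group precisely, which reduces the work the regex engine has to do and avoids backtracking. Because your pattern is compiled with re.VERBOSE, the literal spaces in the date subpattern are written as [ ]; a bare space would be ignored.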
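
Put together, the explicit subpatterns might be combined as in the sketch below. Several details are assumptions rather than part of the answer: fields are taken to be separated by at least one whitespace character (`\s+` instead of `\s*`), `[^\]]*?` is only a guess at a subpattern for flag_char_code, and `match()` is used so the pattern is tried once from the start of the line. Note that under `re.VERBOSE` a literal space has to be written as `[ ]` (or escaped), otherwise it is ignored.

```python
import re

pattern = re.compile(
    r"""
    (?P<date>\d{1,2}/\d{1,2}/\d{4}[ ]\d{1,2}:\d{1,2}:\d{1,2}[ ][AP]M)
    \s+
    (?P<thread_id>\w+)
    \s+PACKET\s+
    (?P<identifier>\w+)
    \s+
    (?P<proto>\w+)
    \s+
    (?P<indicator>\w+)
    \s+
    (?P<ip>\d+\.\d+\.\d+\.\d+)
    \s+
    (?P<xid>\w+)
    \s+
    (?P<q_r>[^\[]*?)             # may contain spaces, stops before "["
    \s*\[
    (?P<flag_hex>[0-9]*)
    \s*
    (?P<flag_char_code>[^\]]*?)  # assumed: anything except "]"
    \s*
    (?P<status>\w+)
    \]\s*
    (?P<record>\w+)
    \s+
    \.(?P<domain>\w+(?:[-.]\w+)*)\.
    """,
    re.VERBOSE,
)

x = (
    "2/1/2014 9:34:29 PM 05EC PACKET 00000000025E97A0 UDP Snd 10.10.10.10 "
    "ebbe R Q [8381 DR NXDOMAIN] A (1)9(1)a(3)c-0(11)19-330ff801(7)e0400b1"
    "(4)15e0(4)1ca7(4)2f4a(3)210(1)0(26)841f75qnhp97z6jknf946qwfm5(4)avts"
    "(6)domain(3)com(0)"
)

# Same preprocessing as in the question: "(3)" style length prefixes become dots.
line = re.sub(r"\(\d+\)", ".", x)

m = pattern.match(line)
if m:
    print(m.groupdict())
```

To judge whether this is fast enough for a given log, it is worth timing it with `timeit` on real lines, including lines that are expected not to match, since failed matches are where backtracking costs the most.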

edited Mar 05 '14 at 19:05

answered Mar 05 '14 at 18:46

Thanks a lot, it's the first time I've heard about backtracking in regex. – bbambbam sh2ee2 Mar 05 '14 at 19:56
