
I compiled the following pattern:

pattern = re.compile(
    r"""
    (?P<date>.*?)
    \s*
    (?P<thread_id>\w+)
    \s*PACKET\s*
    (?P<identifier>\w+)
    \s*
    (?P<proto>\w+)
    \s*
    (?P<indicator>\w+)
    \s*
    (?P<ip>\d+\.\d+\.\d+\.\d+)
    \s*
    (?P<xid>\w+)
    \s*
    (?P<q_r>.*?)
    \s*\[
    (?P<flag_hex>[0-9]*)
    \s*
    (?P<flag_char_code>.*?)
    \s*
    (?P<status>\w+)
    \]\s*
    (?P<record>\w+)
    \s*
    \.(?P<domain>.*)\.
    """, re.VERBOSE
    )

to parse this string:

2/1/2014 9:34:29 PM 05EC PACKET 00000000025E97A0 UDP Snd 10.10.10.10 ebbe R Q [8381 DR NXDOMAIN] A (1)9(1)a(3)c-0(11)19-330ff801(7)e0400b1(4)15e0(4)1ca7(4)2f4a(3)210(1)0(26)841f75qnhp97z6jknf946qwfm5(4)avts(6)domain(3)com(0)

It matches successfully:

In [4]: pattern.findall(re.sub(r'\(\d+\)', '.', x))
Out[4]: 
[('2/1/2014 9:34:29 PM',
  '05EC',
  '00000000025E97A0',
  'UDP',
  'Snd',
  '10.10.10.10',
  'ebbe',
  'R Q',
  '8381',
  'DR',
  'NXDOMAIN',
  'A',
  '9.a.c-0.19-330ff801.e0400b1.15e0.1ca7.2f4a.210.0.841f75qnhp97z6jknf946qwfm5.avts.domain.com')]

The issue is that matching takes a very long time in some cases. Any idea how to improve the pattern so that it takes less time?

Andy
  • Define "long"? Can you give us an example of input string which the regex takes too long to match? As a side note, I probably wouldn't use regexes to solve that problem -- it looks like a good use case for the good old `split()` method of string objects. – Max Noel Mar 05 '14 at 18:21
  • Which method are you using? If you are not using `re.match()`, the first thing I'd do is add a `^` start of string/line anchor at the beginning. Also, @JDB is right about the: `\s*(?.*?)\s*` construct possibly _running away_. – ridgerunner Mar 05 '14 at 18:47
  • It's good to split everything that you can split, and then run a regex only on the parts that you can't split. The logic will be less complicated, and it will be faster in many cases. – akaRem Mar 05 '14 at 19:38

2 Answers


Yep, you've got yourself a case of catastrophic backtracking, also known as an "evil regex", here:

\s*
(?P<q_r>.*?)
\s*

Here:

\s*
(?P<flag_char_code>.*?)
\s*

And here:

\s*
\.(?P<domain>.*)\.

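Replacing .* with \S* should do the trick wherever the captured text cannot contain whitespace. Note that q_r holds "R Q" in your sample, so that group needs a class that allows spaces but stops at the bracket, such as [^\[]*.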
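As a rough sketch of that substitution, here is only the tail of the pattern with the three flagged groups tightened. The negated class for q_r is an assumption (the sample value "R Q" contains a space, so \S* alone would not match it), and the shortened `sample` string is made up for illustration:

```python
import re

# Tail of the question's pattern with the three flagged groups tightened.
# Assumption: q_r may contain spaces, so it uses a negated class, not \S*.
tail = re.compile(
    r"""
    (?P<q_r>[^\[\]]*?)
    \s*\[
    (?P<flag_hex>[0-9]*)
    \s*
    (?P<flag_char_code>\S*)
    \s*
    (?P<status>\w+)
    \]\s*
    (?P<record>\w+)
    \s*
    \.(?P<domain>\S*)\.
    """,
    re.VERBOSE,
)

# Hypothetical shortened tail of a log line, after "(n)" was replaced by ".".
sample = "R Q [8381 DR NXDOMAIN] A .avts.domain.com."
result = tail.match(sample).groupdict()
print(result)
```

Because \S cannot cross a space, each group has far fewer ways to share text with the neighbouring \s*, which is where the backtracking came from.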

For more information about what an evil regex is and why it's evil, check out this question:
How can I recognize an evil regex?

JDB

You can improve your pattern with:

(?P<domain>\w+(?:[-.]\w+)*)
(?P<date>\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{1,2}:\d{1,2} [AP]M)
(?P<q_r>[^[]*)

You need a more explicit subpattern for flag_char_code too. The goal is to describe the content of each group precisely, which reduces the regex engine's work and avoids backtracking.
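Put together, the whole pattern might look something like the sketch below. The date and domain subpatterns are the ones above. The lazy q_r variant, the hex class for flag_hex, the [A-Z ] class for flag_char_code, and the switch from \s* to \s+ between fields are assumptions based on the single sample line in the question, so adjust them to your real log format:

```python
import re

# Sketch: every group describes its own content, so the engine has little
# room to backtrack. Classes marked "assumption" are guessed from one sample.
pattern = re.compile(
    r"""
    (?P<date>\d{1,2}/\d{1,2}/\d{4}\ \d{1,2}:\d{1,2}:\d{1,2}\ [AP]M)
    \s+
    (?P<thread_id>\w+)
    \s+PACKET\s+
    (?P<identifier>\w+)
    \s+
    (?P<proto>\w+)
    \s+
    (?P<indicator>\w+)
    \s+
    (?P<ip>\d+\.\d+\.\d+\.\d+)
    \s+
    (?P<xid>\w+)
    \s+
    (?P<q_r>[^\[]*?)                 # lazy, so the trailing space is not captured
    \s*\[
    (?P<flag_hex>[0-9A-Fa-f]*)       # assumption: hexadecimal digits
    \s+
    (?P<flag_char_code>[A-Z\ ]*?)    # assumption: uppercase flag letters
    \s+
    (?P<status>\w+)
    \]\s*
    (?P<record>\w+)
    \s+
    \.(?P<domain>\w+(?:[-.]\w+)*)\.
    """,
    re.VERBOSE,
)

line = ("2/1/2014 9:34:29 PM 05EC PACKET 00000000025E97A0 UDP Snd 10.10.10.10 ebbe R Q "
        "[8381 DR NXDOMAIN] A (1)9(1)a(3)c-0(11)19-330ff801(7)e0400b1(4)15e0(4)1ca7(4)2f4a"
        "(3)210(1)0(26)841f75qnhp97z6jknf946qwfm5(4)avts(6)domain(3)com(0)")

# Same preprocessing as in the question: turn each "(n)" label length into a dot.
normalized = re.sub(r"\(\d+\)", ".", line)
match = pattern.match(normalized)  # match() anchors at the start of the line
fields = match.groupdict() if match else {}
print(fields)
```

Using match() instead of findall() also means a line that does not fit is rejected after one attempt from the start of the string, rather than being retried from every position.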

Casimir et Hippolyte