-1

regexp:

^(\w+,?\s?)+(?=:): hi hey\?$

input:

Aaaaaaaaa, bbbbbbbb, cccccccc, dddddddddd, eeeeeeeee: hi?

output: hangs

code

reg = re.compile('^(\w+,?\s?)+(?=:): hi hey\?$')
print reg.search('Aaaaaaaaa, bbbbbbbb, cccccccc, dddddddddd, eeeeeeeee: hi?')

Desired behaviour: find strings with the following pattern

[comma_and_(optionally)space_separated_values][colon][question]

Example of inputs that shouldn't match:

  • : qqq? (no values)
  • aa: qqq? (only one value)
  • aa, bb: qqq ( no question point)
  • aa, : qqq? (bad values format)
  • aa, bb, cc:? (no question)
  • , bb, cc: qqq? (bad values format)

Example of inputs that should match:

  • aa, bb: qq?
  • aa, b, c,d,e,f, g, h: qq?
  • aa, bb, cc: qq ee ff gggg hhhh?
nonsensei
  • 480
  • 4
  • 15
  • 2
    While `(?=:)` lookahead is redundant here, the main issue is that `(\w+,?\s?)+` contains an obligatory `\w+` and the rest is optional. This a typical pattern that leads to catastrophic backtracking. Use [`^\w+(?:,\s*\w+)*:.*\?$`](https://regex101.com/r/EhtNUw/2). – Wiktor Stribiżew May 18 '17 at 15:53
  • I'm not fluent in regex, but [this](http://stackoverflow.com/questions/8316284/why-regex-ismatch-hangs) could possibly be related – Wondercricket May 18 '17 at 15:55
  • 1
    You may use: [`^\w+(?:\s*,\s*\w+)*\s*:\s*([^?]+\?)`](https://regex101.com/r/JqnwnT/2) – anubhava May 18 '17 at 15:55
  • 1
    I reopened this because the answer given in the _dupe_ question is not correct. Has nothing to do with obligatory anything, it's called backtracking. It's too complex. –  May 18 '17 at 16:27

1 Answers1

1

It hangs because of this (\w+)+ scenario.
I.e. too complex on failure.
Works fine if it matches, blows up on failure.

This (\w,?\s?)+ is identical to (\w+,?\s?)+ but won't hang.

So, change it to this ^(\w,?\s?)+(?=:): hi hey\?$ and problem solved.

As a bonus, this ^(\w,?\s?)+: hi hey\?$ is identical.

Also, you can substitute .*?\?$ in place of your literal hi hey\?$
if expected to be variable literal.


Error: Target Operation .. 

The complexity of matching the regular 
expression exceeded predefined bounds.  
Try refactoring the regular expression 
to make each choice made by the state 
machine unambiguous.  This exception is 
thrown to prevent "eternal" matches that
take an indefinite period time to 
locate.

edit:

Note that there is always a potential problem with nested quantifiers.
I.e. those that are greedy and open ended, like (b+)*.

This can almost be cured by removing an inner nest (like b+ in the example).
By making it un-quantified, we can call that a pseudo anchor.

That is, it should be first in the group and is a un-quantified, required character.

This forces the engine on backtrack to go to that character again to check it.
If it is not quantified, it gives up immediately and will not even look at
the rest of the expression.
Thus it goes past that position in the string to find the next literal b.

That is basically what this backtracking cure is all about.

Given the backtrack pitfalls, we can make a solution to get the desired match.

^\s*(\w+\s*(?:[,\s]\s*\w+\s*)+)\s*:\s*([^:]*?\w[^:]*?)\s*\?\s*$

Formatted

 ^                             # BOS
 \s*                           # Wsp trim
 (                             # (1 start), Values - minimum of 2 required
      \w+ \s*                       # First word
      (?: [,\s] \s* \w+ \s* )+      # One or more space or comma seperated
                                    # word value's
 )                             # (1 end)
 \s*                           # Wsp trim
 :                             # Colon
 \s*                           # Wsp trim
 (                             # (2 start), Question -
      [^:]*?                        # Not a colon
      \w                            # At least a word char
      [^:]*?                        # Not a colon
 )                             # (2 end)
 \s*                           # Wsp trim
 \?                            # '?'
 \s*                           # Wsp trim
 $                             # EOS
  • sounds interesting. the original hangs, but with changings there is no Problem anymore – am2 May 18 '17 at 16:54
  • 1
    @am2 - Makes sense. Rarely are explanations about out-of-control backtracking explained correctly. The reason being all regex are mostly different and subtle parts affect each other. You would have gotten the same problem with any of these forms `(\w*)*`, `(\w+)*`, `(\w*)+`, `(\w+)+`. Which are actually common. But put together with other parts, it blows up. The cure (fallback) is to do it like in this answer. Always test (force) for failure on all regex you make. –  May 18 '17 at 17:02
  • Your answer is clear to me. I don't know who downvoted you nor why. I upvoted and accepted – nonsensei May 19 '17 at 09:02
  • 1
    @nonsensei - Why thanks! I've added an extra note on backtracking and a regex solution for you. Note that after each match you could just split capture group 1 with `[\s,]+` to get individual values. –  May 19 '17 at 18:56