0

My computer is Windows 7 using the last Python version (3.6.4)

I've got a freeze (or hangup) in Python with a simple regular expression.

For didatic reasons, I've cut the excess and the code is:

import re
p = r'^\s*if(?:[^)]+(?!\)\s+[a-z]))*[^)]+\)\s+[a-z]'  # raw string
c = re.compile(p,re.I)   # It's OK. I've tried it. 
s = r'if ( slots[pass[i]].indexOf(passRight[i])==-1 )  // ahk comment'
r = c.search(s)  # regular expression running. Hangup
print(r)  # Never arrive here

The regular expression p skips the spaces, then IF, so it skips non ) chars until (using look ahead feature) occur a ')' followed by spaces and a letter. That situation does not occur in the two ) in the string s, so it never match that situation, that is invalid in AutoHotKey language.

So the right running it will return no match

However, I have an ugly infinite loop.

I repeat the same test in interactive mode using `IDLE'. It's the same!

It's very weird.

Can somebody help me?

UPDATE

Below the perfect answer from Wiktor Stribiżew. It's place like a comment but should be here as an answer. It saves my day!:

"Remove + After [^)] or the regex engine will try too many paths to match the same text that does not finally match, causing a catastrophic backtracking. The (?:[^)]+(?!\)\s+[a-z]))* is just a pattern of the (a+)+ type that will cause CB when used in the middle of the pattern. See more examples here and this one, too."

The last link from Wiktor comment is particularly elucidative. Below patt is a simple regex, and it works for small strings. However, for large string it's painful.

patt = r'A(B|C+)+D'

The right way. No ambiguity. No inner and outer repeater:

patt = r'A(B|C)+D'

In this case, the fault is somewhat obvious. My above example is a little bit more subtle, because non ) char is much more common than ')', even so, the internal + is still useless.

So, it's just cut the inner +. If [^)]+ changes to [^)] and everything works great:

p = r'^\s*if(?:[^)](?!\)\s+[a-z]))*[^)]+\)\s+[a-z]'  # raw string

I will try to be smarter next time.

Paulo Buchsbaum
  • 2,471
  • 26
  • 29
  • 3
    what are you trying to find exactly, what is the expected output – depperm Mar 16 '18 at 18:03
  • 3
    It's prolly because of your catastrophic backtracking here `(?:[^)]+(?!\)\s+[a-z]))*`. You have multiple repetitions within a group that can be repeated – Federico Piazza Mar 16 '18 at 18:05
  • 6
    See https://www.regular-expressions.info/catastrophic.html for details – depperm Mar 16 '18 at 18:05
  • I will see the reference. I'm writing an utility for seach in a file for any undiscovered syntax errors. In this case, it's forbiden the syntax "If (expressão) command" in the same line. The command always start with a letter. – Paulo Buchsbaum Mar 16 '18 at 18:28
  • Most lines run without any problem. That is line 271 of my source code. – Paulo Buchsbaum Mar 16 '18 at 18:29
  • NLA = (?!\)\s+[a-z])) - this negative look ahead pattern matches a ")" + spaces + letter. ([^)]+ NLA )* - 1 or more non ")" chars + NLA repeat 0 or more times. In the end, I mach the syntax error: Sequence Non ")" + spaces + letter – Paulo Buchsbaum Mar 16 '18 at 18:34
  • There is no look behind and the internal nested expression uses "+", so it always advances if there are some partial match. No loop is possible. The external expression uses "*", but no problem because "0"matches indicates that after 1 or more non ")" chars, the search finds a ")" + spaces + letter, there is an undetected syntax error in AutoHotKey – Paulo Buchsbaum Mar 16 '18 at 18:38
  • 3
    [Remove `+` after `[^)]`](https://regex101.com/r/YhbrNs/1), or the regex engine will try too many paths to match the same text that does not finally match, causing a catastrophic backtracking. The `(?:[^)]+(?!\)\s+[a-z]))*` is just a pattern of the `(a+)+` type that will cause CB when used in the middle of the pattern. See more examples [here](https://community.appway.com/screen/kb/article/checking-strings-avoiding-catastrophic-backtracking). And [this one, too](https://snyk.io/blog/redos-and-catastrophic-backtracking/). – Wiktor Stribiżew Mar 16 '18 at 18:49
  • Are you absolutely right. When I remove "+" as you suggest, it works as expected. I will be more careful next time. – Paulo Buchsbaum Mar 16 '18 at 20:58
  • I've updated my question. The problem was solved. I understand that it's not an infinite loop but the old regex had too many possibilities to run in a realistic period. Now my 3.500 source code lines run smoothly – Paulo Buchsbaum Mar 16 '18 at 21:08

0 Answers0