My computer is Windows 7 using the last Python version (3.6.4)
I've got a freeze (or hangup) in Python with a simple regular expression.
For didatic reasons, I've cut the excess and the code is:
import re
p = r'^\s*if(?:[^)]+(?!\)\s+[a-z]))*[^)]+\)\s+[a-z]' # raw string
c = re.compile(p,re.I) # It's OK. I've tried it.
s = r'if ( slots[pass[i]].indexOf(passRight[i])==-1 ) // ahk comment'
r = c.search(s) # regular expression running. Hangup
print(r) # Never arrive here
The regular expression p
skips the spaces, then IF
, so it skips non )
chars until (using look ahead feature) occur a ')' followed by spaces and a letter. That situation does not occur in the two )
in the string s
, so it never match that situation, that is invalid in AutoHotKey
language.
So the right running it will return no match
However, I have an ugly infinite loop.
I repeat the same test in interactive mode using `IDLE'. It's the same!
It's very weird.
Can somebody help me?
UPDATE
Below the perfect answer from Wiktor Stribiżew. It's place like a comment but should be here as an answer. It saves my day!:
"Remove + After [^)] or the regex engine will try too many paths to match the same text that does not finally match, causing a catastrophic backtracking. The (?:[^)]+(?!\)\s+[a-z]))*
is just a pattern of the (a+)+
type that will cause CB when used in the middle of the pattern. See more examples here and this one, too."
The last link from Wiktor comment is particularly elucidative. Below patt
is a simple regex, and it works for small strings. However, for large string it's painful.
patt = r'A(B|C+)+D'
The right way. No ambiguity. No inner and outer repeater:
patt = r'A(B|C)+D'
In this case, the fault is somewhat obvious. My above example is a little bit more subtle, because non )
char is much more common than ')', even so, the internal +
is still useless.
So, it's just cut the inner +
. If [^)]+
changes to [^)]
and everything works great:
p = r'^\s*if(?:[^)](?!\)\s+[a-z]))*[^)]+\)\s+[a-z]' # raw string
I will try to be smarter next time.