2

I am the author of the pythonizer perl to python converter. I'm trying to bootstrap the latest version of it, and it's hanging in the re module. I ran it with -mtrace -t and it froze solid here:

Perlscan.py(449):     if py == "'.pl'":
Perlscan.py(454):     term = "'"
Perlscan.py(456):     print('py: ', len(py), py)
py:  140 '\U0001f60e\U0001f941\U0001f3b9\U0001f3a4\U0001f3b8\U0001f49c\U0001f55b\U0001f385\U0001f3fb\u274c\u2b55\ufe0f\U0001f4af\U0001f1fa\U0001f1f8'
Perlscan.py(457):     if _m := re.search(
Perlscan.py(458):         r"""^(f?(?:'''|""\"|'|")(?:[A-Za-z]:)?)((?:(?:[\\/])?(?:[{][^}]+[}])*|[A-Za-z0-9_.-]*)*)[.]pl\b(.*)$""", py
Perlscan.py(457):     if _m := re.search(
 --- modulename: re, funcname: search
re.py(200):     return _compile(pattern, flags).search(string)
 --- modulename: re, funcname: _compile
re.py(290):     if isinstance(flags, RegexFlag):
re.py(292):     try:
re.py(293):         return _cache[type(pattern), pattern, flags]
Terminated

The "Terminated" message is me killing the stuck process using a kill command. Of course, I can't seem to reproduce this issue in a small program. Here is the snippet of source code that causes the program to freeze:

print('py: ', len(py), py)
if _m := re.search(
    r"""^(f?(?:'''|""\"|'|")(?:[A-Za-z]:)?)((?:(?:[\\/])?(?:[{][^}]+[}])*|[A-Za-z0-9_.-]*)*)[.]pl\b(.*)$""", py
):

I'm using python 3.10.

AKX
  • 152,115
  • 15
  • 115
  • 172
snoopyjc
  • 621
  • 4
  • 11
  • 4
    You have catastrophic backtracking. It’s kind of hard to explain concisely what the hazards are, but look that up. `(?:(?:[\\/])?(?:[{][^}]+[}])*|[A-Za-z0-9_.-]*)*` has too many layers of `*`, for one; just write `(?:(?:[\\/])?(?:[{][^}]+[}])|[A-Za-z0-9_.-])*`. May or may not be the only fix required. The unnecessary `(?:)`-wrapping of single characters/classes makes it harder to read. – Ry- Oct 10 '22 at 04:56
  • try this regex on https://regex101.com/ with the debugger enabled and try to see where is the step that induced the catastrphic backtracking. Try to rewrite the regex using other assumption and to simplify it – robob Oct 10 '22 at 05:13
  • Hmmmm. It works just fine in perl. How do I unbuffer the trace output to see where it’s really getting stuck? This regex is supposed to match a python string starting with a full pathname (on Unix or windows) that points to a perl script. How can I rewrite it so it doesn’t loop? – snoopyjc Oct 10 '22 at 09:31

1 Answers1

3

The catastrophic backtracking here can be observed on a string as simple as f'''c:\folder-name\file name here.pl.

Patterns with indefinitely quantified groupings that contain optional indefinitely quantified patterns inside tend to cause this type of regex issue.

Here, (?:(?:[\\/])?(?:[{][^}]+[}])*|[A-Za-z0-9_.-]*)* is the faulty pattern that can be generalized as (?:a?b*|c*)*. Here, both the alternative can match indefinitely quantified patterns, that - when placed in the middle of a regex - cause the regex engine to try so many ways to match the input string that it takes forever to compute.

The way out in this situation is to make sure you use patterns that match at least one obligatory pattern by removing * quantifiers, e.g. (?:{[^}]+}|[A-Za-z0-9_\\/.-])*.

See the regex demo.

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
  • 1
    Thanks - not sure why it works just fine in perl and also why I can't reproduce it in a small python program, but I'm changing the regex to this in the perl version and it will pythonize to a new working python version! Thanks again!! – snoopyjc Oct 10 '22 at 15:03