1

I need a regex that checks is it a quoted (by ') string with possible escaped \' inside. So, I come up with the following regex, \'(\\.|[^\'])*\'.

"""\'(\\.|[^\'])*\'""".r.findFirstIn(s"'${"a"*100}'")

which works perfectly on small strings, but fails with stack overflow on size > 3000 bytes.

"""\'(\\.|[^\'])*\'""".r.findFirstIn(s"'${"a"*5000}'")

This is Scala snippets. Internally it runs java.util.regex, so it's java/jvm problem.

In my knowledge, those simple regex should not cause stack overflow, it's a simple DFA/NFA without any recursion inside.

How to workaround this issue?

I need regex for that (this is part of parser-combinator code, I can not just write custom code that checks the property).

Why there is recursion inside?

Pavel Ajtkulov
  • 515
  • 7
  • 21
  • 1
    Does [this](https://stackoverflow.com/questions/50710043/regex-for-multiline-string-literals-produces-stackoverflowerror) contain an answer to your question? ... Or maybe one should answer this one separately, escaping with backslashes seems more common than triple-double-quotes. – Andrey Tyukin Dec 03 '18 at 22:35
  • More specifically: does `"""\'(\\.|[^\'])*+\'""".r.findFirstIn(s"'${"a"*5000}'")` solve your problem? – Andrey Tyukin Dec 03 '18 at 22:43
  • thanks. It works, going to dig into links. A bit more sophisticated than I thought – Pavel Ajtkulov Dec 03 '18 at 22:59
  • If that works you can wrap the whole thing inside an atomic group `\'(?>\\.|[^\'])*\'` The issue is going to be the alternations that cause the stack problem If you use the unrolled-loop version it gets rid of the alternation. –  Dec 03 '18 at 23:01

2 Answers2

2

You can try the classical Unrolling the Loop technique outlined by J. Friedl:

'                              # the start delimiter
 ([^\\']*                      # anything but the end of the string or the escape char
         (?:\\.                #     the escape char preceding an escaped char (any char)
               [^\\']*         #     anything but the end of the string or the escape char
                      )*)      #     repeat
                             ' # the end delimiter

Regex101 Demo

wp78de
  • 18,207
  • 7
  • 43
  • 71
0

It might be related to RegEx DOS.

Java uses the traditional NFA algorithm [1] to support features such as lazy, backtracking and backreference. NFA 'eats in' a character each time and tries to match it with regexp, and 'spits' it out if it does not match. It will keep spitting until it can find another match (similar to deep first search), and thus bad expressions might cause the RegEx engine to encounter a RegEx DOS, and specifically in Java, it will finally cause a stack overflow for long strings.

According to OWASP, evil regexp expressions contain: Evil Regex pattern contains:

  • Grouping with repetition (1)
  • Inside the repeated group:
    • Repetition
    • Alternation with overlapping (2)

After a brief examination of your regexp expression, it seems that you have (1) and (2) since you have ()* (repetition) and \\.|[^\'] (overlapping), thus I believe you may have to restructure your RegEx expression to avoid RegEx DOS.

RZ PAN
  • 76
  • 5