Matching with java.util.regex can lead to a stack overflow when a complicated regex (especially one containing |) is applied to a large string.

Is there a way to handle regular expressions more defensively in Java, like

  • having a non-recursive mode for regex
  • throwing a catchable exception for those regular expressions (instead of a java.lang.StackOverflowError),

or any other mechanism that does not kill my program or put it in an unrecoverable state?

J Fabian Meier
  • I advise you to try simplifying the regex instead of finding a solution for the stack overflow error. – Maroun Jun 07 '16 at 08:16
  • True, this is often sensible. But it is actually annoying to come up with a regular expression, let it pass all the unit tests, put it in a large analysis program to see it crash at 2am because some input string was really large. – J Fabian Meier Jun 07 '16 at 08:20
  • This question has been answered several times. Implement some timeout mechanism: http://stackoverflow.com/questions/910740/cancelling-a-long-running-regex-match – Wiktor Stribiżew Jun 07 '16 at 08:22
  • @WiktorStribiżew: True, this is a reasonable approach, but also unsatisfying because I have to guess a time limit depending on my stack size. My programs are usually running at night to do some kind of job. I do not care whether a regex takes 5 minutes as long as it does not crash the JVM. – J Fabian Meier Jun 07 '16 at 08:27
  • What are you trying to do in your program? Maybe you are seeing the problem from a different angle. – aksappy Jun 07 '16 at 08:31
  • Solution two: only use regexps that are written according to the unroll-the-loop principle, linearly, where each preceding subpattern cannot match the same character as the subsequent subpattern. – Wiktor Stribiżew Jun 07 '16 at 08:42

1 Answer

A StackOverflowError can be caught and handled just like any exception. Errors signal serious problems that you normally should not catch, but in this case you know what the cause is and you need to handle it. Just catch it and handle the situation (or re-throw a custom exception).
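A minimal sketch of that idea, assuming you are happy to route all matching through one helper. The names SafeRegex and RegexStackOverflowException are made up for illustration and are not part of java.util.regex:

```java
import java.util.regex.Pattern;

final class SafeRegex {

    /** Hypothetical unchecked exception; the name is not part of any library. */
    static final class RegexStackOverflowException extends RuntimeException {
        RegexStackOverflowException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /**
     * Runs a full match and converts a StackOverflowError raised by the
     * recursive matcher into an ordinary exception the caller can handle.
     */
    static boolean matches(Pattern pattern, CharSequence input) {
        try {
            // A fresh Matcher per call, so no half-updated matcher is reused after a failure.
            return pattern.matcher(input).matches();
        } catch (StackOverflowError e) {
            // The deep recursion has already unwound by the time this handler runs.
            throw new RegexStackOverflowException(
                    "Regex recursed too deeply on input of length " + input.length(), e);
        }
    }
}
```

A nightly job could then catch RegexStackOverflowException around SafeRegex.matches(...), log the offending input, and move on to the next item instead of dying.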

You might also want to consider using the -Xss command line flag to increase your stack size.
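If changing the JVM flags is not an option, a related approach (not part of the suggestion above, just a sketch) is to run the match in a dedicated thread created with the four-argument Thread constructor, which takes a requested stack size. The class name BigStackRegex and the thread name are illustrative, and the Thread documentation notes that the stack size is only a hint that some platforms may ignore:

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.regex.Pattern;

final class BigStackRegex {

    /** Runs pattern.matcher(input).matches() on a worker thread with the requested stack size. */
    static boolean matches(Pattern pattern, CharSequence input, long stackSizeBytes) {
        AtomicReference<Boolean> result = new AtomicReference<>();
        AtomicReference<Throwable> failure = new AtomicReference<>();

        Runnable task = () -> {
            try {
                result.set(pattern.matcher(input).matches());
            } catch (Throwable t) {
                // Includes StackOverflowError if even the larger stack is not enough.
                failure.set(t);
            }
        };

        // stackSizeBytes is a hint to the JVM; it may be rounded or ignored.
        Thread worker = new Thread(null, task, "regex-worker", stackSizeBytes);
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for the regex match", e);
        }

        if (failure.get() != null) {
            throw new IllegalStateException("Regex match failed on the worker thread", failure.get());
        }
        return result.get();
    }
}
```

This only raises the limit for the regex call itself; a sufficiently long input can still overflow, so the failure path is kept.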

Per Huss
  • I agree with the second part. The first part seems a bit sketchy: handling a runtime error is working around the actual issue. It looks to me that the user should validate input and rethink the expressions based on requirements if that is causing runtime errors. – ring bearer Jun 07 '16 at 09:29
  • Yes, @ringbearer, I agree with you, best is to _avoid_ the problem, at least if it can be done at a reasonable cost. In this case, I want to offer a way to prevent damage, which could be a starting point, while waiting for a better solution (which may or may not come)... – Per Huss Jun 07 '16 at 09:38
  • The only correct solution: *only use regexps that are written according to the unroll-the-loop principle, linearly, where each preceding subpattern cannot match the same character as the subsequent subpattern.* – Wiktor Stribiżew Jun 07 '16 at 10:06
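To illustrate the unroll-the-loop idea from the comments, here is a sketch using the textbook case of a double-quoted string with backslash escapes. The pattern and the class name UnrolledQuotedString are my own example, not taken from the question. In practice the alternation form tends to recurse once per character in java.util.regex, while the unrolled form consumes runs of ordinary characters in a simple loop and only enters the group once per escape sequence:

```java
import java.util.regex.Pattern;

final class UnrolledQuotedString {

    // Alternation inside a repeated group: "(?:[^"\\]|\\.)*"
    // Prone to deep recursion on long inputs.
    static final Pattern NAIVE =
            Pattern.compile("\"(?:[^\"\\\\]|\\\\.)*\"");

    // Unrolled form: "[^"\\]*(?:\\.[^"\\]*)*"
    // normal* (special normal*)*, where "normal" and "special" cannot start
    // with the same character.
    static final Pattern UNROLLED =
            Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"");

    /** Returns true if the whole input is one double-quoted string literal. */
    static boolean isQuotedString(CharSequence input) {
        return UNROLLED.matcher(input).matches();
    }
}
```

Both patterns accept the same strings; the difference is only in how much stack the matcher needs on long input.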