1

I give to user a possibility to execute regex over a lot of texts on my Java server. It's implemented with Pattern and Matcher Java classes.

In some cases it leads to StackOverflowError. It happens for complex regexps and/or when a lot of matches occures.

Pattern pattern = Pattern.compile(term);
Matcher matcher = pattern.matcher(text);


    java.lang.StackOverflowError: null
        at java.util.regex.Pattern$Branch.match(Pattern.java:4604) ~[na:1.8.0_45-internal]
        at java.util.regex.Pattern$GroupHead.match(Pattern.java:4658) ~[na:1.8.0_45-internal]
...

The question is how can I limit users to use only a small subset of regexps to avoid such errors and unwanted high server's CPU usage. Can it be done with pure Java or any third-party Java library?

Or maybe there is other possible solutions, i.e. just try/catch StackOverflowError?

moon light
  • 23
  • 3
  • Alongwith what has been suggested one other option can be a simple caching layer. Not for the results but for the regex that cause StackOverflow and the users that are writing those regex. ConcurrentHashMap containing Regex as the key and userId as value should be enough. And then if possible go talk with the users and tell them not to run those regex. Otherwise at least check the Map before running the regex again and waiting for a StackOverflow. Can help with CPU load. – Aseem Bansal Dec 06 '15 at 10:22

1 Answers1

3

Try/catch StackOverflowError, plus wrapping the whole thing in an aggressive timeout (say, 1 second), is almost certainly your best bet. It's also by far the simplest option. As you develop your implementation you'll probably need to catch other exception types as well.

Note that, for the timeout to work, you will need to use an interruptible CharSequence implementation rather than a plain String. I've used this stategy successfully with poorly-written, regex-heavy third party libraries before.

The Q&A linked above should help get you started: Cancelling a long running regex match?


The original approach you suggested – to try to detect "bad" patterns up-front – is a very hard problem to solve indeed. Are you familiar with the halting problem? It's only a little less hard than solving that (which is impossible to solve in the general case).

Community
  • 1
  • 1
Matt Ball
  • 354,903
  • 100
  • 647
  • 710
  • Thanks, Matt. I will likely stop on a try/catch approach. But it doesn't protect the server from possible high CPU usage when many users will execute complex regex at the same time. And another thing which I doubt, is that it is not recommended to catch `Error`. – moon light Nov 26 '15 at 21:07
  • 1
    Letting users run regexes on your server makes for a crappy user experience, too. Nobody gets regexes right away, and it takes years to master them. Do everyone a favor and implement a custom search facility, tailored to the job at hand. – Alan Moore Nov 26 '15 at 21:47