1

The following program takes 1.5 seconds to run on my Intel Core i7 machine:

    public static void main(String[] args) {
        String regex = ".*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*b.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*x";
        long start = System.nanoTime();
        boolean match = "aaabccc".matches(regex);
        long end = System.nanoTime();
        System.out.println((end - start) / 1e9);
    }

The regex consists of 30 times ".*" followed by a "b", followed by another 30 times ".*" and finally an "x". The time taken by String.matches to return is exponential on the number of ".*", i.e. it's O(eⁿ). If you replace 30 by 50, it will take more than a minute to return.

This is concerning because I was thinking of having a public HTTP API where the user would be able to enter a regex. If the regex is coming from outside, it is possible that a malicious user could enter a regex that would cause very high CPU on my server for a long time.

Is this a known issue in the Java API and does anyone know of a workaround, i.e. a way to inspect a regex to determine whether it is dangerous to use?

Klitos Kyriacou
  • 10,634
  • 2
  • 38
  • 70
  • It seems only Java and JavaScript are using regex engines that timesout when I tested this on regex101.com. All the other quickly concludes that there is no match. Perhaps you could use PCRE2 (if there are java bindings for it)? – Ted Lyngmo May 11 '23 at 21:37
  • @TedLyngmo PCRE2 took ~713k steps before failing (check *Regex Debugger*). – InSync May 11 '23 at 21:38
  • I don't know Java, but this seems to be related: [Is Java ReDos vulnerable?](https://stackoverflow.com/q/53048859) – InSync May 11 '23 at 21:41
  • 1
    DFA-based regex engines can match patterns like this in linear time (at the expense of usually using more memory). If you want to support tagged subexpressions, you can use a TDFA (tagged DFA), but they're more complex still. You can also do some quick hacks by pre-processing the user's input for obvious problems like this one--since `.*` (once) can match an arbitrary amount of input, you can safely collapse `.*.*` down to a single `.*`. In your example case, that reduces the pattern to `.*b.*x`, which should be fast on essentially any regex engine. – Jerry Coffin May 11 '23 at 21:58
  • 2
    A clever user can pretty easily outwit that simple-minded hack though. You can add more sophisticated hacks to make a DoS attack more difficult, but in the end, it probably makes more sense to just spawn the regex search into a separate process, and limit its resource usage, or use a [DFA-based regex engine](https://github.com/marianobarrios/dregex). – Jerry Coffin May 11 '23 at 22:01
  • @InSync Yes, I noticed but it did finish quickly. Unfortunately one can't check how many steps the other successful engines takes. I suggested PCRE2 because I think it has a lot of language bindings. – Ted Lyngmo May 11 '23 at 22:01

2 Answers2

0

According to Mastering Regular Expressions, 3rd Edition, Java uses a Traditional NFA regex engine that supports lazy quantifiers but has the disadvantage that some patterns can take a very long time to match.

DFA engines, being deterministic, don't suffer from that problem but are less powerful. In particular, lazy quantifiers are not possible with a DFA regex engine such as that found in "awk (most versions), egrep (most versions), flex, lex".

See also DFA Based Regular Expression Engines for Java with Capture.

Simon
  • 10,679
  • 1
  • 30
  • 44
0

SonarQube keeps alarming on regex like .*something.*. Even that must have bad performance, but compared to what?

Coming back to your question: Apparently Sonar has the ability to inspect a regex and determine the risk. While I have not checked where in the code it would decide to alarm, it is open source and can be found at https://github.com/SonarSource/sonarqube

Edit: There seem to be multiple checks for regex patterns in this directory: https://github.com/SonarSource/sonar-java/tree/master/java-checks/src/main/java/org/sonar/java/checks/regex

Edit2: The code uses factored out common classes. Here is another promising specimen: https://github.com/SonarSource/sonar-analyzer-commons/blob/master/regex-parsing/src/main/java/org/sonarsource/analyzer/commons/regex/finders/RedosFinder.java

Queeg
  • 7,748
  • 1
  • 16
  • 42