51

I recently had a SonarQube rule (https://rules.sonarsource.com/java/RSPEC-4784) bring to my attention some performance issues which could be used as a denial of service against a Java regular expression implementation.

Indeed, the following Java test shows how slow the wrong regular expression can be:

    import org.junit.Test;

    public class RegexTest {

    @Test
    public void fastRegex1() {
        "aaaaaaaaaaaaaaaaaaaaaaaaaaaabs".matches("(a+)b");
    }

    @Test
    public void fastRegex2() {
        "aaaaaaaaaaaaaaaaaaaaaaaaaaaab".matches("(a+)+b");
    }

    @Test
    public void slowRegex() {
        "aaaaaaaaaaaaaaaaaaaaaaaaaaaabs".matches("(a+)+b");
    }
}

As you can see, the first two tests are fast, the third one is incredibly slow (in Java 8)

Enter image description here

The same data and regex in Perl or Python, however, is not at all slow, which leads me to wonder why it is that this regular expression is so slow to evaluate in Java.

$ time perl -e '"aaaaaaaaaaaaaaaaaaaaaaaaaaaabs" =~ /(a+)+b/ && print "$1\n"'
aaaaaaaaaaaaaaaaaaaaaaaaaaaa

real    0m0.004s
user    0m0.000s
sys     0m0.004s

$ time python3 -c 'import re; m=re.search("(a+)+b","aaaaaaaaaaaaaaaaaaaaaaaaaaaabs"); print(m.group(0))'
aaaaaaaaaaaaaaaaaaaaaaaaaaaab

real    0m0.018s
user    0m0.015s
sys     0m0.004s

What is it about the extra matching modifier + or trailing character s in the data which makes this regular expression so slow, and why is it only specific to Java?

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
AntonPiatek
  • 823
  • 6
  • 11
  • 2
    How often did you run the tests? Try using JMH. – JCWasmx86 Aug 07 '20 at 08:10
  • 2
    Why you use `(a+)+b` why not just `a+b`? if you are looking to just match and not to get the group before the b, then I don't see any reason to use group and + after group! – Youcef LAIDANI Aug 07 '20 at 08:10
  • 6
    On the site linked by you there is a link to OWASP, it explains the problem a little bit more: https://owasp.org/www-community/attacks/Regular_expression_Denial_of_Service_-_ReDoS – Amongalen Aug 07 '20 at 08:16
  • @AntonPiatek, what Java version do you have? – JCWasmx86 Aug 07 '20 at 08:18
  • 29
    For your information: what you're experiencing is called [catastrophic backtracking](https://www.regular-expressions.info/catastrophic.html) – Thomas Aug 07 '20 at 08:18
  • https://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java ....meaning: when you run your tests just once, then you really shouldnt jump to conclusions. – GhostCat Aug 07 '20 at 08:20
  • Looks like python regexps have different semantics. – talex Aug 07 '20 at 08:38
  • This would have been on Java 8 running out of intelliJ - I can easily imagine that this has been improved in later Java versions and will try that when I get a chance – AntonPiatek Aug 07 '20 at 10:22
  • This is not specific to Java. This is an apples to oranges comparison. – JimmyJames Aug 07 '20 at 18:01
  • related: [Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...)](https://swtch.com/~rsc/regexp/regexp1.html) – L. F. Aug 08 '20 at 02:40
  • 3
    Java's `.matches("(a+)b")` is actually equivalent to Perl's `/^(a+)+b\z/ ` – ikegami Aug 08 '20 at 03:31
  • 1
    For the benefit of SEO: some of the regexes fall into the category of evil regex. – TAR86 Aug 08 '20 at 06:52
  • @TAR86 That (probably) won't help (much); comments aren't (officially) indexed by search engines. – wizzwizz4 Aug 09 '20 at 12:18
  • If you're sensitive to execution time of random regular expressions (e.g. you take regular expressions as user input), you absolutely need to use a one-pass engine like https://github.com/google/re2/wiki/WhyRE2 – liori Aug 09 '20 at 16:08

4 Answers4

54

Caveat: I don't really know much about regex internals, and this is really conjecture. And I can't answer why Java suffers from this, but not the others (also, it is substantially faster than your 12 seconds in jshell 11 when I run it, so it perhaps only affects certain versions).

"aaaaaaaaaaaaaaaaaaaaaaaaaaaabs".matches("(a+)+b")

There are lots of ways that lots of as could match:

(a)(a)(a)(a)
(aa)(a)(a)
(a)(aa)(a)
(aa)(aa)
(a)(aaa)
etc.

For the input string "aaaaaaaaaaaaaaaaaaaaaaaaaaaab", it will greedily match all of those as in a single pass, match the b, job done.

For "aaaaaaaaaaaaaaaaaaaaaaaaaaaabs", when it gets to the end and finds that the string doesn't match (because of the s), it's not correctly recognizing that the s means it can never match. So, having gone through and likely matched as

(aaaaaaaaaaaaaaaaaaaaaaaaaaaa)bs

it thinks "Oh, maybe it failed because of the way I grouped the as - and goes back and tries all the other combinations of the as.

(aaaaaaaaaaaaaaaaaaaaaaaaaaa)(a)bs  // Nope, still no match
(aaaaaaaaaaaaaaaaaaaaaaaaaa)(aa)bs  // ...
(aaaaaaaaaaaaaaaaaaaaaaaaa)(aaa)bs  // ...
...
(a)(aaaaaaaaaaaaaaaaaaaaaaaaaaa)bs  // ...
(aaaaaaaaaaaaaaaaaaaaaaaaaa(a)(a)bs  // ...
(aaaaaaaaaaaaaaaaaaaaaaaaa(aa)(a)bs  // ...
(aaaaaaaaaaaaaaaaaaaaaaaa(aaa)(a)bs  // ...
...

There are lots of these (I think there are something like 2^27 - that's 134,217,728 - combinations for 28 as, because each a can either be part of the previous group, or start its own group), so it takes a long time.

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Andy Turner
  • 137,514
  • 11
  • 162
  • 243
  • 1
    If I'm not mistaken there are `2^n` possible combinations, where `n` is equal to amount of `a`s in input string. – Amongalen Aug 07 '20 at 08:24
  • @Amongalen yes, I was just pondering that and edited in as you commented. – Andy Turner Aug 07 '20 at 08:24
  • 4
    "And I can't answer why Java suffers from this but not the others" Python has similarly poor performance (maybe worse). It's just that the Python `search` is not doing the same thing as Java `matches`. – JimmyJames Aug 07 '20 at 18:09
  • 2
    Re. "this is really conjecture": This is essentially correct. That's how Java regexes work. Although the order of the attempts is not quite right. – Matt Timmermans Aug 07 '20 at 22:20
  • 4
    Re "*I can't answer why Java suffers from this but not the others*", Well, for starters, `"aaaaaaaaaaaaaaaaaaaaaaaaaaaabs" =~ /(a+)+b/` matches with no backtracking whatsoever. `/^(a+)+b\z/` should have been used. That said, `/^(a+)+b\z/` is also very fast because the optimizer looks for `ab` right off the bat, and realizes the pattern can't possibly match. You can see that using `perl -Mre=debug -e '"aaaaaaaaaaaaaaaaaaaaaaaaaaaabs" =~ /^(a+)+b\z/'` – ikegami Aug 08 '20 at 03:33
20

I don't know Perl too well but the Python version is not equivalent to the Java one. You are using search() but the Java version is using matches(). The equivalent method in Python would be fullmatch()

When I run your examples in Python (3.8.2) with search() I get quick results as you do. When I run it with fullmatch() I get poor (multi-second) execution time. Could it be that your Perl example is also not doing a full match?

BTW: if you want to try the Java version of search you would use:

Pattern.compile("(a+)+b").matcher("aaaaaaaaaaaaaaaaaaaaaaaaaaaabs").find();

There might be some slight difference in the semantics but it should be close enough for this purpose.

JimmyJames
  • 1,356
  • 1
  • 12
  • 24
  • 1
    The Perl code is equivalent to the Python `search()` code: if any substring matches the regular expression, it's considered a successful match. – Mark Aug 07 '20 at 21:15
  • @Mark Thanks. Would you happen to know the equivalent of 'fullmatch' in Perl? – JimmyJames Aug 07 '20 at 21:44
  • 1
    In Perl, you'd do it by changing the regular expression to include start and end anchors: `/^(a+)+b$/`. – Mark Aug 07 '20 at 21:47
  • 2
    It's very fast in perl v5.30: `echo aaaaaaaaaaaaaaaaaaaaaaaaaaaabs | perl -wne '/^((a+)+b)$/'` – kubanczyk Aug 07 '20 at 22:00
  • 8
    Re "*Could it be that your Perl example is also not doing a full match?*", Correct. The Perl equivalent would be `/^(a+)+b\z/` (not `/^(a+)+b$/`). That said, the optimizer realizes the pattern can't possibly match before it even starts matching, and aborts. So unlike Java and Python, `"aaa...aaabs" =~ /^(a+)+b\z/` fails instantly. You can see this using `perl -Mre=debug -e '"aaaaaaaaaaaaaaaaaaaaaaaaaaaabs" =~ /^(a+)+b\z/'` (`Did not find floating substr "ab"$... Match rejected by optimizer`) – ikegami Aug 08 '20 at 03:40
  • @ikegami Thanks. Given the deep history of Perl and regex, it doesn't surprise me too much that it is 'smarter' than the Java and Python implementations. – JimmyJames Aug 10 '20 at 15:59
14

The extra + causes a lot of backtracking (in a naive regexp implementation) when the string cannot be matched. If the string can be matched, the answer is known in the first try. This explains why the case 2 is fast and only case 3 is slow.

Henry
  • 42,982
  • 7
  • 68
  • 84
  • I agree with the backtracking, but would it really be as slow as 12 seconds?! – f1sh Aug 07 '20 at 08:14
  • I couldn't reproduce the slow regex with a quick and dirty script (50000 iterations): 13055 ns (Measured with System.nanoTime()) – JCWasmx86 Aug 07 '20 at 08:15
  • @JCWasmx86 What input string did you use? – Amongalen Aug 07 '20 at 08:16
  • @Amongalen To measure: `long l=System.nanoTime();for(int i=0;i<50000;i++)"aaaaaaaaaaaaaaaaaaaaaaaaaaaabs".matches("(a+)+b");long l1=System.nanoTime();System.out.println(((double)l-l1)/50000);` – JCWasmx86 Aug 07 '20 at 08:17
  • 1
    Now with 5 million iterations, it's still around 10000 ns – JCWasmx86 Aug 07 '20 at 08:17
  • @f1sh there is really a lot of ways the `a`s can be matched, see also Andy Turner's answer. – Henry Aug 07 '20 at 08:22
  • @JCWasmx86 keep in mind that JVM might optimize that code, especially that you never use the results. – Amongalen Aug 07 '20 at 08:22
  • @Amongalen Wait some minutes, I'm running a JMH benchmark now – JCWasmx86 Aug 07 '20 at 08:25
  • @Henry I am aware of that, but 12 seconds is a veeeery long time to evaluate rules of a type3 language with word of that length – f1sh Aug 07 '20 at 08:35
  • @Amongalen Here are the test results from JMH (Shrinked a bit): https://pastebin.com/2DjUa1zL – JCWasmx86 Aug 07 '20 at 08:51
  • @f1sh, You can easily construct a string/pattern combination that would take longer than the current age of the Universe to complete. – ikegami Aug 08 '20 at 03:38
9

The site https://swtch.com/~rsc/regexp/regexp1.html has some detailed information on regular expression implementation techniques and the theory behind them. I know link only answers are bad, but this is worth reading, showing an example regular expression that completes in 30 micro seconds with the better implementation, and 60 seconds (2 million times slower) with the better known and more obvious way.

It says

"Today, regular expressions have also become a shining example of how ignoring good theory leads to bad programs. The regular expression implementations used by today's popular tools are significantly slower than the ones used in many of those thirty-year-old Unix tools."

Other answers saying that the extra + causes too much backtracking are correct, but only if you ignore the good theory.

icarus
  • 459
  • 3
  • 7
  • 4
    Using an NFA will run into *exactly the same issue* here as the more complex algorithms. (An NFA with n states can have up to 2^n paths through it). You have to understand the particular pathological RE used in that paper why an NFA is faster on that one and why that doesn't apply in the general case. – Voo Aug 08 '20 at 08:50
  • @Voo Why does the number of paths through an NFA matter? To simulate an NFA, all you have to store is the set of possible states at each step, that set will be of size at most n. – gmatht Aug 09 '20 at 06:51
  • @gmath The runtime depends on the number of possible paths through the automaton and not on the number of states. The point is if you're not running some optimizer that recognises certain patterns (and you can surely trick the optimizer) certain regular expressions will simply have 2^n runtime and which algorithm you pick doesn't matter. – Voo Aug 09 '20 at 17:46
  • @Voo All regular expressions can be run in linear time, not as you assert 2^n. It is not just an optimizer that you can trick. Over time people have added to what they call regular expressions so linear time is no longer a property that can be relied upon, but then they are no longer REs. – icarus Aug 09 '20 at 20:09
  • @Voo No. For example a NFA can be converted to a DFA, which is clearly linear in the length of input. You can read up about NFA on wikipedia. https://en.wikipedia.org/wiki/Nondeterministic_finite_automaton#Implementation – gmatht Aug 10 '20 at 07:03
  • @gmatht Yeah not sure what I was thinking there. I'm putting it down to commenting before the first coffee in the morning. – Voo Aug 10 '20 at 07:40