I recently had a SonarQube rule (https://rules.sonarsource.com/java/RSPEC-4784) bring to my attention some performance issues which could be used as a denial of service against a Java regular expression implementation.
Indeed, the following Java test shows how slow the wrong regular expression can be:
import org.junit.Test;
public class RegexTest {
@Test
public void fastRegex1() {
"aaaaaaaaaaaaaaaaaaaaaaaaaaaabs".matches("(a+)b");
}
@Test
public void fastRegex2() {
"aaaaaaaaaaaaaaaaaaaaaaaaaaaab".matches("(a+)+b");
}
@Test
public void slowRegex() {
"aaaaaaaaaaaaaaaaaaaaaaaaaaaabs".matches("(a+)+b");
}
}
As you can see, the first two tests are fast, the third one is incredibly slow (in Java 8)
The same data and regex in Perl or Python, however, is not at all slow, which leads me to wonder why it is that this regular expression is so slow to evaluate in Java.
$ time perl -e '"aaaaaaaaaaaaaaaaaaaaaaaaaaaabs" =~ /(a+)+b/ && print "$1\n"'
aaaaaaaaaaaaaaaaaaaaaaaaaaaa
real 0m0.004s
user 0m0.000s
sys 0m0.004s
$ time python3 -c 'import re; m=re.search("(a+)+b","aaaaaaaaaaaaaaaaaaaaaaaaaaaabs"); print(m.group(0))'
aaaaaaaaaaaaaaaaaaaaaaaaaaaab
real 0m0.018s
user 0m0.015s
sys 0m0.004s
What is it about the extra matching modifier +
or trailing character s
in the data which makes this regular expression so slow, and why is it only specific to Java?