Java - Multiline regex bad performance with large string

Question

I am trying to use a multi-line regex to match all wildcards in a given source string. These strings can be in excess of 70,000 lines and each item is separated by a new line.

I seem to be experiencing huge processing times for my current regex and I can only assume that this is because it is probably poorly constructed and inefficient. If I execute the code on my phone it seems to run for an eternity.

My current regex:

(?im)(?=^(?:\*|.+\*$))^(?:\*[.-]?)?(?:(?!-)[a-z0-9-]+(?:(?<!-)\.)?)+(?:[a-z0-9]+)(?:[.-]?\*)?$

Valid wildcard examples:

*test.com
*.test.com
*test
test.*
test*
*test*

I compile the pattern with:

private static final String WILDCARD_PATTERN = "(?im)(?=^(?:\\*|.+\\*$))^(?:\\*[.-]?)?(?:(?!-)[a-z0-9-]+(?:(?<!-)\\.)?)+(?:[a-z0-9]+)(?:[.-]?\\*)?$";
private static final Pattern wildcard_r = Pattern.compile(WILDCARD_PATTERN);

I look for matches with:

// Wildcards
while (wildcardPatternMatch.find()) {
    String wildcard = wildcardPatternMatch.group();
    myProperty.add(new property(wildcard, providerId));
    System.out.println(wildcard);
}

Are there any changes I can make to improve / optimise the regex or do I need to look at running .replaceAll several times to remove all of the clutter before passing for regex matching?

Probably, moving the first lookahead after `^` will already give you a boost: [`(?im)^(?=\*|.+\*$)(?:\*[.-]?)?(?:(?!-)[a-z0-9-]+(?:(?<!-)\.)?)+(?:[a-z0-9]+)(?:[.-]?\*)?$`](https://regex101.com/r/aIPq35/1) — Wiktor Stribiżew, Jul 27 '18 at 14:23
A better pattern is [`(?im)^(?=\*|.+\*$)(?:\*[.-]?)?[a-z0-9](?:[a-z0-9-]*[a-z0-9])?(?:\.[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)*(?:[.-]?\*)?$`](https://regex101.com/r/aIPq35/2). Let know if this fixes the problem — Wiktor Stribiżew, Jul 27 '18 at 14:26
Is there one pattern per line? If so, you can streamline things considerably by not trying to cram the whole analysis into a single regular expression. — VGR, Jul 27 '18 at 15:58
Thanks! I will apply the change and see if if makes any difference. If it does, would you mind explaining the differences? — mmotti, Jul 27 '18 at 18:14
What do you mean? Does https://regex101.com/r/aIPq35/2 work? If not please explain. — Wiktor Stribiżew, Jul 27 '18 at 18:29
It works perfectly. Thank you! How-come it's so much faster? I just wanted to know which changes specifically resolved the slow processing? — mmotti, Jul 27 '18 at 18:32

score 2 · Accepted Answer · answered Jul 27 '18 at 19:19

The pattern you need is

(?im)^(?=\*|.+\*$)(?:\*[.-]?)?[a-z0-9](?:[a-z0-9-]*[a-z0-9])?(?:\.[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)*(?:[.-]?\*)?$

See the regex demo

Main points:

The first lookahead should be after ^. If it is before, the check is done before and after each char in the string. Once it is after ^, it is only performed once at the start of a line
The (?:(?!-)[a-z0-9-]+(?:(?<!-)\.)?)+ part, although short, is actually killing performance since the (?:(?<!-)\.)? is optional pattern, and the whole pattern gets reduced to (a+)+, a known type of pattern that causes catastrphic backtracking granted there are other subpatterns to the right of it. You need to unwrap it, the best "linear" way is [a-z0-9](?:[a-z0-9-]*[a-z0-9])?(?:\.[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)*.

The rest is OK.

Details

(?im) - case insensitive and multiline modifiers
^ - start of a line
(?=\*|.+\*$) - the string should either start or end with *
(?:\*[.-]?)? - an optional substring matching a * and an optional . or -` char
[a-z0-9](?:[a-z0-9-]*[a-z0-9])? - an alphanumeric char followed with an optional sequence of any 0+ alphanumeric chars or/and - followed with an alphanumeric char
(?:\.[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)* - 0 or more sequences of a dot followed with the pattern described above
(?:[.-]?\*)? - an optional substring matching an optional . or -char and then a*`
$ - end of a line.

Excellent. Thanks so much for taking the time to give a detailed explanation. — mmotti, Jul 27 '18 at 19:24

score 1 · Answer 2 · answered Jul 27 '18 at 14:24

1

I'd suggest taking a look at https://en.wikipedia.org/wiki/ReDoS#Evil_regexes

Your regex contains a repeated pattern:

(?:(?!-)[a-z0-9-]+(?:(?<!-)\.)?)+

Just as a quick example of how this might slow it down, take a look at the processing time on these two examples: exact matches versus having extra characters at the end and even worse, that set repeated several times

Edit: Another good reference

answered Jul 27 '18 at 14:24

Artichoke

94
1
4

No problem, I also noticed when I looked at it that putting **, *-, -- or -* in the inputs caused it to take a long time. I'm not sure if your data would contain those strings by any chance, but it'd be worth looking changing the regex to accommodate that into if it does. – Artichoke Jul 27 '18 at 18:43

Java - Multiline regex bad performance with large string

2 Answers2