1

I am trying to use a multi-line regex to match all wildcards in a given source string. These strings can be in excess of 70,000 lines and each item is separated by a new line.

I seem to be experiencing huge processing times for my current regex and I can only assume that this is because it is probably poorly constructed and inefficient. If I execute the code on my phone it seems to run for an eternity.

My current regex:

(?im)(?=^(?:\*|.+\*$))^(?:\*[.-]?)?(?:(?!-)[a-z0-9-]+(?:(?<!-)\.)?)+(?:[a-z0-9]+)(?:[.-]?\*)?$

Valid wildcard examples:

*test.com
*.test.com
*test
test.*
test*
*test*

I compile the pattern with:

private static final String WILDCARD_PATTERN = "(?im)(?=^(?:\\*|.+\\*$))^(?:\\*[.-]?)?(?:(?!-)[a-z0-9-]+(?:(?<!-)\\.)?)+(?:[a-z0-9]+)(?:[.-]?\\*)?$";
private static final Pattern wildcard_r = Pattern.compile(WILDCARD_PATTERN);

I look for matches with:

// Wildcards
while (wildcardPatternMatch.find()) {
    String wildcard = wildcardPatternMatch.group();
    myProperty.add(new property(wildcard, providerId));
    System.out.println(wildcard);
}

Are there any changes I can make to improve / optimise the regex or do I need to look at running .replaceAll several times to remove all of the clutter before passing for regex matching?

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
mmotti
  • 341
  • 1
  • 3
  • 12
  • 1
    Probably, moving the first lookahead after `^` will already give you a boost: [`(?im)^(?=\*|.+\*$)(?:\*[.-]?)?(?:(?!-)[a-z0-9-]+(?:(?<!-)\.)?)+(?:[a-z0-9]+)(?:[.-]?\*)?$`](https://regex101.com/r/aIPq35/1) – Wiktor Stribiżew Jul 27 '18 at 14:23
  • 1
    A better pattern is [`(?im)^(?=\*|.+\*$)(?:\*[.-]?)?[a-z0-9](?:[a-z0-9-]*[a-z0-9])?(?:\.[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)*(?:[.-]?\*)?$`](https://regex101.com/r/aIPq35/2). Let know if this fixes the problem – Wiktor Stribiżew Jul 27 '18 at 14:26
  • 2
    Is there one pattern per line? If so, you can streamline things considerably by not trying to cram the whole analysis into a single regular expression. – VGR Jul 27 '18 at 15:58
  • Thanks! I will apply the change and see if if makes any difference. If it does, would you mind explaining the differences? – mmotti Jul 27 '18 at 18:14
  • Each 'host' or wildcard is on a new line. – mmotti Jul 27 '18 at 18:15
  • What do you mean? Does https://regex101.com/r/aIPq35/2 work? If not please explain. – Wiktor Stribiżew Jul 27 '18 at 18:29
  • It works perfectly. Thank you! How-come it's so much faster? I just wanted to know which changes specifically resolved the slow processing? – mmotti Jul 27 '18 at 18:32
  • See my answer below. – Wiktor Stribiżew Jul 27 '18 at 19:19

2 Answers2

2

The pattern you need is

(?im)^(?=\*|.+\*$)(?:\*[.-]?)?[a-z0-9](?:[a-z0-9-]*[a-z0-9])?(?:\.[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)*(?:[.-]?\*)?$

See the regex demo

Main points:

  • The first lookahead should be after ^. If it is before, the check is done before and after each char in the string. Once it is after ^, it is only performed once at the start of a line
  • The (?:(?!-)[a-z0-9-]+(?:(?<!-)\.)?)+ part, although short, is actually killing performance since the (?:(?<!-)\.)? is optional pattern, and the whole pattern gets reduced to (a+)+, a known type of pattern that causes catastrphic backtracking granted there are other subpatterns to the right of it. You need to unwrap it, the best "linear" way is [a-z0-9](?:[a-z0-9-]*[a-z0-9])?(?:\.[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)*.

The rest is OK.

Details

  • (?im) - case insensitive and multiline modifiers
  • ^ - start of a line
  • (?=\*|.+\*$) - the string should either start or end with *
  • (?:\*[.-]?)? - an optional substring matching a * and an optional . or -` char
  • [a-z0-9](?:[a-z0-9-]*[a-z0-9])? - an alphanumeric char followed with an optional sequence of any 0+ alphanumeric chars or/and - followed with an alphanumeric char
  • (?:\.[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)* - 0 or more sequences of a dot followed with the pattern described above
  • (?:[.-]?\*)? - an optional substring matching an optional . or -char and then a*`
  • $ - end of a line.
Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
1

I'd suggest taking a look at https://en.wikipedia.org/wiki/ReDoS#Evil_regexes

Your regex contains a repeated pattern:

(?:(?!-)[a-z0-9-]+(?:(?<!-)\.)?)+ 

Just as a quick example of how this might slow it down, take a look at the processing time on these two examples: exact matches versus having extra characters at the end and even worse, that set repeated several times

Edit: Another good reference

Artichoke
  • 94
  • 1
  • 4
  • No problem, I also noticed when I looked at it that putting **, *-, -- or -* in the inputs caused it to take a long time. I'm not sure if your data would contain those strings by any chance, but it'd be worth looking changing the regex to accommodate that into if it does. – Artichoke Jul 27 '18 at 18:43