1

I noticed that when I match a regular expression like the following one on a text it is a lot slower than the one without preceeding and trailing (.*) parts. I did the same on perl and found that for perl it hardly makes a difference. Is there any way to optimize the original regular expression "(.*)someRegex(.*)" for java?

Pattern p = Pattern.compile("(.*)someRegex(.*)");
Matcher m = p.matcher("some text");
m.matches();

Pattern p = Pattern.compile("someRegex");
Matcher m = p.matcher("some text");
m.matches();

Edit: Here is a concrete example:

(.*?)<b>\s*([^<]*)\s*<\/b>(.*)
hansi
  • 2,278
  • 6
  • 34
  • 42
  • 4
    Why would you want `(.*)someRegex(.*)`? To grab everything *but* someRegex in the text? – David B Sep 10 '12 at 16:43
  • this is just a simplified abstract example. someRegex could be any regular expression. – hansi Sep 10 '12 at 16:49
  • 1
    But to me, the `(.*)` looks pretty useless. If using the second version is much faster, then you might have answered your own question. – David B Sep 10 '12 at 16:50
  • How is it useless if I want to extract the part matched by the group (.*) ? – hansi Sep 10 '12 at 16:52
  • Do you know another way I could get that part using the second version? – hansi Sep 10 '12 at 16:53
  • `someRegex` may be an abstraction that oversimplifies a relevant detail. If `someRegex` is a literal anchor, I doubt you would have a problem. If my suspicion is correct, we would need to know what `someRegex` minimally contains such that the issue can be reproduced before we can give a meaningful answer. – DavidO Sep 10 '12 at 17:04
  • I added an example for someRegex above – hansi Sep 10 '12 at 17:17
  • 1
    Oh. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#answer-1732454 – Axeman Sep 10 '12 at 17:47
  • Updated my answer to suggest using `find` instead of `matches` and skip the `.*` altogether. – zostay Sep 11 '12 at 15:15
  • btw, your concrete example won't work well because you have to escape the backslashes when doing regular expressions in Java – zostay Sep 11 '12 at 15:18

3 Answers3

4

Your best bet is to skip trying to match the front and end of the string at all. You must do that if you use the matches() method, but you don't if you use the find() method. That's probably what you want instead.

Pattern p = Pattern.compile("<b>\\s*([^<]*)\\s*<\\/b>");
Matcher m = p.matcher("some <b>text</b>");
m.find();

You can use start() and end() to find the indexes within the source string containing the match. You can use group() to find the contents of the () capture within the match (i.e., the text inside the bold tag.

In my experience, using regular expressions to process HTML is very fragile and works well in only the most trivial cases. You might have better luck using a full blown XML parser instead, but if this is one of those trivial cases, have at it.

Original Answer: Here is my original answer sharing why a .* at the beginning of a match will perform so badly.

The problem with using .* at the front is that it will cause lots of backtracking in your match. For example, consider the following:

Pattern p = Pattern.compile("(.*)ab(.*)");
Matcher m = p.matcher("aaabaaa");
m.matches();

The match will proceed like this:

  1. The matcher will attempt to suck the whole string, "aaabaaa", into the first .*, but then tries to match a and fails.
  2. The matcher will back up and match "aaabaa", then tries to match a and succeeds, but tries to match b and fails.
  3. The matcher will back up and match "aaaba", then tries to match a and succeeds, but tries to match b and fails.
  4. The matcher will back up and match "aaab", then tries to match a and succeeds, but tries to match b and fails.
  5. The matcher will back up and match "aaa", then tries to match a and fails.
  6. The matcher will back up and match "aa", then tries to match a and succeeds, tries b and succeeds, and then matches "aaa" to the final .*. Success.

You want to avoid a really broad match toward the beginning of your pattern matches whenever possible. Without knowing your actual problem, it would be very difficult to suggest something better.

Update: Anirudha suggests using (.*?)ab(.*) as a possible fix to avoid backtracking. This will short circuit backtracking to some extent, but at the cost of trying to apply the next match on each try. So now, consider the following:

Pattern p = Pattern.compile("(.*?)ab(.*)");
Matcher m = p.matcher("aaabaaa");
m.matches();

It will proceed like this:

  1. The matcher will attempt to match nothing, "", into the first .*?, tries to match a and succeeds, but fails to match b.
  2. The matcher will attempt to match the first letter, "a", into the first .*?, tries to match a and succeeds, but fails to match b.
  3. The matcher will attempt to match the first two letters, "aa", into the first .*?, tries to match a and succeeds, tries to match b and succeeds, and then slurps up the rest into .*, "aaa". Success.

There aren't any backtracks this time, but we still have a more complicated matching process for each forward move within .*?. This may be a performance gain for a particular match or a loss if iterating through the match forward happens to be slower.

This also changes the way the match will proceed. The .* match is greedy and tries to match as much as possible where as .*? is more conservative.

For example, the string "aaabaaabaaa".

The first pattern, (.*)ab(.*) will match "aaabaa" to the first capture and "aaa" to the second.

The second pattern, (.*?)ab(.*) will match "aa" to the first capture and "aaabaaa" to the second.

zostay
  • 3,985
  • 21
  • 30
3

Instead of doing "(.*)someRegex(.*)" , why not just split the string on "someRegex" and get the parts from the resulting array ? This will give you the same result, but much faster and simpler. Java supports splitting by regex if you need it - http://www.regular-expressions.info/java.html

Michael Low
  • 24,276
  • 16
  • 82
  • 119
1

. matches every character

instead of . try limiting your search by using classes like \w or \s.

But I dont' guarantee that it would run fast.

It all depends on the amount of text you are matching!

Anirudha
  • 32,393
  • 7
  • 68
  • 89