Regular Expression in Java Performance Issue

Question

I want to extract all text within the HTML body tags with the following Java code:

Pattern.compile(".*<\\s*body\\s*>(.*?)<\\s*/\\s*body\\s*>.*", Pattern.DOTALL);

..

matcher.find() ? matcher.group(1) : originalText

That works fine for HTML, but for larger texts that contain no HTML (and therefore no body element), e.g. large stack traces, the call to matcher.find() takes a very long time.

Does anyone know what the cause is, and how to make this regular expression perform better?

Thanks in advance!

[Don't parse HTML with regex!](http://stackoverflow.com/a/1732454/418066) — Biffen, Feb 10 '15 at 09:53
This actually matches the whole document while capturing very little. Remove the `.*` at the end of your regex — vks, Feb 10 '15 at 09:54
I do not want to parse any HTML, only extract everything within the BODY. — vhunsicker, Feb 10 '15 at 09:55
You should really look at [regex quantifiers](http://docs.oracle.com/javase/tutorial/essential/regex/quant.html). Don't use *greedy quantifiers* everywhere. — TheLostMind, Feb 10 '15 at 09:56
(Java's regex package is susceptible to ReDoS, see http://en.wikipedia.org/wiki/ReDoS.) The `.*` parts cause the problem in this case, but the `.*?` is probably even worse, as it has to backtrack more often once it finds ``. — Gábor Bakos, Feb 10 '15 at 09:59
I made all quantifiers non-greedy and removed the last `.*`, but finding the matches still takes a long time. A larger text with no body elements is the worst-case scenario for this regex. — vhunsicker, Feb 10 '15 at 10:05
@vhunsicker ‘*I do not want to parse any HTML, only extract everything within the BODY.*’ That *is* parsing. — Biffen, Feb 10 '15 at 10:20
Sure @Biffen. My problem right now is the poor performance when parsing a stack trace that contains no HTML. — vhunsicker, Feb 10 '15 at 10:22

vhunsicker · Accepted Answer · 2015-02-10T11:13:14.227

2

The regex is now:

<\\s*?body\\s*?>(.*?)<\\s*?/\\s*?body\\s*?>

The `.*` at the beginning and the `.*` at the end of the expression were removed, and it now works correctly and quickly. In addition, all quantifiers are now non-greedy.
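
For completeness, here is a minimal sketch of how the final pattern can be used together with the fallback from the question. The class name `BodyExtractor` and the method name `extractBody` are made up for this illustration and are not from my real code. As far as I understand it, the original was slow because `find()` retries the match at every start position, and each time the leading greedy `.*` runs to the end of the input and then backtracks while looking for a body tag that is not there, which is roughly quadratic work on a long stack trace. Without the leading `.*`, an attempt fails right away at any position that does not start with `<`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BodyExtractor {

    // Unanchored pattern: no leading or trailing ".*".
    // DOTALL lets "." match line breaks, so a multi-line body is captured.
    // Compiled once and reused, since Pattern instances are immutable.
    private static final Pattern BODY = Pattern.compile(
            "<\\s*?body\\s*?>(.*?)<\\s*?/\\s*?body\\s*?>", Pattern.DOTALL);

    // Returns the text between the body tags, or the input unchanged
    // when no body element is found (same fallback as in the question).
    static String extractBody(String originalText) {
        Matcher matcher = BODY.matcher(originalText);
        return matcher.find() ? matcher.group(1) : originalText;
    }

    public static void main(String[] args) {
        System.out.println(extractBody("<html><body>Hello</body></html>"));
        System.out.println(extractBody("java.lang.IllegalStateException\n\tat Foo.bar(Foo.java:1)"));
    }
}
```

Because `find()` searches for the pattern anywhere in the input, the surrounding `.*` was never needed to reach the body tags; it would only matter with `matches()`, which requires the whole input to match.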

Thanks for your helpful comments!

edited Feb 10 '15 at 11:13

answered Feb 10 '15 at 10:26

vhunsicker

