
I want to extract all text inside the HTML body tags with the following Java code:

Pattern.compile(".*<\\s*body\\s*>(.*?)<\\s*/\\s*body\\s*>.*", Pattern.DOTALL);

..

matcher.find() ? matcher.group(1) : originalText

That works fine for HTML, but for larger texts that contain no HTML (and therefore no body element), e.g. long stack traces, the call to matcher.find() takes a very long time.

Does anyone know what the cause is, and how to make this regular expression faster?

Thanks in advance!

vhunsicker
  • [Don't parse HTML with regex!](http://stackoverflow.com/a/1732454/418066) – Biffen Feb 10 '15 at 09:53
  • This actually matches the whole document while capturing very little. Remove the `.*` at the end of your regex. – vks Feb 10 '15 at 09:54
  • Use JSoup. Don't use regex to parse HTML. – TheLostMind Feb 10 '15 at 09:54
  • I do not want to parse any HTML, only extract everything within the BODY. – vhunsicker Feb 10 '15 at 09:55
  • You should really look at [regex quantifiers](http://docs.oracle.com/javase/tutorial/essential/regex/quant.html). Don't use *greedy quantifiers* everywhere. – TheLostMind Feb 10 '15 at 09:56
  • (Java's regex package is susceptible to [ReDoS](http://en.wikipedia.org/wiki/ReDoS).) The `.*` parts cause the problem in this case, but the `.*?` is probably even worse, as it has to backtrack more often once it finds `<body>`. – Gábor Bakos Feb 10 '15 at 09:59
  • I made all quantifiers non-greedy and removed the last `.*`, but finding the matches still takes a long time. A larger text with no body element is the worst-case scenario for this regexp. – vhunsicker Feb 10 '15 at 10:05
  • @vhunsicker ‘*I do not want to parse any HTML, only extract everything within the BODY.*’ That *is* parsing. – Biffen Feb 10 '15 at 10:20
  • Sure, @Biffen. My problem right now is the poor performance when processing a stack trace that contains no HTML. – vhunsicker Feb 10 '15 at 10:22

1 Answer


The regular expression is now:

<\\s*?body\\s*?>(.*?)<\\s*?/\\s*?body\\s*?>

The .* at the beginning and at the end of the expression was removed, and now it works correctly and fast. In addition, all quantifiers are now non-greedy.
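For completeness, here is a minimal sketch of how the trimmed expression might be wired up with the same fallback to the original text. The class name `BodyExtractor`, the method name `extractBody`, and the constant `BODY` are made up for this illustration and are not part of the original code. The sketch assumes, as in the question, a plain body tag without attributes in lower case. Because matcher.find() already scans the input for the first match, no surrounding .* is needed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class BodyExtractor {

    // Compiled once and reused. There is no leading or trailing ".*":
    // find() searches for the pattern anywhere in the input by itself.
    // DOTALL lets "." match line breaks inside the body.
    private static final Pattern BODY = Pattern.compile(
            "<\\s*?body\\s*?>(.*?)<\\s*?/\\s*?body\\s*?>", Pattern.DOTALL);

    // Returns the text between the body tags, or the input unchanged
    // when no body element is found (e.g. for a stack trace).
    static String extractBody(String originalText) {
        Matcher matcher = BODY.matcher(originalText);
        return matcher.find() ? matcher.group(1) : originalText;
    }

    public static void main(String[] args) {
        // prints "Hello"
        System.out.println(extractBody("<html><body>Hello</body></html>"));

        // No body element: the input is printed unchanged.
        System.out.println(extractBody(
                "java.lang.IllegalStateException\n\tat Foo.bar(Foo.java:1)"));
    }
}
```

My understanding of why this helps, stated as an assumption rather than a measured fact: with the leading .* and no body element in the text, each start position tried by find() has to scan to the end of the input and backtrack before failing, so the work grows roughly quadratically with the input length. Without it, a start position is rejected as soon as the literal < does not match.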

Thanks for your helpful comments!

vhunsicker