
The following regular expression throws a StackOverflowError when applied to a large HTML page:

<li.*?>(.|\s)*?</li>

My hypothesis is that the alternation operator (|) makes the matcher recurse once per matched character, so on a large HTML page the recursion gets deep enough to overflow the stack.

Is there any way I can rewrite this regular expression without the "OR" operator (knowing that I want to capture content that may be split over multiple lines, hence the need for \s)?

Many thanks, Tom

Tom
  • Any reason why you are not using a proper HTML/XML parser? – Pshemo May 14 '16 at 20:36
  • [tag:jsoup] is your friend. – Yassin Hajaj May 14 '16 at 20:38
  • I'm not using a proper HTML parser because the input HTML is not well-formed – Tom May 14 '16 at 20:39
  • @Tom Please read [this](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – amiller27 May 14 '16 at 20:45
  • Actually https://stackoverflow.com/questions/701166/can-you-provide-some-examples-of-why-it-is-hard-to-parse-xml-and-html-with-a-reg is a more cogent explanation of the problem. – VGR May 14 '16 at 20:47
  • For example, you could have an html string that looks like this: ` `. It's perfectly valid html, and a regex parser is going to have a miserable time with that. – amiller27 May 14 '16 at 20:48
  • Use this `<(?:(?:/?\w+\s*/?)|(?:\w+\s+(?:(?:(?:"[\S\s]*?")|(?:'[\S\s]*?'))|(?:[^>]*?))+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>` –  May 14 '16 at 21:14
  • `(.|\s)` = `[\S\s]` = `(?s).` –  May 14 '16 at 21:17
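
To make the last comment concrete, here is a minimal Java sketch of the rewrite it suggests: the alternation `(.|\s)` is replaced by a plain `.` with DOTALL turned on via `(?s)`. The class name `ListItemExtractor` and the method `extractItems` are made up for illustration. Two things here are my own assumptions and not from the question: the opening tag is matched with `<li\b[^>]*>` instead of `<li.*?>`, so that `.` in DOTALL mode cannot run past the tag's own `>`, and the item content is put in a capturing group. As far as I know, a lazy quantifier on a single `.` is matched in a loop rather than by recursing per character, so this should avoid the deep recursion, but it is still a regex and will not cope with nested lists or a `>` inside an attribute value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ListItemExtractor {
    // (?s) enables DOTALL, so '.' also matches line terminators;
    // this stands in for the (.|\s) alternation from the question.
    // <li\b[^>]*> is an assumed tightening of <li.*?> (see the note above).
    private static final Pattern LI =
            Pattern.compile("(?s)<li\\b[^>]*>(.*?)</li>");

    // Returns the text between each <li ...> and the next </li>.
    static List<String> extractItems(String html) {
        List<String> items = new ArrayList<>();
        Matcher m = LI.matcher(html);
        while (m.find()) {
            items.add(m.group(1));
        }
        return items;
    }

    public static void main(String[] args) {
        String html = "<ul>\n<li class=\"a\">first\nitem</li>\n<li>second</li>\n</ul>";
        for (String item : extractItems(html)) {
            System.out.println("[" + item + "]");
        }
    }
}
```

If the markup is too broken even for this, a lenient parser such as jsoup (mentioned in the comments above) is probably the safer route.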