Java regular expression: avoiding logical operator

Question

The following regular expression causes a StackOverflowError when applied to a large HTML page:

<li.*?>(.|\s)*?</li>

My hypothesis is that it is due to the logical "OR" operator (|) that creates recursive calls in the matcher and, due to the large html page size that needs to be parsed, it creates the stack overflow.
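A minimal sketch for checking this hypothesis, not a definitive diagnosis. The class name `AlternationOverflowDemo`, the helpers `largeItem` and `overflows`, and the filler text are made up for illustration. Whether the error actually appears depends on the JVM's thread stack size and on the input length, so the program only reports what it observes:

```java
import java.util.regex.Pattern;

public class AlternationOverflowDemo {
    // Builds one large <li> element whose body spans many lines (hypothetical test data).
    static String largeItem(int lines) {
        StringBuilder sb = new StringBuilder("<li>");
        for (int i = 0; i < lines; i++) {
            sb.append("some text\n");
        }
        return sb.append("</li>").toString();
    }

    // Runs find() once and reports whether the matcher ran out of stack.
    static boolean overflows(String regex, String input) {
        try {
            Pattern.compile(regex).matcher(input).find();
            return false;
        } catch (StackOverflowError e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String input = largeItem(50_000);
        // Pattern from the question: a lazily repeated group containing an alternation.
        System.out.println("question pattern overflowed: "
                + overflows("<li.*?>(.|\\s)*?</li>", input));
    }
}
```

If the result is `true` for the pattern above but `false` for a variant that repeats a single character class instead of a group, that would support the idea that the repeated `(.|\s)` group is what drives the recursion.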

Is there any way I can rewrite this regular expression without the "OR" operator (knowing that I want to capture content that is potentially split over multiple lines, hence the need for \s)?

Many thanks, Tom

I'm not using a proper HTML parser because input HTML is not well formed — Tom, May 14 '16 at 20:39
@Tom Please read [this](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) — amiller27, May 14 '16 at 20:45
Actually https://stackoverflow.com/questions/701166/can-you-provide-some-examples-of-why-it-is-hard-to-parse-xml-and-html-with-a-reg is a more cogent explanation of the problem. — VGR, May 14 '16 at 20:47
For example, you could have an html string that looks like this: `
Use this `<(?:(?:/?\w+\s*/?)|(?:\w+\s+(?:(?:(?:"[\S\s]*?")|(?:'[\S\s]*?'))|(?:[^>]*?))+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>` — , May 14 '16 at 21:14

Joop Eggen · Accepted Answer · 2016-05-14T22:16:08.940

The following uses DOTALL mode, (?s), so that the dot . also matches line-break characters.

(?s)<li[^>]*>.*?</li>

It is important, however, that the match cannot backtrack into the opening <li...> tag, hence the [^>]* variation I chose.
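As a rough illustration of how the pattern above might be used from Java, here is a small sketch. The class name `ListItemExtractor`, the method `extract`, and the sample HTML are assumptions made for this example; only the regular expression itself comes from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ListItemExtractor {
    // (?s) enables DOTALL, so '.' also matches line terminators.
    // [^>]* consumes the attributes of the opening tag without running past its '>'.
    private static final Pattern LI = Pattern.compile("(?s)<li[^>]*>.*?</li>");

    // Returns every <li>...</li> element found in the input, in document order.
    static List<String> extract(String html) {
        List<String> items = new ArrayList<>();
        Matcher m = LI.matcher(html);
        while (m.find()) {
            items.add(m.group());
        }
        return items;
    }

    public static void main(String[] args) {
        // Hypothetical sample input with an item split over two lines.
        String html = "<ul>\n<li class=\"a\">foo\nbar</li>\n<li>baz</li>\n</ul>";
        for (String item : extract(html)) {
            System.out.println(item);
        }
    }
}
```

Note that `<li[^>]*>` also accepts any tag whose name merely starts with `li`, such as `<link ...>`, so a stricter opening such as `<li\b[^>]*>` may be preferable if such tags can appear in the input.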

edited May 14 '16 at 22:16

answered May 14 '16 at 20:37

It doesn't seem to work on a simple example ( foor bar – Tom May 14 '16 at 20:58

Yes, I had an extraneous colon, which I have now corrected. – Joop Eggen May 14 '16 at 22:17
