Is it actually possible to parse freeform HTML with a regular expression?

Question

Now, before you prepare to write a speech about the perils of HTML parsing with regex: I already know them. This is more of a curiosity question than something I want answered for practical use.

Basically, given a file of HTML in some random, but perfectly valid format, can you parse out the content of <p> tags using a half-sane number of regular expressions? (and also pretending that <p> tags can not be nested or some other minor limitation)

You're saying: "I know that everyone says you shouldn't parse HTML with regex if you want to retain your sanity, but out of curiosity, is everybody lying?" — Lightness Races in Orbit, Jan 07 '11 at 01:59
If the HTML is totally valid and no `<p>` contains any nested tags, then it's relatively simple. Just have to strip all comments, script and such like, then find matching `<p>` tags. If the HTML is not valid, then it can be very difficult. — Orbling, Jan 07 '11 at 02:01
@Tomalak Geret'kal: It is perfectly possible to get bits of information out of an HTML file very efficiently with decent regex (PCRE) engines. Parsing the whole thing is another matter. — Orbling, Jan 07 '11 at 02:03
I don't even bother writing a speech anymore. I just link to http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — cHao, Jan 07 '11 at 02:09
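
The recipe in Orbling's comment (strip comments and scripts, then match the `<p>` tags) might look roughly like the Python sketch below. The names `NOISE`, `PARAGRAPH` and `paragraph_contents` are made up for illustration, and the sketch assumes every `<p>` is explicitly closed, is never nested, and has no `>` inside an attribute value. None of that is guaranteed by valid HTML in general.

```python
import re

# Constructs whose contents might contain text that merely looks like a <p> tag.
NOISE = re.compile(
    r"<!--.*?-->|<(script|style)\b[^>]*>.*?</\1\s*>",
    re.DOTALL | re.IGNORECASE,
)

# An opening <p> with optional attributes, matched lazily up to the next </p>.
# Assumption: paragraphs are explicitly closed and never nested.
PARAGRAPH = re.compile(r"<p\b[^>]*>(.*?)</p\s*>", re.DOTALL | re.IGNORECASE)


def paragraph_contents(html: str) -> list[str]:
    """Return the raw inner text of each <p>...</p> pair (hypothetical helper)."""
    cleaned = NOISE.sub("", html)       # step 1: drop comments, scripts, styles
    return PARAGRAPH.findall(cleaned)   # step 2: collect what is left between <p> tags


sample = (
    '<!-- <p>hidden</p> -->'
    '<script>var s = "<p>x</p>";</script>'
    '<P class="a">One</P>\n<p>Two\nlines</p>'
)
print(paragraph_contents(sample))
```

Without the first step, the commented-out paragraph and the string inside the script would both be reported as content.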

score 1 · Answer 1 · answered Jan 07 '11 at 02:01

Yes, under restrictions like valid HTML and non-nesting, you can use regular expressions for certain uses.

Phrogz

Now you know that, go and find a proper parser and let us never speak of this again. – ijw Jan 07 '11 at 02:02
@ijw so I have a problem where I have this language and the only external library available is regular expressions.. and I need to parse some HTML lol jk :) – Earlz Jan 07 '11 at 02:13

score 1 · Accepted Answer · answered Jan 07 '11 at 02:02

It's certainly possible to extract all the text between {insert character sequence 1 here} and {insert character sequence 2 here} with regular expressions, so long as those sequences aren't overlapping. For example:

/(?<={insert character sequence 1 here}).*?(?={insert character sequence 2 here})/

Of course, it's terribly brittle and will break horribly if what you're running it on is even slightly malformed, or contains either character sequence outside the context where it's meaningful, or any number of other ways. If you oversimplify the problem, then yes you can get away with an oversimplified solution.
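
As a rough illustration in Python, here is that lookbehind/lookahead pattern applied to literal `<p>` and `</p>` delimiters. The helper name `between` and the sample strings are invented for this sketch. Note that Python's `re` module only accepts fixed-width lookbehinds, which a literal delimiter satisfies.

```python
import re


def between(start: str, end: str, text: str) -> list[str]:
    """Return each shortest run of text found between start and end (hypothetical helper)."""
    # re.escape makes both delimiters literal text rather than regex syntax.
    pattern = re.compile(
        "(?<=" + re.escape(start) + ").*?(?=" + re.escape(end) + ")",
        re.DOTALL,  # let "." cross line breaks inside a paragraph
    )
    return pattern.findall(text)


# The well-behaved case: plain, non-nested <p> tags.
print(between("<p>", "</p>", "<p>first</p>\n<div><p>second\nline</p></div>"))

# Brittle case 1: an attribute means the literal "<p>" never appears.
print(between("<p>", "</p>", '<p class="x">missed</p>'))

# Brittle case 2: the delimiters also occur inside a comment.
print(between("<p>", "</p>", "<!-- <p>old</p> --><p>new</p>"))
```

The first call finds both paragraphs, the second finds nothing, and the third also returns the commented-out text, which is the kind of breakage described above.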

score 0 · Answer 3 · answered Jan 07 '11 at 02:07

It depends on what limitations you'd consider minor. XHTML, for one obvious example, is somewhat more amenable to simple parsing. A great deal depends on whether you're thinking in terms of parsing existing HTML, or generating new HTML that could be parsed relatively easily. For the former case, I'd say the restrictions were major -- i.e., you'd need to know a great deal about the specific HTML in question to parse it. For the latter case, I'd say the restrictions were fairly trivial -- i.e., they would only involve how you write the HTML, but would not affect what you could express in HTML.
