What would it take to evolve regex into something that can parse HTML?

Question

Reading this amusing rant (RegEx match open tags except XHTML self-contained tags), I wondered: how could regexes be changed to successfully parse HTML?

I'm looking here for suggestions that:

make the minimal addition to regexes as we know and love them (i.e. not "make them look like XSLT!" type answers)
are robust enough to work properly
suggest syntax (not just list the general requirements)

Has anyone actually made something like this?

I'm not sure I see the point of this (instead of using a DOM parser in the first place)? — Pekka, Feb 07 '11 at 15:04
Strict HTML or sloppy HTML? The first is a whole lot easier than the second, even though there is more of the second than of the first! — Jonathan Leffler, Feb 07 '11 at 15:06

score 2 · Answer 1 · answered Feb 07 '11 at 15:17

Add a new escape sequence:

\H -- match HTML document

pdc

Now that's a cheap answer. About as cheap as "The shortest hello world program is in a language that outputs that text for every input". It would also require a whole HTML parser in the regex engine, which is rather impractical. – Feb 07 '11 at 15:19
'Parse an HTML document' is not really a meaningful requirement. Depending on what information you want to extract, you will need to write loops or conditionals in the host language in any event. In some cases you will need to maintain a stack of open elements while navigating the document. Trying to shoehorn all this into regex syntax seems rather pointless. – pdc Feb 07 '11 at 15:25

score 1 · Answer 2 · answered Feb 07 '11 at 15:15

DOM/XML parsers internally use regex to parse HTML. What they add on top of regex ALONE is there to make up for the shortcomings of regex. One of the major shortcomings of regex is handling nested tags and malformed code (like missing tags). So around the basic regex, all sorts of algorithms and conditions are written to try to handle those things. And then of course there are the parts that actually create an object out of it.

So you asked what it would take to make regex do what a DOM/XML parser does? You would have to somehow cram all those algorithms and conditions into the regex engine, internally and within pattern syntax.
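To make the nesting shortcoming concrete, here is a minimal sketch in Python (the sample strings and variable names are made up for illustration). Neither the lazy nor the greedy quantifier can pair an opening tag with its own closing tag, because the pattern has no way to count depth.

```python
import re

html = "<div>outer <div>inner</div> tail</div>"

# Lazy: stops at the FIRST closing tag, so the outer element is cut short.
lazy = re.search(r"<div>(.*?)</div>", html).group(1)

# Greedy: runs to the LAST closing tag. That happens to look right here,
# but only because the input holds a single top-level element.
greedy = re.search(r"<div>(.*)</div>", html).group(1)

# With two sibling elements the greedy version swallows both of them.
siblings = "<div>a</div><div>b</div>"
greedy_siblings = re.search(r"<div>(.*)</div>", siblings).group(1)

print(lazy)             # outer <div>inner
print(greedy)           # outer <div>inner</div> tail
print(greedy_siblings)  # a</div><div>b
```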

I personally do not wish for this to happen. IMO regex should be straight pattern matching. It already has some stuff in it that is questionable (some regex flavors do in fact have a way to use conditions, for instance). Taking the regex engine and then building a larger tool around it (like a DOM/XML parser) is the best way to go.

and anyways...if you have the aptitude to master regex (which it seems in my experience most people don't), then learning the basics of http://www.php.net/dom is a walk in the park. — CrayonViolent, Feb 07 '11 at 15:26

score 0 · Answer 3 · answered Feb 07 '11 at 15:10

It's interesting that real world tools can be and often are modified to perform tasks they might not otherwise be suited for. For example, if someone were to attempt to eat broth with a fork, they would be largely unsuccessful. Enter the spork.

I don't think programmers necessarily work that way all the time. It's not uncommon for tools to expand their scope, but it's also been a long tradition that programmers try to use specific tools for specific purposes.

Now, it just so happens that in order for regex to be able to parse HTML, it would have to be a pattern matcher/recognizer that also remembered state. This is, to a T, exactly what a parser does. It uses pattern matching (indeed, it often uses regex!) in order to match tokens. It then remembers combinations of tokens.

So in fact regex is used very frequently to parse HTML, along with other functions that remember larger patterns that cannot be described or processed using regex alone.
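A rough sketch of that division of labour, in Python: a regex recognizes individual tokens, and a plain stack in the host language remembers which elements are open. The names `TOKEN` and `parse` are hypothetical, and the token pattern is deliberately simplified. It assumes strict, well-nested markup and knows nothing about comments, void elements such as `<br>`, or `>` inside attribute values, so treat it as an illustration rather than a usable HTML parser.

```python
import re

# Hypothetical, simplified token pattern: a closing tag, an opening tag,
# or a run of text. Stray '<' characters that fit none of these are skipped.
TOKEN = re.compile(r"</(?P<close>\w+)\s*>|<(?P<open>\w+)[^>]*>|(?P<text>[^<]+)")

def parse(html):
    """Build a nested dict tree from strict, well-nested markup."""
    root = {"tag": "#root", "children": []}
    stack = [root]  # the "remembered state": elements currently open
    for m in TOKEN.finditer(html):
        if m.group("open"):
            node = {"tag": m.group("open"), "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif m.group("close"):
            # A closing tag must match the innermost open element.
            if len(stack) == 1 or stack[-1]["tag"] != m.group("close"):
                raise ValueError(f"unexpected </{m.group('close')}>")
            stack.pop()
        elif m.group("text").strip():
            stack[-1]["children"].append(m.group("text").strip())
    if len(stack) != 1:
        raise ValueError(f"unclosed <{stack[-1]['tag']}>")
    return root

tree = parse("<div>a<p>b</p>c</div>")
print(tree["children"][0]["tag"])  # div
```

The regex never sees more than one token at a time; everything about nesting lives in the stack, which is the part a regex alone cannot express.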

Hope that answers the question.

Broth? Sporks are for eating MREs! Or Enchiritos. – Alan Moore Feb 07 '11 at 18:27

score 0 · Answer 4 · answered Feb 07 '11 at 15:10

Perl 6 has a regex extension that is designed to do that: http://en.wikipedia.org/wiki/Perl_6_rules.

Jeremiah Willcock

The theoretical guys already whine that PCRE aren't regular. *This* is not even remotely regex any more... – Feb 07 '11 at 15:20

score 0 · Answer 5 · answered Feb 07 '11 at 15:13

Depends what you mean by "parse". Typically this involves transforming a character stream into an object tree. To do this with regular expressions you would need to completely change capturing groups to be a runtime-variable multi-node tree, rather than the compile-time-fixed array that they currently are. Once you've done that, you've just re-written lex/yacc.
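The "compile-time-fixed array" point can be seen with Python's standard `re` module, used here only as one example flavor (the sample input is made up). The number of groups is decided when the pattern is compiled, and a group inside a repetition keeps only its last match.

```python
import re

pattern = re.compile(r"(?:<(\w+)>)+")
m = pattern.match("<a><b><c>")

# The group count is a property of the compiled pattern, not of the input.
print(pattern.groups)  # 1

# The whole input matched, but the single group holds only the last iteration.
print(m.group(0))      # <a><b><c>
print(m.groups())      # ('c',)

# findall recovers every tag name, but as a flat list with no nesting.
flat = re.findall(r"<(\w+)>", "<a><b><c>")
print(flat)            # ['a', 'b', 'c']
```

Either way the result is a fixed-size tuple or a flat list, never a tree whose shape depends on the input.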
