Why use DOM to parse webpages instead of regex?

Question

I've been searching for questions about finding content in a page, and a lot of answers recommend using the DOM instead of regex when parsing webpages. Why is that? Does it improve processing time, or is there another reason?

Have a look at http://stackoverflow.com/questions/701166/can-you-provide-some-examples-of-why-it-is-hard-to-parse-xml-and-html-with-a-reg — Qtax, Apr 04 '12 at 09:57
@Qtax - Really? I thought [this](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) was much more convincing... — Oded, Apr 04 '12 at 09:59
Please read this great answer: [Oh Yes You Can Use Regexes to Parse HTML!](http://stackoverflow.com/a/4234491/626273) (and also [this one](http://stackoverflow.com/a/4234582/626273)) — stema, Apr 04 '12 at 10:11
@stema I think it's important to point out that, while that first answer is very well written, it is not merely "using a regex to parse HTML". It's a program which uses regexes for syntactic matching, but there's much more to it than that core... look at the `parse_input_tags` method of the provided Perl program. — Borealid, Apr 04 '12 at 10:38
@Borealid Of course you have to read that answer and not only the headline in my comment, and also tchrist's comment on the second link I provided. I think the basic misunderstanding is the assumption that you can "parse" anything with a single regex; a regex only ever matches a pattern. There is a lack of understanding of what parsing does and what a regex does. — stema, Apr 04 '12 at 10:45
@Borealid, my thoughts exactly. It's a bad example. You could use regex to parse HTML, but you would not get a parse tree; a regex can't generate such a thing without other code. The expression could, however, be made to understand the full source structure (on a deeper level than tokens) and built to extract the (flat) parts that you are interested in. — Qtax, Apr 04 '12 at 10:46

score 7 · Accepted Answer · answered Apr 04 '12 at 09:59

A DOM parser is actually parsing the page.

A regular expression is searching for text, not understanding the HTML's semantic meaning.

It is provable that HTML is not a regular language; therefore, it is impossible to create a regular expression that will parse all instances of an arbitrary element-pattern from an HTML document without also matching some text which is not an instance of that element-pattern.
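As a rough illustration of the nesting problem (a minimal sketch, not a proof), the Python snippet below runs a simple non-greedy regex and a depth-tracking parser over the same nested markup. The sample string and the `OuterDivText` class are made up for this example; only the standard-library `re` and `html.parser` modules are used.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample: a <div> nested inside another <div>.
HTML = "<div>outer <div>inner</div> tail</div>"

# The non-greedy pattern stops at the first closing tag it sees,
# so the outer element is cut short.
regex_match = re.search(r"<div>(.*?)</div>", HTML, re.S).group(1)


class OuterDivText(HTMLParser):
    """Collects the text inside the outermost <div> by tracking nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


parser = OuterDivText()
parser.feed(HTML)
parser.close()
parser_text = "".join(parser.chunks)

print(repr(regex_match))   # 'outer <div>inner'
print(repr(parser_text))   # 'outer inner tail'
```

Making the regex greedy fixes this one input but then over-matches as soon as two sibling `<div>` elements appear, which is the trade-off the paragraph above describes.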

You may be able to design a regular expression which works for your particular use case, but foreseeing exactly the HTML with which you'll be provided (and, consequently, how it will break your limited-use-case regex) is extremely difficult.

Additionally, a regex is harder to adapt to changes in a page's contents than an XPath expression, and the XPath is (in my mind) easier to read, as it need not be concerned with syntactic odds and ends like tag openings and closings.
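To make the adaptability point concrete, here is a small Python sketch. It assumes two hypothetical versions of a page fragment (`PAGE_V1`, `PAGE_V2`) and uses the standard-library `xml.etree.ElementTree`, which supports only a subset of XPath and needs well-formed input; for real-world HTML you would likely reach for an HTML-aware library instead.

```python
import re
import xml.etree.ElementTree as ET

# Two versions of the same made-up fragment. The second adds an id
# attribute and switches quote style, as a redesign might.
PAGE_V1 = '<ul><li class="price">10</li><li class="name">Tea</li></ul>'
PAGE_V2 = "<ul><li id='p1' class='price'>10</li><li class='name'>Tea</li></ul>"

# The regex has to spell out the exact tag syntax.
PRICE_REGEX = re.compile(r'<li class="price">(.*?)</li>')
# The XPath only states what is wanted: an <li> whose class is "price".
PRICE_XPATH = ".//li[@class='price']"


def price_by_regex(page):
    match = PRICE_REGEX.search(page)
    return match.group(1) if match else None


def price_by_xpath(page):
    node = ET.fromstring(page).find(PRICE_XPATH)
    return node.text if node is not None else None


for page in (PAGE_V1, PAGE_V2):
    print(price_by_regex(page), price_by_xpath(page))
# 10 10
# None 10
```

The regex could be patched to tolerate extra attributes and both quote styles, but each patch makes it longer and harder to read, while the XPath stays the same.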

So, instead of using the wrong tool for the job (a text parsing tool for a structured document) use the right tool for the job (an HTML parser for parsing HTML).

Borealid

"Regexes" are not regular, e.g. http://stackoverflow.com/questions/7434272/match-an-bn-cn-e-g-aaabbbccc-using-regular-expressions-pcre – Qtax Apr 04 '12 at 10:08
@Qtax Yeah, in theory. But that doesn't make them suitable for parsing real context-free languages, such as HTML, you know. It's even looser than free-form XML, and I dare you to come up with a regex that parses XML 1.1 properly. Or even an arbitrary nontrivial language built on XML. – Apr 04 '12 at 10:12
@delnan, whether it's suitable or not depends on the situation and the language. Have a look at the comments on the question. I didn't say it's suitable to use regex to parse HTML (in general); there are better tools for that. Oh, and I dare you to build an airplane with the same specs as an A380. [...] Just because you can't do it doesn't mean that it's impossible. – Qtax Apr 04 '12 at 10:29
@Qtax: Without doubt, all kinds of hard *examples* of CFGs are possible to do with regex. The dare wasn't specifically directed at you; anyone feel free to try it. And it doesn't have to be impossible (even I might manage, if you hold me at gunpoint, give me a few months and give me internet access), but I'm certain the result will be horrific, unmaintainable, riddled with corner cases and restrictions, etc., and may exhibit exponential complexity (several regex extensions are prone to that) that makes it take years for documents larger than a few kB. – Apr 04 '12 at 10:34
Would DOM be more efficient than regex? – Jürgen Paul Apr 04 '12 at 11:18
@user1099531 In terms of what? In programmer time, almost certainly. In runtime, probably not. But do you care about that? I would urge you to not consider "performance" at the cost of anything else until you know, for certain, that you have a need. – Borealid Apr 04 '12 at 12:33

score 1 · Answer 2 · answered Apr 04 '12 at 10:32

I can't stand hearing that "HTML is not a regular language ..." anymore. Regular expressions (as used in today's languages) aren't regular either.

The simple answer is:

A regular expression is not a parser. It describes a pattern and matches that pattern, but it has no idea about the document structure. You can't parse anything with one regex. Of course, regexes can be part of a parser; I don't know for certain, but I assume nearly every parser uses regexes internally to find certain subpatterns.
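A toy sketch of that division of labour, in Python: one regex finds the tokens, and ordinary code with a stack supplies the structure. The `TOKEN` pattern and `parse` function are invented for this example and assume tiny, well-formed markup; this is not how any particular real parser is implemented.

```python
import re

# The regex only recognises tokens: an opening tag, a closing tag, or text.
# It knows nothing about which tag is nested inside which.
TOKEN = re.compile(r"<(/?)(\w+)[^>]*>|([^<]+)")


def parse(markup):
    """Build a nested (tag, children) tree from well-formed toy markup."""
    root = ("root", [])
    stack = [root]  # the stack, not the regex, keeps track of nesting
    for closing, tag, text in TOKEN.findall(markup):
        if text:
            stack[-1][1].append(text)
        elif closing:
            stack.pop()  # assumes tags are balanced
        else:
            node = (tag, [])
            stack[-1][1].append(node)
            stack.append(node)
    return root


tree = parse("<p>Hello <b>world</b>!</p>")
print(tree)
# ('root', [('p', ['Hello ', ('b', ['world']), '!'])])
```

The tree comes from the stack handling around the regex; take that code away and all that is left is a flat list of matches.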

If you can build that pattern for the stuff you want to find inside HTML, fine, use it. But very often you will not be able to create this pattern, because it's practically impossible to cover all the corner cases, or conditions such as "find all links, but only if they are green and not pink".

In most cases it's a lot easier to use a parser that understands the structure of your document and also accepts a lot of "broken" HTML. It makes it easy to access all links, or all elements of a certain table, and so on.
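For instance, here is a minimal sketch of collecting every link from deliberately sloppy markup with Python's standard-library `html.parser`. The `BROKEN` sample and the `LinkCollector` class are made up for illustration; other tolerant parsers would look similar in spirit.

```python
from html.parser import HTMLParser

# Deliberately sloppy markup: unclosed <p> and <b>, an unquoted attribute,
# mixed-case tag and attribute names, and a final link that is never closed.
BROKEN = """
<P>See <a href=docs.html>the docs<b>
<p>or <A HREF="/faq" class=x>the FAQ</A>
<a href='/contact'>contact
"""


class LinkCollector(HTMLParser):
    """Records the href of every <a> start tag it encounters."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


collector = LinkCollector()
collector.feed(BROKEN)
collector.close()
links = collector.links
print(links)  # ['docs.html', '/faq', '/contact']
```

A regex for the same job would have to handle the three quoting styles and the case differences explicitly, whereas here the parser normalises them before your code sees anything.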

score -1 · Answer 3 · answered Apr 04 '12 at 09:58

To my mind, it's safer to use regex on pages whose content you don't control: the HTML might not be well formed, and then a DOM parser can fail.

Edit:
Well, considering what I just read, you should probably use regex only for very small tasks, like getting all links of a document, etc.

haltabush

This phrase is the wrong way around! Regex is undoubtedly less reliable than the DOM for retrieving HTML; if the HTML is not formed properly, then the page shouldn't work anyway! – Ben Carey Apr 04 '12 at 10:02
There are DOM parsers (in every browser, and in libraries like [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/)) that do a really good job at *not* breaking on invalid HTML but rather making the most of it. That gives you all the power without taking hours of regex twiddling, and doesn't leave you in fear and uncertainty that your stuff will break on the next page to come around. – Apr 04 '12 at 10:07
