
My experience tells me that one should not use RegExp to parse HTML/XML, and I completely agree! It's

  • Messy
  • Not robust and easily broken
  • Pure evil

They all say "use a DOM parser" of some sort, which is fine by me. But now I'm curious: how do those work?

I searched for the source of the DOMDocument class and couldn't find it.

This question comes from the fact that filter_var(), for instance, is considered a good alternative to validating emails with RegExp, but when you look at the source, you'll see it actually uses RegExp itself!
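
For instance (a minimal usage sketch; whatever pattern it applies internally is hidden from me):

<?php
// Validate an address with the built-in filter instead of a hand-rolled regex.
$email = 'user@example.com';

if (filter_var($email, FILTER_VALIDATE_EMAIL) !== false) {
    echo "Looks valid\n";
} else {
    echo "Not a valid address\n";
}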

So, if you were to build a DOM parser in PHP, how would you go about parsing the HTML? How did they do it?

Madara's Ghost
  • DOM parsers are normally implemented as tokenizers. If you can read C#, the source code for the [HTML Agility Pack](http://htmlagilitypack.codeplex.com) can make the approach clear. (A toy sketch of the tokenizer idea follows these comments.) – Oded May 05 '12 at 08:34
  • About `filter_var()`: nobody ever said that you should not validate an email address with a regex. The fact is that writing a regex for this task that is *right* is very difficult and requires a lot of research effort. So there are thousands of terrible implementations out there. That is why you should simply use `filter_var()`. – kapa May 05 '12 at 08:35
  • If you're dealing with XHTML, a normal XML parser will work just fine. – Niko May 05 '12 at 08:36
  • I think you should check out the article [How Browsers Work: Behind the Scenes of Modern Web Browsers](http://www.html5rocks.com/en/tutorials/internals/howbrowserswork/#HTML_Parser). It's a lengthy read, but well worth your time. Specifically, the HTML Parser section. – Sampson May 05 '12 at 08:37
  • @JonathanSampson: I've read it, and it's very good. If you could write a full answer on it, I'll make sure to accept it :) – Madara's Ghost May 05 '12 at 09:05
  • It could just as well be a highly optimized C token parser. Anyway, since it's in C/C++, it would be many times more efficient than a whole parser written in PHP doing the same thing. – thevikas May 05 '12 at 13:43
  • @bažmegakapa even `filter_var()` for email validation isn't (/ wasn't) right... – PeeHaa May 05 '12 at 17:16
  • @RepWhoringPeeHaa But that can be fixed. In one place. Once. And there is intention to fix it. – kapa May 05 '12 at 17:19
  • @bažmegakapa Yup, but as long as it isn't fixed and you need to validate obscure email addresses, you cannot use it. – PeeHaa May 05 '12 at 17:21
  • Searching for "DOMDocument" on PHP OpenGrok suggests searching for various `dom_document_*` symbols, such as [`dom_document_save`](http://lxr.php.net/search?defs=dom_document_save&project=PHP_5_4), which leads to the [DOM extension source](http://svn.php.net/viewvc/php/php-src/trunk/ext/dom/). – outis May 05 '12 at 17:24
  • While I've understood and learned what I wanted to learn, I'm waiting for a full answer to accept. If anyone is in the mood to answer and get a free +25 from me, be my guest :) – Madara's Ghost May 05 '12 at 17:27
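
As a companion to Oded's tokenizer comment above, here is a deliberately tiny, hypothetical PHP sketch. It only splits markup into tag and text tokens; a real HTML tokenizer is a full state machine that also handles attributes, entities, comments, and error recovery.

<?php
// Toy tokenizer: split markup into "tag" and "text" tokens by scanning for < and >.
// Illustration only; not a real HTML parser.
function tokenize($html) {
    $tokens = array();
    $length = strlen($html);
    $pos = 0;

    while ($pos < $length) {
        if ($html[$pos] === '<') {
            $end = strpos($html, '>', $pos);   // end of the tag
            if ($end === false) {
                break;                         // malformed input: give up
            }
            $tokens[] = array('type' => 'tag', 'value' => substr($html, $pos, $end - $pos + 1));
            $pos = $end + 1;
        } else {
            $end = strpos($html, '<', $pos);   // text runs until the next tag
            if ($end === false) {
                $end = $length;
            }
            $tokens[] = array('type' => 'text', 'value' => substr($html, $pos, $end - $pos));
            $pos = $end;
        }
    }
    return $tokens;
}

print_r(tokenize('<p>Hello <b>world</b></p>'));

A DOM parser then consumes a token stream like this and builds the tree, pushing and popping elements on a stack as opening and closing tags come by.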

2 Answers


I think you should check out the article [How Browsers Work: Behind the Scenes of Modern Web Browsers](http://www.html5rocks.com/en/tutorials/internals/howbrowserswork/#HTML_Parser). It's a lengthy read, but well worth your time. Specifically, the HTML Parser section.

While I cannot do the article justice, perhaps a cursory summary will be enough to hold you over until you have the time to read and digest that masterpiece. I must admit, though, that I am a novice in this area with very little experience. Even though I've developed for the web professionally for about 10 years, the way in which the browser handles and interprets my code has long been a black box to me.

HTML, XHTML, CSS or JavaScript - take your pick. They all have a grammar, as well as a vocabulary. English is another great example. We have grammatical rules that we expect people, books, and more to follow. We also have a vocabulary made up of nouns, verbs, adjectives and more.

Browsers interpret a document by examining its grammar as well as its vocabulary. When they come across items they ultimately don't understand, they will let you know (raising exceptions, etc.). You and I do the same in common speech.

I love StackOverflow, but if I could changed one thing it would be be absolutamente broken...

Note in the example above how you immediately start to pick apart the words and relationships between words. The beginning makes complete sense, "I love StackOverflow." Then we come to "...if I could changed," and we immediately stop. "Changed" doesn't belong here. It's likely the author meant "change" instead. Now the vocabulary is right, but the grammar is wrong. A little later we come across "be be" which may also violate a grammatical rule, and just a bit further we encounter the word "absolutamente", which is not part of the English vocabulary - another mistake.

Think of all of this in terms of a DOCTYPE. Right now I have open on my second monitor the source behind the XHTML 1.0 Strict DOCTYPE. Among its internals are lines like the following:

<!ENTITY % heading "h1|h2|h3|h4|h5|h6">

This defines the heading entities. And as long as I adhere to the grammar of XHTML, I can use any one of these in my document (<h1>Hello World</h1>). But if I try to make one up, say H7, the browser will stumble over the vocabulary as "foreign," and inform me:

"Line 7, Column 8: element "h7" undefined"

Perhaps while parsing the document we come across <table>. We know that we're now dealing with a table element, which has its own set of vocabulary such as tbody, tr, etc. As long as we know the language, the grammar rules, and so on, we know when something is wrong. Returning to the XHTML 1.0 Strict DOCTYPE, we find the following:

<!ELEMENT table
     (caption?, (col*|colgroup*), thead?, tfoot?, (tbody+|tr+))>
<!ELEMENT caption  %Inline;>
<!ELEMENT thead    (tr)+>
<!ELEMENT tfoot    (tr)+>
<!ELEMENT tbody    (tr)+>
<!ELEMENT colgroup (col)*>
<!ELEMENT col      EMPTY>
<!ELEMENT tr       (th|td)+>
<!ELEMENT th       %Flow;>
<!ELEMENT td       %Flow;>

Given this reference, we can keep a running check against whatever source we're parsing. If the author writes tread, instead of thead, we have a standard by which we can determine that to be in error. When issues are unresolved, and we cannot find rules to match certain uses of grammar and vocabulary, we inform the author that their document is invalid.
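
As a toy illustration of such a running check, here is a sketch that hand-copies a fragment of the content model above into an array (a real validator derives these rules from the DTD itself):

<?php
// Which child elements the grammar allows, copied by hand from the DTD fragment.
$allowedChildren = array(
    'table' => array('caption', 'col', 'colgroup', 'thead', 'tfoot', 'tbody', 'tr'),
    'tr'    => array('th', 'td'),
);

function checkChildren(DOMElement $element, array $allowedChildren) {
    $allowed = isset($allowedChildren[$element->tagName])
        ? $allowedChildren[$element->tagName]
        : null;

    foreach ($element->childNodes as $child) {
        if (!$child instanceof DOMElement) {
            continue;                       // skip text nodes, comments, etc.
        }
        if ($allowed !== null && !in_array($child->tagName, $allowed, true)) {
            echo "'{$child->tagName}' is not allowed inside '{$element->tagName}'\n";
        }
        checkChildren($child, $allowedChildren);
    }
}

$dom = new DOMDocument();
$dom->loadXML('<table><tread><tr><td>x</td></tr></tread></table>');
checkChildren($dom->documentElement, $allowedChildren);

PHP's DOMDocument::validate() performs the real version of this check, validating the whole tree against the DTD named in the document's DOCTYPE.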

I am by no means doing this science justice; however, I hope this serves, if nothing more, as enough to encourage you to sit down and read the article referenced at the beginning of this answer, and perhaps to study the various DTDs we encounter day to day.

Sampson

The good news is, you don't need to reinvent the wheel here. PHP's DOMDocument extension is built on the libxml library, and its source code is available. I suggest you have a look there.
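
For completeness, a minimal usage sketch of that extension (the markup and query here are made up for illustration):

<?php
libxml_use_internal_errors(true);          // keep warnings quiet for sloppy HTML

$dom = new DOMDocument();
$dom->loadHTML('<div><a href="http://example.com">Example</a></div>');

// Once libxml has built the tree, you query the tree instead of the raw string.
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//a[@href]') as $link) {
    echo $link->getAttribute('href'), "\n";
}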

And by the way, regular expressions are not always wrong, but you need to use them right; otherwise you go straight into hell's kitchen, become a kitty serial killer, or pay Cthulhu (or whatever that guy is called) a visit. I therefore suggest the following read: REX: XML Shallow Parsing with Regular Expressions.

But if you do everything right, regular expressions can assist you a lot with parsing. You just need to know what you're doing.
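
As a hedged illustration of that last point (the pattern below is a drastically simplified cousin of the REX grammar, not the real thing), a single expression can at least split markup into tag and text tokens:

<?php
// Shallow lexing with one regex: each match is either a <...> tag or a run of text.
$html = '<p>Hello <b>world</b>!</p>';

preg_match_all('/<[^>]*>|[^<]+/', $html, $matches);

print_r($matches[0]);
// Roughly: <p>, Hello , <b>, world, </b>, !, </p>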

hakre
  • Like I've stated, I'm not trying to reinvent the wheel, I'm just curious about how the wheel works. Also, I tried to find the source for DOMDocument and couldn't find it. – Madara's Ghost May 05 '12 at 17:15
  • I linked the source of libxml. See also the description of the architecture on their website; the project is quite large. There is also the htmlpurifier project, which is an HTML parser written in PHP, IIRC. – hakre May 05 '12 at 17:47