
As a developer, and I'm certain I'm far from alone here, I'm always curious to understand what's "under the hood". DOM parsers top that list of curiosities for me. We all know the famous post. I've even hacked together a bit of an "O RLY?" of my own, out of both temporary necessity and curiosity.

However, my need to meet the man behind the curtain remains unmet. How do DOM parsers, or any structured-document parsers for that matter, actually parse documents? As far as my intermediate web-application-developer understanding can tell, it's a combination of recursive string parsing and state-keeping logic, not unlike my own hackish attempt.

Magicians should never reveal their secrets, but seriously, where is he hiding the rabbit?

Dan Lugg

1 Answer


There's a well-developed theory of parsing, and untold numbers of tools to support it. In general, you look at each character, one at a time, and decide when the characters you've seen so far constitute a token. Then you look at the series of tokens, and decide when a sequence of tokens constitutes a higher-level grammatical construct -- in this case, an HTML element. As you recognize constructs, you build a tree of nodes to represent them -- in this case, the DOM tree.
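
To make those two stages concrete, here's a minimal sketch in Python for a toy HTML subset (no attributes, no self-closing tags, no error recovery). Every name in it is illustrative; a production DOM parser is enormously more involved, especially around malformed input:

```python
import re

# Stage 1: tokenization -- carve the character stream into tokens.
# A token is either an open tag, a close tag, or a run of text.
TOKEN_RE = re.compile(r"<(/?)([a-zA-Z]+)>|([^<]+)")

def tokenize(html):
    for close, name, text in TOKEN_RE.findall(html):
        if text:
            yield ("TEXT", text)
        elif close:
            yield ("CLOSE", name)
        else:
            yield ("OPEN", name)

# Stage 2: parsing -- turn the token stream into a tree of nodes.
class Node:
    def __init__(self, name):
        self.name, self.children = name, []

def parse(html):
    root = Node("#document")
    stack = [root]                    # chain of currently open elements
    for kind, value in tokenize(html):
        if kind == "OPEN":
            node = Node(value)
            stack[-1].children.append(node)
            stack.append(node)        # descend into the new element
        elif kind == "CLOSE":
            stack.pop()               # climb back up to the parent
        else:
            stack[-1].children.append(value)  # text becomes a leaf
    return root

tree = parse("<html><body><p>Hello</p></body></html>")
print(tree.children[0].name)  # html
```

The stack is the "state-keeping logic" you guessed at: it always holds the path from the document root down to the element currently being filled in.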

So are you familiar with context-free grammars, and compiler-compilers like yacc, bison, and their more modern counterparts? If you understand those, a DOM parser shouldn't be a mystery.
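
For a taste of what a context-free grammar looks like, here's the recursive heart of one for the same toy HTML, sketched in yacc/bison-style notation (illustrative only; `OPEN_TAG`, `CLOSE_TAG`, and `TEXT` are hypothetical token names, not taken from any real grammar file):

```
element : OPEN_TAG content CLOSE_TAG ;

content : /* empty */
        | content element
        | content TEXT
        ;
```

The self-reference -- `content` contains `element`, which contains another `content` -- is what lets the grammar describe arbitrarily deep nesting. (Checking that a CLOSE_TAG's name actually matches its OPEN_TAG is beyond a pure context-free grammar; generated parsers typically enforce that in a rule's action code.)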

Ernest Friedman-Hill
  • **Thanks Ernest Friedman-Hill;** I am not familiar with context-free grammars. Given that, I believe this is a good place for me to start reading. The concept of string tokenization is not completely foreign to me, though implementation details have eluded me. If you have any resources to suggest, I'd be happy to hear them. – Dan Lugg Apr 03 '11 at 20:19
  • To elaborate on my question a bit though: are regular expressions (or other similar text-matching technologies) suitable candidates for tokenization? Are they used in any production libraries for this purpose? – Dan Lugg Apr 03 '11 at 20:31
  • Generations of people have learned from "The Dragon Book": http://www.amazon.com/Compilers-Principles-Techniques-Tools-2nd/dp/0321486811. It's kind of pricey, though. There are plenty of lesser resources available here on the Internets: try the yacc/lex HOWTO: http://ds9a.nl/lex-yacc/cvs/lex-yacc-howto.html. – Ernest Friedman-Hill Apr 03 '11 at 20:35
  • To answer your elaboration: yes, absolutely. Regexes are used in tokenization, the recognition of local patterns like "'<' followed by a sequence of letters followed by '>'". But they're not powerful enough to recognize arbitrarily nested HTML elements; to do that, you can use similar underlying technology -- at least conceptually -- but at a higher level, using tokens as the fundamental entities instead of characters. A parser is to a token stream as a regular expression is to a character stream -- if that makes any sense. (See the short sketch after these comments.) – Ernest Friedman-Hill Apr 03 '11 at 20:39
  • Excellent, thank you very much for your suggestions, Ernest :) I'll be looking into *The Dragon Book* if this interest evolves from curiosity into a full-blown learning endeavor. – Dan Lugg Apr 03 '11 at 20:44
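
Following up on the regex discussion in the comments above, here's a minimal Python demonstration of both halves of the point: a regex comfortably recognizes the local shape of a tag, but on its own it cannot verify arbitrarily deep nesting -- that's the parser's job:

```python
import re

# The local pattern from the comment: '<' followed by letters followed by '>'.
open_tag = re.compile(r"<[a-zA-Z]+>")
print(open_tag.findall("<p>Hello <b>world</b></p>"))  # ['<p>', '<b>']

# What no single regex can do: confirm that every <b> is closed by a
# matching </b> at any nesting depth. Balanced nesting needs a stack,
# i.e., a parser, as in the tree-building sketch in the answer above.
```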