Possible Duplicate:
If you're not supposed to use Regular Expressions to parse HTML, then how are HTML parsers written?

My question is simple: How do current DOM parsers actually parse the DOM from a string (XML, HTML, or otherwise)?

I know you shouldn't parse HTML with regular expressions, but couldn't a DOM parser use regexes to match the patterns for opening and closing tags? Or is there a good single-pass algorithm for parsing the provided string as a character array?

zzzzBov
  • Depends on the parser implementation, doesn't it? – Ed S. Jan 09 '11 at 07:00
  • But to answer this exact question quickly: most probably do use regexes, but only **for tokenization** (e.g. recognizing opening and closing tags). –  Jan 09 '11 at 07:04
  • I missed that question somehow, and I've voted to close this copy down. – zzzzBov Jan 09 '11 at 07:08

2 Answers

Look at this:

[image]

Here is a good example.

Naveed

Well, you could start with a basic approach along the lines of:

http://www.blackbeltcoder.com/Articles/strings/parsing-html-tags-in-c

And then just expand it to store everything into the full DOM tree structure.
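
To make the "expand it into a tree" step concrete, here is a minimal sketch in Python (not the C# code from the linked article) of a single left-to-right pass over the characters that keeps a stack of open elements. The names `Node`, `parse`, and `VOID_TAGS` are made up for illustration, and the sketch assumes simplified markup: attributes are ignored, comments and doctypes are skipped naively, and a `>` inside an attribute value or comment would confuse it. Real HTML parsers handle far more cases than this.

```python
from dataclasses import dataclass, field

# Hypothetical, incomplete list of elements that never have a closing tag.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


@dataclass
class Node:
    tag: str                      # "#document", "#text", or an element name
    text: str = ""                # only used by "#text" nodes
    children: list = field(default_factory=list)


def parse(markup: str) -> Node:
    """Build a tree from simplified markup in one left-to-right scan."""
    root = Node("#document")
    stack = [root]                # stack[-1] is the innermost open element
    i, n = 0, len(markup)
    while i < n:
        if markup[i] == "<":
            end = markup.find(">", i)
            if end == -1:         # unterminated tag: keep the rest as text
                stack[-1].children.append(Node("#text", markup[i:]))
                break
            body = markup[i + 1:end].strip()
            i = end + 1
            if body.startswith("/"):
                # Closing tag: pop back to the nearest matching open element;
                # a stray closing tag with no match is ignored.
                name = body[1:].strip().lower()
                for depth in range(len(stack) - 1, 0, -1):
                    if stack[depth].tag == name:
                        del stack[depth:]
                        break
            elif body and body[0] not in "!?":
                # Opening tag: the first token is the name, attributes are dropped.
                self_closing = body.endswith("/")
                parts = body.rstrip("/").split()
                if not parts:
                    continue
                node = Node(parts[0].lower())
                stack[-1].children.append(node)
                if not self_closing and node.tag not in VOID_TAGS:
                    stack.append(node)
        else:
            # Text run: everything up to the next "<" (or the end of input).
            end = markup.find("<", i)
            if end == -1:
                end = n
            text = markup[i:end]
            if text.strip():      # skip whitespace-only runs
                stack[-1].children.append(Node("#text", text))
            i = end
    return root
```

Calling `parse("<div><p>Hello <b>world</b></p><br/></div>")` gives a `#document` node whose only child is the `div`, which in turn holds the `p` and `br` elements. The stack is what turns a flat stream of tags into nesting: each opening tag becomes a child of whatever is on top of the stack, and each closing tag pops back to its match.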

Jonathan Wood