4

After parsing HTML or XML file, we can get the DOM tree.

After parsing C, C++, or JavaScript, we can get the Syntax tree.

Note that the syntax tree is constructed based on the context-free grammar which specifies a valid C/C++/JS program.

But it seems the DOM tree is just a pure hierarchy structure specified only by the HTML/XML file. Is that true? Is that the reason that Schema Validation has been done after parsing? What is the fundamental difference between these two kinds of parse trees?

Ravindra S
  • 6,302
  • 12
  • 70
  • 108
JackWM
  • 10,085
  • 22
  • 65
  • 92
  • My understanding is that a DOM tree (at least the one MS offers for C#) is a tree but with the rather large chunks, e.g., "statement" as a leaf. With such large chunks, people can afford to be sloppy with the leaves and still do some useful work (e.g., "show class diagrams", which is modelled well enough). But since fine detail is absence, you can't do deep reasoning about the code. – Ira Baxter May 15 '12 at 23:10
  • Does you mean abstract syntax tree (AST) for syntax tree? Does the word parsing imply a strict definition of parsing as in parsing a context free grammar or does parsing have two different meanings for either parsing HTML and parsing a grammar? – Guy Coder May 16 '12 at 10:01
  • Syntax tree here can be either abstract (AST) or concrete. Yes, the "parsing" for HTML/XML has the different meaning with the "parsing" in general programming languages, like C/C++, Java. – JackWM May 24 '12 at 20:04

2 Answers2

1

Thank you for Ira Baxter and Guy Coder's interests.

I re-searched for a while, and compared these two cases. My impression is like this:

The "parsing" for XML can be either "validating parsing" or "non-validating parsing". For the later one, the parser does not check its syntax against the Document Type Definition (DTD) file. This parser only produces the hierarchy of the elements in the XML file. So it is lighter than the "validating parsing".

The "parsing" for C/C++/Java generates the syntax tree based on its context-free grammar. So, informally, it is more like the "validating parsing".

PS: I am not an expert, so welcome any comments if you found my understanding is not correct.

JackWM
  • 10,085
  • 22
  • 65
  • 92
  • DOM trees used by browsers for HTML model the concrete syntax all the way down to the individual tags. That's essentially a syntax tree for the HTML, but not for embedded scripting languages which are just represented as text blobs. See my comment on the question for DOM trees available to C# programs. – Ira Baxter May 24 '12 at 20:26
  • It seems like the DTD (or XML schema) specifies "extra grammar" for your specific XML document. But XML just by itself already has a grammar: start and end tags... etc. This makes XML quite flexible and _extensible_. Compare this to JSON schema + JSON data and JSON's own grammar. – CMCDragonkai Dec 07 '17 at 04:05
1

Like any other language, XML is described by a grammar. XML's grammar is rather simple (start-tags, end-tags, correct nesting). So the syntax tree might seem simple as well (just an hierarchy of elements). An XML schema is another grammar that describes an XML file's content.

So basically it's two parsers being invoked after each other. The first one verifies that all start-tags have an end-tag and that the nesting is right.

The second parser verifies that the XML file's content is structured according to the schema (grammar).. like that an element named "B" can only be contained within an element named "A".

This shouldn't be compared to parsing programming languages like C since you cannot change a programming language's syntax. If-statements can only appear within function bodies, not outside and you cannot change that. However in XML you can specify that "B"-elements can only appear within "A"-elements, or that "A"-elements can only appear within "B"-elements.. all by specifying the grammar of your XML file's content in form of a schema.

stmax
  • 6,506
  • 4
  • 28
  • 45
  • That is a nice statement of XML's "two-level" grammar. Also, we can view that the correct-nesting is implicitly defined/required in the DTD. I think the functionality of DTD is kind of like that of CFG. – JackWM May 25 '12 at 00:28