0

How to use the concrete syntax tree to parse a file and generate the abstract syntax tree?

I came across with concrete syntax trees on this blog post about ungrammar. But I can't wrap my head around on how to build the parser.

Steven Barragán
  • 1,144
  • 1
  • 11
  • 21
  • What language are you trying to parse? Rust itself? If so, is there some reason you don't want to use [syn](https://crates.io/crates/syn)? – eggyal Sep 13 '21 at 21:08
  • I'm building my own from scratch. CST seems like a good way to start. – Steven Barragán Sep 13 '21 at 22:38
  • Your question doesn’t make a great deal of sense. A syntax tree (whether concrete or abstract) is the parsed representation of some document in your language. If you already have the CST for a document, then that document must have already been parsed (and obtaining an AST would merely be a case of “lowering” the CST in a transformation pass). So “using the CST to parse a file” is just nonsense—and it’s not clear whether you’re asking how to build a parser in order to generate CSTs/ASTs, or how to build a tool that can lower a CST into an AST? – eggyal Sep 14 '21 at 01:58
  • If you are writing a parser for Rust, I would first look [here](https://github.com/rust-lang/wg-grammar) and try to scrape a standard EBNF grammar from that. The parser implementation is [here](https://github.com/rust-lang/rust/tree/master/compiler/rustc_parse/src), which you could try to scrape, but would be more difficult--the parser includes semantic predicates. The doc for ungrammar says it's a grammar to describe the CFG types, but does not encode constraints on that type system (e.g., precedence in expressions). So, I would not start there. – kaby76 Sep 14 '21 at 03:37
  • I forgot to mention...if you are looking for a grammar for Rust, and don't care how it was exactly derived, or whether it is up-to-date, then you could use [this](https://github.com/antlr/grammars-v4/tree/master/rust), in Antlr4 syntax, and convert it into the appropriate form for another parser generator, combinator, or hand implementation. – kaby76 Sep 14 '21 at 03:48
  • See my SO answer on how to implement recursive descent parsers: https://stackoverflow.com/a/2336769/120163 – Ira Baxter Nov 01 '21 at 03:45

1 Answers1

1

A concrete syntax tree is just a lossless representation of source code in tree form. It's basically a superset of an abstract syntax tree, as it contains the same information with the relative same structure, but with the extra "trivia" information that an abstract syntax tree would throw away.

If you're familiar with more traditional formal parsing techniques, you might also have heard it called just a "parse tree," which would be output by a non-actions-based parser generator, which you would typically then post-process to an AST more amenable to later compiler passes.

A CST is closer to the AST in that it typically matches the semantic structure of the language more than the lexical structure, but ultimately they're all the same basic idea of a structure, just representing slightly different views of the parsed language.

So whether you're parsing to the formal parse tree, a CST, or directly to an AST (or even an IR bytecode), none of this has any direct impact on what parsing techniques you use, just on what structure you build up while parsing.

So your question boils down to the opinion question of "how should I parse source code," which is quite an open question. Parser combinators tend to be popular in Rust, but even just fixed-lookahead recursive descent is quite powerful and simple to do.

CAD97
  • 1,160
  • 12
  • 29
  • You need to be careful about the phrase "lossless". Most CST/ASTs don't capture whitespace content or token formatting information, or even the radix of a literal. Without capturing *every* source character, you can't be "lossless" (e.g., regenerate the original source exactly). The AST/CST can be *isomorphic*, and that's useful if nobody is going to look at code regenerated from the tree. If people do look at regenerated code, its a lot harder to capture enough to satisfy them. See my SO answer on "Compiling an AST back to source code": https://stackoverflow.com/a/5834775/120163 – Ira Baxter Nov 01 '21 at 03:49
  • @IraBaxter: yes with ASTs. Part of the point of ASTs is that they throw away useless information. But the point of [CSTs/LSTs](https://www.oilshell.org/blog/2017/02/11.html) in general and rowan (the CST the ungrammar is used for) in particular is that they *are* fully lossless, and represent the input file exactly. I know some people use CST to refer to an AST that represents "pure sugar" differences, or a parse tree rather than an AST, but the way I learned to use CST was in reference to a lossless syntax tree representation of the exact source code. – CAD97 Nov 04 '21 at 20:24