Parsing: how can I generate a more concise AST?

Question

I'm developing a parser for learning purposes that follows some user-defined rules to create ASTs. During AST generation, however, I can't find a way to abstract the tree - all tokens/intermediate node types are maintaned in the final tree structure.

It's easier to explain with an example:

User-defined structures:

struct ASTExpr { };
struct ASTNumber : ASTExpr  { int value; };
struct ASTAddition : ASTExpr { ASTExpr* rhs, lhs; };

User-defined rules:

[0..9]              -> ASTNumber
ASTExpr [+] ASTExpr -> ASTAddition
ASTNumber           -> ASTExpr
ASTAddition         -> ASTExpr

Example input:

1 + 5

Example parsing process/output:

          (ASTExpr)
              |
        (ASTAddition)
              |
      /----------------\
      |       |        |
  (ASTExpr)  [+]   (ASTExpr)
      |                |
 (ASTNumber)      (ASTNumber)
      |                |
     [1]              [5]

As you can see the tree is very complex and not very "abstract". The (...) nodes are classes, the [...] nodes are tokens.

This is what I would prefer:

        (ASTAddition)
              |
      /----------------\
      |                |
 (ASTNumber)      (ASTNumber)

In the desired version of the tree, the tokens and the intermediate ASTExpr nodes are not present.

How can I make my AST tree more "abstract"? Do I have to change the ruleset? What is usually done to ensure that the final AST contains less noise?

Just leave out the unit rules, or at least the ones which don't do anything. Since you Addition class is derived from your Expr class, there is no need for the rule which reduce `ASTAddition` into `ASTExpr` to create a new AST node. — rici, Apr 29 '14 at 13:25
@rici: then I would need to create some kind of rule that determines whether a node is an expression (example: `return node.type == ASTAddition || node.type == ASTNumber;`) - is this the correct thing to do? Also, would this solution help me avoiding tokens in the AST? — Vittorio Romeo, Apr 29 '14 at 13:34
An ASTExpr node is always an expression, so I don't see why you would need such a rule. For simple purposes, class inheritance works just fine. (Of course, you'll need some virtual methods.) — rici, Apr 29 '14 at 14:24
This answer about differences between ASTs and CSTs might be helpful, because it discusses a standard scheme for producing "concise" ASTs: http://stackoverflow.com/a/1916687/120163 — Ira Baxter, May 01 '14 at 02:36

score 1 · Accepted Answer · answered Apr 29 '14 at 13:38

In the end, there's no unique grammar. Consider even the simple */+- grammar; do you want to specify that the two sides of + can be numbers, products or ratios? It adds up quickly:

1+1
1+1*1
1+1/1
1*1+1
1*1+1*1
1*1+1/1
1/1+1
1/1+1*1
1/1+1/1

Using an intermediate rule keeps the grammar smaller at the expense of a larger AST. But as you noted, you can trivially transform the AST, in effect transforming the underlying grammar as well.

You could cut out pretty much any class at will. [1] ASTAddition [5] would be a valid AST, too, for a grammar with a quite complex ASTAddition but few other rules. What makes sense for you?

Parsing: how can I generate a more concise AST?

1 Answers1