1

For example, input = '(1+2)*3' .

tree is like that '(expr (expr ((expr (expr 1) + (expr 2)) ))*(expr 3))'

And then, I would like to hide or delete the '(' and ')' in the tree , they are no needed any more. I try to make it , but didn't.

expr : ID LPAREN exprList? RPAREN
     | '-' expr
     | '!' expr
     | expr op=('*'|'/') expr
     | expr op=('+'|'-') expr
     | ID
     | INT
     | LPAREN expr RPAREN //### Parens Here ####
     ;
 LPAREN : '(' ;
 RPAREN : ')' ;

What I want is** NOT** the following.

PAREN : ( '(' | ')' ) -> channel(HIDDEN)
fuis
  • 11
  • 3
  • What exactly do you mean by "removing parens from the tree"? Do you want to output the input without them, e.g `(1+2)*3 => 1+2*3` or do you want to accept input without parens? – Adrian Leonhard Apr 05 '15 at 10:46
  • @Adrian Leonhard No. I just try to make AST more simple, remove extra nodes from AST, not to change its own meaning. – fuis Apr 05 '15 at 12:26
  • @fuis what ANTLR generates for you is actually a [CST](http://en.wikipedia.org/wiki/Parse_tree), and you want an [AST](http://en.wikipedia.org/wiki/Abstract_syntax_tree). Don't confuse them! You can build an AST from the CST that's generated for you using a visitor but you'll have to define your AST node classes yourself. – Lucas Trzesniewski Apr 05 '15 at 13:34
  • @Lucas Trzesniewski Right.Now I realize that diff between CST and AST, thank you very much. – fuis Apr 06 '15 at 02:00

2 Answers2

2

Standard parser generator schemes separate parsing from tree building.

This allows one to design custom actions to build an AST, and fine tune its structure to the targeted langauge (including leaving out concrete syntax such as "parentheses").

The price is that one must not only specify the grammar, but one must also specify the rules for building the AST. And that makes defining a "grammar + tree builder" about twice as much work as just defining a grammar. When your grammars are tiny, this doesn't matter, but usually tiny grammars means "toy problem". With big real production gnarly grammars, this matters a lot; there's usually a bunch of initial churn in trying to get such grammars right and the AST building stuff just gets in the way during this phase. Clever people delay adding AST building rules till the churn phase is over, but that only partially works, and it turns out that you may want to reshape the grammar based on AST you want to build, so this delay actually increases the churn somewhat. One also pays a maintenance cost; if your grammar has any scale, you will change it, and then the AST building part must change, too.

My company builds a tool, the DMS Software Reengineering Toolkit, which contains a parser generator. We decided, the first to do so AFAIK, some 20 years ago, that this extra AST building step was too much work for the benefit for the many big grammars we expected to (and did) build. So we designed DMS to automatically build a concrete syntax tree as it parsed. Voila, write a grammar, get a parser and the tree is free. This decision has turned out to be a really good one.

The price is the final tree retains all the concrete syntax, e.g., the parentheses. While it may not look elegant, it turns out that this does not matter much in practice when manipulating trees (inspecting, traversing, analyzing, modifying, ...). We've traded a bit of inelegance for much easier tree building and grammar maintenance.

The ANTLR guy(!) decided for ANTRL4, unlike his previous ANTLR1/2/3 systems, to follow our lead and switch to building "ASTs" automatically from the grammar, as concrete syntax trees. (I don't know if you can actually write your own AST building rules to override the built-in feature for ANTLR4. Comments on this answer suggest that the way to get an AST from ANTLR4 is to walk the CST and build what you want. I'm not keen on that solution; it strikes me as paying the price for building and managing the AST, and also having the parsing overhead [time and space] of building the CST. If you only build small trees, maybe you don't care. For DMS, we regularly read thousands of files for processing together; space and time matter!)

For some discussion on how to make this a bit more elegant (effectively even more AST like), see my SO answer on ASTs vs. CSTs

Community
  • 1
  • 1
Ira Baxter
  • 93,541
  • 22
  • 172
  • 341
  • In the linked answer you say DMS "compresses" the CST. Granted, this can be a nice compromise between building an AST and working on the raw CST. But since ANTLR4 only generates a raw CST and associated visitors, I'd still recommend building a separate AST for any serious work. IMO the maintenance cost is smaller when you have to change the grammar in the end, because the AST changes much less than the CST on grammar refactorings. – Lucas Trzesniewski Apr 05 '15 at 11:47
  • @Lucas: So we disagree on the cost of building/maintaining the AST; we have extremely good experience here on the maintenance side, esp. when you understand that changing the grammar automatically changes the CST for you. Maybe your procedural code knows the AST structure in excruciating detail, and then it might matter. DMS avoids this to large extent by providing surface-syntax based patterns; you mostly don't have to know the nodes. This also begs the question: given ANTLR4, *how* do you propose to build the AST? Walk the CST and do it? Or does ANTLR4 still allow AST-building rules? – Ira Baxter Apr 05 '15 at 11:52
  • This is how I do it: write one CST visitor which creates the AST, this is easily done with what ANTLR provides. Then, write all your other visitors over the AST. When the grammar changes you have to tweak the CST visitor but most of the time you can get away with leaving the AST visitors alone. Of course this depends heavily on the nature of the work and the grammar. Anyway, I don't doubt you had good experience with your tool, I'm just recommending this approach for use with ANTLR since I think what you recommend for DMS won't work as well with ANTLR in the long run. – Lucas Trzesniewski Apr 05 '15 at 11:59
  • @LucasTrzesniewski: agreed, ANTLR isn't likely to work with a compressed CST. But one can still work with it. There is a real philosophical commitment here: if you insist on ASTs you have to build and maintain them. Now the only question is what works best in practice. – Ira Baxter Apr 05 '15 at 12:08
  • Exactly, and I did both. I've also used AST nodes which reference the related CST nodes to get the most out of both approaches. Everything depends on the task at hand. – Lucas Trzesniewski Apr 05 '15 at 12:14
  • Hi Ira, I didn't get the main point in your answer (despite that its very informative -- +1 from me). Are you trying to say that instead of trying to extract AST, we should directly work on the CST? (Seems like a good suggestion, just want to ensure I'm getting it right). – Jus12 Jan 12 '17 at 11:55
  • You got it. I've been building *one* tool for analyzing/transforming source code using "ASTs" for the last 20 years. It really uses a *compressed* CST, giving 99% of the value of ASTs with 0% of work to build ASTs. The 1% advantage of ASTs is they can supress parentheses in the tree (but you have to do work to make that happen). – Ira Baxter Jan 12 '17 at 14:06
0

To suppress useless tokens in the tree, use '!' symbol after corresponding tokens:

//remove ',' comma from the output
list: LISTNAME LISTMEMBER (','! LISTMEMBER)*;  

from http://meri-stuff.blogspot.com/2011/09/antlr-tutorial-expression-language.html

john k
  • 6,268
  • 4
  • 55
  • 59