How to leave out unimportant nodes?

Question

I am using ANTLR 4.9.2 to parse a grammar that represents assembly instructions.

grammar IrohAsm;
main: line* | EOF;

line: (rangedec | instruction | comment)? EOL;

instruction: MNEMONIC firstoperand COMMA secondoperand;
rangedec : range assignment?;
firstoperand : range | mem | REGISTER;
secondoperand : range | mem | IMM | REGISTER;
range : IDENTIFIER OPENBRACKETS IMM CLOSEDBRACKETS;
assignment : EQUALS OPENCURL IMM (COMMA IMM)* CLOSECURL;
mem : AT IMM;
comment : '#' ~EOL*;

WHITESPACE : (' ') -> skip ;

// remember to append \n to input
EOL : '\n';

OPENCURL : '{';
CLOSECURL : '}';
OPENBRACKETS : '[';
CLOSEDBRACKETS : ']';
COMMA : ',';
EQUALS : '=';
AT : '@';
MNEMONIC : ('jmp' | 'add' | 'sub' | 'jez' | 'mov' | 'wrt' | 'get');
REGISTER: ('ab' | 'bb' | 'cb' | 'db');
IMM : DIGITS RADIX?;
RADIX : ('d' | 'b' | 'h');
DIGITS : [0-9]+;
IDENTIFIER: ([a-zA-Z0-9] | '$' | '_' | '\u00C0'..'\uFFFF')+ ;

The grammar works fine, but generates trees like the following;

Parse Tree Example

when given the following input:

mov ab,ab

As you can see, COMMA is included as one of the children of instruction. Its placement is important for the language, but I don't really care about it after parsing. Is there some way I could leave it off the final tree entirely? And if so, would this be a change to the grammar, or my code to parse the tree?

Removing Extraneous Nodes

My current code to get the tree:

CharStream inputStream = CharStreams.fromFileName("src/test/assembly/cool.asm");
IrohAsmLexer lexer = new IrohAsmLexer(inputStream);
IrohAsmParser parser = new IrohAsmParser(new CommonTokenStream(lexer));
ParseTree parseTree = parser.main();

I'd like to start here by asking: why do you want to remove those nodes? Do they harm you in any way? Why can't you just ignore them? — Mike Lischke, Oct 27 '21 at 08:43
I’ll “pile on” to Mike’s comment…. Once you see how ANTLR optimizes memory utilization and understand listeners and visitors well, these “extraneous” node rather disappear from view. Many people (myself included), when they see the tree structures, have an instinct to clear the clutter. That’s generally a good instinct, but you could go to a lot of work and find that you’ve saved yourself, just a minor annoyance of seeing a few “unnecessary” fields. — Mike Cargal, Oct 27 '21 at 12:01
Extra nodes result in a larger parse tree, which puts pressure on memory. In Antlr4, there are two ways around this, but both options are painful. (1) You delete them after the entire parse is done (using XPath/delete node traversal, standard Antlr4 listener or visitor, or your own cooked up traverser), or during the parse (which I haven't tried, but seems possible but dangerous). (2) Use `parser.BuildParseTree = false;`, and use the `ruleReturns` syntax to construct an alternative parse tree. The `^ ! ->`-syntax is not in Antlr4, but I am thinking of writing a generator to reintroduce. — kaby76, Oct 27 '21 at 12:09
"but I am thinking of writing a generator to reintroduce": once every now and then I'm thinking the same thing @kaby76! :) — Bart Kiers, Oct 27 '21 at 12:18
Thanks everyone for your help, I plan to write a translator using the listener pattern. Memory efficiency isn't really a priority here as it's for a hobby project, convenience and brevity of code is my main concern. The listener pattern seems especially well suited to my task. — Michael, Oct 27 '21 at 23:19
I started thinking you could play with lexer modes to exclude the "," from the token stream to the parser. But it blurs/crosses a line between lexer and parser to put this complexity into the lexer. I was considering having MNEMONICnn where nn was the operand count as the lexer could be told how many operands to expect. Then the tokens could be retyped as MNEMONIC with ->type(MNEMONIC) so the parser could just be getting MNEMONIC and Operands with the comma removed. — Ross Youngblood, Nov 17 '21 at 19:34

score 5 · Accepted Answer · answered Oct 27 '21 at 06:23

Your question boils down to: "how can I convert my parse tree to an abstract syntax tree?". The simple answer to that is: "you can't" :). At least, not using a built-in ANTLR mechanism. You'll have to traverse the parse tree (using ANTLR's visitor- or listener mechanism) and construct your AST manually.

The feature to more easily create AST's from a parse tree often pops up both in ANTLR's Github repo:

https://github.com/antlr/antlr4/issues/2428

as well as on stackoverflow:

Ira Baxter · Answer 2 · 2021-10-27T15:51:58.400

For ANTLR, as Bart says, you can't do it without doing it yourself. Essentially you have to write custom code to walk over the CST and construct a custom AST.

It doesn't have to be this way. You can construct parser generators that automatically build trees that:

Leave out nodes that don't carry any values (e.g., "comma", EOL)
Remove chains of unary productions (e.g., "firstoperand" and "secondoperand; these are short chains in your example but expression trees tend to have long ones)
Construct node-lists whenever right or left leaning trees are generated from grammar rules that represent lists, replacing the "spine" nodes and the list-ending node if any with a single list node. (Your example doesn't have any of these). This makes lists smaller and easier to manipulate.

The result gives something very close to classic ASTs without any manual effort; the resulting trees are often 30-50% of the size of the CSTs from which they are automatically derived.

This matters when

working with large grammars (you don't want have define the mapping of every single node by hand for 3500 rules)
working with grammars that are rapidly changing (as you try get a first working draft right)
building tools that process very large trees (representing very large [systems] of source code, say the Linux source tree) that have to touch billions of nodes to compute the answer, by avoiding lots of accesses to cache lines that would contain concrete nodes or list spines.

I'd provide the name of tool that does this that I designed and built, but some SO people hate it when I do so. You can check my profile.

score 1 · Answer 3 · answered Oct 27 '21 at 23:16

For anyone who is stuck on this, Bart Kiers's answer is a great jumping off point, some resources I found that explain the Listener/Visitor patterns well are:

This presentation from the CS dept at Taylor University and accompanying GitHub example(specifically the parts about the visitor pattern)
The Definitive ANTLR4 reference's example "4.3 Building a Translator with a Listener"

How to leave out unimportant nodes?

3 Answers3