Understanding the context data structure in Antlr4

Question

I'm trying to write a code translator in Java with the help of Antlr4 and had great success with the grammar part so far. However I'm now banging my head against a wall wrapping my mind around the parse tree data structure that I need to work on after my input has been parsed.

I'm trying to use the visitor template to go over my parse tree. I'll show you an example to illustrate the points of my confusion.

My grammar:

grammar pqlc;

// Lexer

//Schlüsselwörter
EXISTS: 'exists';
REDUCE: 'reduce';
QUERY: 'query';
INT: 'int';
DOUBLE: 'double';
CONST: 'const';
STDVECTOR: 'std::vector';
STDMAP: 'std::map';
STDSET: 'std::set';
C_EXPR: 'c_expr';

INTEGER_LITERAL  : (DIGIT)+ ;
fragment DIGIT: '0'..'9';
DOUBLE_LITERAL : DIGIT '.' DIGIT+;

LPAREN          : '(';
RPAREN          : ')';
LBRACK          : '[';
RBRACK          : ']';
DOT             : '.';
EQUAL           : '==';
LE              : '<=';
GE              : '>=';
GT              : '>';
LT              : '<';
ADD             : '+';
MUL             : '*';
AND             : '&&';
COLON           : ':';

IDENTIFIER    :   JavaLetter JavaLetterOrDigit*;
fragment JavaLetter    :   [a-zA-Z$_]; // these are the "java letters" below 0xFF
fragment JavaLetterOrDigit    :   [a-zA-Z0-9$_]; // these are the "java letters or digits" below 0xFF
WS  
    :  [ \t\r\n\u000C]+ -> skip  
    ;
COMMENT
    :   '/*' .*? '*/' -> skip
    ;

LINE_COMMENT
    :   '//' ~[\r\n]* -> skip
    ;


// Parser

//start_rule: query;

query :
      quant_expr
      | qexpr+
      | IDENTIFIER // order IDENTIFIER and qexpr+?
      | numeral
      | c_expr //TODO

      ;

c_type : INT | DOUBLE | CONST;
bin_op: AND | ADD | MUL | EQUAL | LT | GT | LE| GE;


qexpr:
         LPAREN query RPAREN bin_op_query? 
         // query bin_op query
         | IDENTIFIER  bin_op_query? // copied from query to resolve left recursion problem
         | numeral bin_op_query?  // ^
         | quant_expr bin_op_query? // ^
           |c_expr bin_op_query?
           // query.find(query)
         | IDENTIFIER  find_query? // copied from query to resolve left recursion problem
         | numeral find_query?  // ^
         | quant_expr find_query?
           |c_expr find_query?
           // query[query]
          | IDENTIFIER  array_query? // copied from query to resolve left recursion problem
         | numeral array_query?  // ^
         | quant_expr array_query?
           |c_expr array_query?

     // | qexpr bin_op_query // bad, resolved by quexpr+ in query 
     ;

bin_op_query: bin_op query bin_op_query?; // resolve left recursion of query bin_op query

find_query: '.''find' LPAREN query RPAREN;
array_query: LBRACK query RBRACK;

quant_expr:
    quant id ':' query
          | QUERY LPAREN match RPAREN ':' query
          | REDUCE LPAREN IDENTIFIER RPAREN id ':' query
    ;

match:
         STDVECTOR LBRACK id RBRACK EQUAL cm
     | STDMAP '.''find' LPAREN cm RPAREN EQUAL cm
     | STDSET '.''find' LPAREN cm RPAREN
     ;

cm:
    IDENTIFIER
  | numeral
   | c_expr //TODO
  ;

quant :
          EXISTS;

id :
     c_type IDENTIFIER
     | IDENTIFIER // Nach Seite 2 aber nicht der Übersicht. Laut übersicht id -> aber dann wäre Regel 1 ohne +
   ;

numeral :
            INTEGER_LITERAL
        | DOUBLE_LITERAL
        ;
c_expr:
          C_EXPR
      ;

Now let's parse the following string:

double x: x >= c_expr

Visually I'll get this tree: tree

Let's say my visitor is in the visitQexpr(@NotNull pqlcParser.QexprContext ctx) routine when it hits the branch Qexpr(x bin_op_query).

My question is, how can I tell that the left children ("x" in the tree) is a terminal node, or more specifically an "IDENTIFIER"? There are no visiting rules for Terminal nodes since they aren't rules. ctx.getChild(0) has no RuleIndex. I guess I could use that to check if I'm in a terminal or not, but that still wouldn't tell me if I was in IDENTIFIER or another kind of terminal token. I need to be able to tell the difference somehow.

I had more questions but in the time it took me to write the explanation I forgot them :< Thanks in advance.

A common solution is to visit an [AST (abstract syntaxic tree)](https://en.wikipedia.org/wiki/Abstract_syntax_tree) instead of visiting the parse tree... However this feature was [removed since antlr4](http://stackoverflow.com/questions/15823333/how-can-i-build-an-ast-using-antlr4). Maybe you can use a king of [symbol table](https://en.wikipedia.org/wiki/Symbol_table) to know if `x` is an identifier :) Good luck ! — NiziL, Jun 12 '14 at 14:08
If I fetch the payLoad() of the Child, the token id is in there (among other stuff). So the information is saved in the datastructure, there must be a proper way to easily access it. ctx has functions for my different terminal nodes like ctx.IDENTIFIER() but I can't tell what exactly they are doing. It seems that the return value of ctx.IDENTIFIER() is null if there are no terminal nodes and the text of the nodes otherwise. It still wouldn't tell me /which/ of the children if there are several is the terminal node though. I find the whole data structure so confusing :\ — user2323596, Jun 12 '14 at 14:15
When I want to know if a subrule/terminal node is being visited at some Context, I usually just check if this subrule/terminal is null (not being visited) or not null (is being visited). But I also find this rather hacky, than a nice and clean way to get what you want... — schauk11erd, Jun 12 '14 at 15:10
Why are there no visits for terminals? I use them to an extent... `public T visitTerminal(TerminalNode node)` with TerminalNode having a type representing the token of the Lexer. — Uwe Allner, Jun 13 '14 at 07:53

score 3 · Accepted Answer · answered Jun 18 '14 at 08:05

You can add labels to tokens and access them/check if they exist in the surrounding context:

id :
     c_type labelA = IDENTIFIER
     | labelB = IDENTIFIER 
   ;

You could also do this to create different visits:

id :
     c_type IDENTIFIER    #idType1 //choose more appropriate names!
     | IDENTIFIER         #idType2
   ;

This will create different visitors for the two alternatives and I suppose (i.e. have not verified) that the visitor for id will not be called.

I prefer the following approach though:

id :
        typeDef
     |  otherId
     ;
typeDef: c_type IDENTIFIER;
otherId : IDENTIFIER ;

This is a more heavily typed system. But you can very specifically visit nodes. Some rules of thumb I use:

Use | only when all alternatives are parser rules.
Wrap each Token in a parser rule (like otherId) to give them "more meaning".
It's ok to mix parser rules and tokens, if the tokens are not really important (like ;) and therefore not needed in the parse tree.

Understanding the context data structure in Antlr4

1 Answers1