Is there a way to make ANTLR4 use enums for generated tokens?

Question

In ANTLR4 the generated lexer in Java contains a public field for each token where the type of the field is a simple 'int'. Is there a reason why ANTLR4 does not use enums instead, or is there an option to make it use enums?

This is a simplified example off the top of my head:

x.g4

A: 'a';
B: 'b';

XLexer.java

public class XLexer extends Lexer{
   public static final int A = 1, B = 2;
}

I would prefer for XLexer to instead contain

public class XLexer extends Lexer {
  public enum Token {
    A(1), B(2);

    public final int type; // the integer token type

    Token(int type) { this.type = type; }
  }
}

This is useful for debugging purposes when dumping tokens. Right now the token name is not printed; only the integer representation is shown.

[@-1,0:0='a',<1>,1:0]

A more readable version would have <A> instead of <1>:

[@-1,0:0='a',<A>,1:0]

It has been discussed before here: http://www.antlr3.org/pipermail/antlr-interest/2008-May/028432.html — Bart Kiers, Apr 02 '15 at 20:00
In light of that discussion, it's probably simplest for the generated lexer class to contain an array that maps the integer values of the tokens to their string names, as is already done for modeNames and ruleNames. There is a tokenNames array, but it contains a seemingly random set of characters. Maybe this is just a bug. — jonr, Apr 02 '15 at 23:35

score 4 · Answer 1 · answered Apr 04 '15 at 22:57

To convert an int token type to its symbolic value, just use

String tokenName = YourLexer.VOCABULARY.getSymbolicName(type);

answered Apr 04 '15 at 22:57

GRosenberg

score 1 · Answer 2 · answered Apr 02 '15 at 19:44

Here is my current workaround. I create a custom token and provide a TokenFactory to the XLexer via

lexer.setTokenFactory(new MyTokenFactory());

And I override the toString() method in my token class.

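import org.antlr.v4.runtime.CommonToken;

// Token is an interface in ANTLR4, so the custom token extends CommonToken.
public class MyToken extends CommonToken {
  // CommonToken has no no-arg constructor, so one must be declared.
  public MyToken(int type, String text) {
    super(type, text);
  }

  @Override
  public String toString() {
    StringBuilder out = new StringBuilder();

    out.append("[");
    out.append("'").append(getText()).append("'");
    out.append(" type ").append(getName()); // getName() is implemented by this class

    int start = getCharPositionInLine();
    int end = start + getText().length();
    out.append(" at ").append(getLine()).append(":").append(start).append("-").append(end);
    out.append("]");

    return out.toString();
  }
}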

Instead of showing the integer for the type, the class uses getName() to convert the integer to a string.

// inside the token class
private String getName() {
  switch (getType()) {
    case XLexer.A: return "A";
    case XLexer.B: return "B";
    default: throw new RuntimeException("unknown token " + getType());
  }
}

This produces the following output

['A' type A at 1:5-6]

This solution is somewhat brittle in that getName() has to be updated to remain in sync with the current tokens defined by the g4 file. There is no way to enforce this property, as the compiler cannot know if all the token types are handled in the switch inside getName().

score 1 · Answer 3 · edited May 23 '17 at 10:10

The reasons ANTLR4 uses ints instead of enums are simplicity and performance.

For debugging purposes, you can customize the string representation of tokens as follows:

Create your own token implementation that extends CommonToken, and define its toString() method as you like.
Create a TokenFactory implementation that returns tokens of your custom type.
Set the token factory on both the lexer and the parser.

See also:

How do I use custom tokens and contexts in ANTLR 4 on StackOverflow
CommonToken toString improvement on GitHub

EDIT, addressing the problem you've mentioned in your answer.

To avoid keeping token names in sync with .g4 manually, you may build a mapping from XLexer dynamically using reflection.
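
A minimal sketch of the reflection idea follows. It uses a hypothetical stand-in class, FakeLexer, rather than real ANTLR-generated code, and the names TokenNameReflection and namesByValue are made up for illustration. It collects every public static final int field and groups the field names by value. As the comments on this answer point out, this also picks up non-token constants such as lexer modes, so a value can map to more than one name.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

public class TokenNameReflection {
    /** Hypothetical stand-in for a generated lexer; not real ANTLR output. */
    public static class FakeLexer {
        public static final int A = 1, B = 2; // token types
        public static final int ZZ = 1;       // lexer mode; its value overlaps with A
    }

    /** Groups the names of all public static final int fields of cls by their value. */
    public static Map<Integer, SortedSet<String>> namesByValue(Class<?> cls) {
        Map<Integer, SortedSet<String>> result = new TreeMap<>();
        for (Field f : cls.getDeclaredFields()) {
            int m = f.getModifiers();
            boolean isConstant = Modifier.isPublic(m) && Modifier.isStatic(m) && Modifier.isFinal(m);
            if (f.getType() != int.class || !isConstant) {
                continue; // skip anything that is not a public int constant
            }
            try {
                // Static field, so the instance argument is null.
                result.computeIfAbsent(f.getInt(null), k -> new TreeSet<>()).add(f.getName());
            } catch (IllegalAccessException e) {
                throw new IllegalStateException(e);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Value 1 maps to both A (a token) and ZZ (a mode): reflection alone cannot tell them apart.
        System.out.println(namesByValue(FakeLexer.class));
    }
}
```

Because the fields carry no marker that separates token types from other int constants, a mapping built this way would need extra filtering before it could be trusted for token names.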

edited May 23 '17 at 10:10

Community

answered Apr 02 '15 at 19:46

Alex Shesterov

Can you say more about simplicity and performance, and how enums don't satisfy these properties? – jonr Apr 02 '15 at 19:56
Reflection will not work, as there are multiple kinds of fields in the XLexer class that are declared as 'public static final int', and only a subset of these are the token types. For example, lexer modes become int fields and have values that overlap the token values. Suppose there were an additional lexer mode named ZZ in the g4 grammar above. The XLexer class would have 'int A = 1; int ZZ = 1;' – jonr Apr 02 '15 at 21:44
@jonr, as I understand, you are using ANTLR extensively and define complex grammars. Did you try [ANTLRWorks2](http://tunnelvisionlabs.com/products/demo/antlrworks) for debugging? – Alex Shesterov Apr 03 '15 at 09:41
