
I'm writing a parser for a language that looks like the following:

L00<<identifier>>
L10<<keyword>>
L250<<identifier>>
<<identifier>>

That is, each line may or may not start with a line number of the form Lxxx.. ('L' followed by one or more digits), followed by an identifier or a keyword. Identifiers are the standard [a-zA-Z_][a-zA-Z0-9_]*, and the number of digits following the L is not fixed. Spaces between the line number and the following identifier/keyword are optional (and not present in most cases).
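
To make the intended result concrete, here is a minimal Java sketch of how I expect a single line to be split. It is only an illustration of the format described above, not part of the grammar; the class name LineShape and the method split are hypothetical, and it assumes one line at a time with no trailing newline.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: splits one line into an optional line number and the
// identifier/keyword that follows it. Not generated by ANTLR.
final class LineShape {
    private static final Pattern LINE_NUM = Pattern.compile("L[0-9]+");
    private static final Pattern WORD = Pattern.compile("[a-zA-Z_][a-zA-Z0-9_]*");

    /** Returns {lineNumberOrNull, word}, or null if the line does not fit the format. */
    static String[] split(String line) {
        String lineNum = null;
        String rest = line;
        Matcher num = LINE_NUM.matcher(line);
        if (num.lookingAt()) {                       // 'L' + digits at the very start
            lineNum = num.group();
            rest = line.substring(num.end()).trim(); // spaces after the number are optional
        }
        return WORD.matcher(rest).matches() ? new String[] { lineNum, rest } : null;
    }

    public static void main(String[] args) {
        for (String line : new String[] { "L00foo", "L250 bar", "baz" }) {
            System.out.println(line + " -> " + Arrays.toString(split(line)));
        }
    }
}
```

The line number is tried first, so a word such as Lfoo (an 'L' not followed by a digit) is still treated as a plain identifier.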

My current grammar looks like:

// Parser rules
commands      : command*;
command       : LINE_NUM? keyword NEWLINE
              | LINE_NUM? IDENTIFIER NEWLINE;
keyword       : KEYWORD_A | KEYWORD_B | ... ;

// Lexer rules
fragment INT  : [0-9]+;
LINE_NUM      : 'L' INT;
KEYWORD_A     : 'someKeyword';
KEYWORD_B     : 'reservedWord';
...
IDENTIFIER    : [a-zA-Z_][a-zA-Z0-9_]*;

However, this results in every line that begins with a line number being tokenized as an IDENTIFIER instead of starting with a LINE_NUM token.

Is there a way to properly tokenize this input using an ANTLR grammar?

Harrison Paine

1 Answer

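ANTLR's lexer prefers the rule that matches the most input, so at the start of a line IDENTIFIER (which can consume the 'L', the digits, and everything after them) beats LINE_NUM. You need to add a semantic predicate to IDENTIFIER that disables it in exactly that situation: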

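IDENTIFIER
  : // Disable IDENTIFIER only at column 0 when the input starts with 'L' + digit,
    // so LINE_NUM wins there. getCharPositionInLine() is a method of the lexer itself.
    {getCharPositionInLine() != 0
      || _input.LA(1) != 'L'
      || !Character.isDigit(_input.LA(2))}?
    [a-zA-Z_] [a-zA-Z0-9_]*
  ;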
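
As a rough illustration of what the predicate decides, here is the same condition restated as plain Java, outside of ANTLR. The class IdentifierGuard and its parameters line and column are hypothetical stand-ins for the lexer's input stream and current column; this is a sketch of the logic, not generated lexer code.

```java
// Hypothetical stand-alone restatement of the predicate's condition.
final class IdentifierGuard {
    /** True if an IDENTIFIER may start at `column` of `line`. */
    static boolean identifierAllowed(String line, int column) {
        boolean atLineStart = column == 0;
        boolean startsWithL = column < line.length() && line.charAt(column) == 'L';
        boolean digitFollows = column + 1 < line.length()
                && Character.isDigit(line.charAt(column + 1));
        // Same shape as the grammar predicate: any one of the three being
        // false is enough to allow IDENTIFIER.
        return !atLineStart || !startsWithL || !digitFollows;
    }

    public static void main(String[] args) {
        System.out.println(identifierAllowed("L10foo", 0)); // column 0, 'L' + digit
        System.out.println(identifierAllowed("L10foo", 3)); // after the line number
        System.out.println(identifierAllowed("Lfoo", 0));   // 'L' not followed by a digit
    }
}
```

Only the combination "column 0, 'L', digit" blocks IDENTIFIER, which leaves LINE_NUM as the only rule that can match there.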

You could also avoid semantic predicates by using lexer modes.

//
// Default mode is active at the beginning of a line
//

LINE_NUM
  : 'L' [0-9]+ -> pushMode(NotBeginningOfLine)
  ;

KEYWORD_A : 'someKeyword' -> pushMode(NotBeginningOfLine);
KEYWORD_B : 'reservedWord' -> pushMode(NotBeginningOfLine);
IDENTIFIER
  : ( 'L'
    | 'L' [a-zA-Z_] [a-zA-Z0-9_]*
    | [a-zA-KM-Z_] [a-zA-Z0-9_]*
    )
    -> pushMode(NotBeginningOfLine)
  ;
NL : ('\r' '\n'? | '\n');

mode NotBeginningOfLine;

  NotBeginningOfLine_NL : ('\r' '\n'? | '\n') -> type(NL), popMode;
  NotBeginningOfLine_KEYWORD_A : KEYWORD_A -> type(KEYWORD_A);
  NotBeginningOfLine_KEYWORD_B : KEYWORD_B -> type(KEYWORD_B);
  NotBeginningOfLine_IDENTIFIER
    : [a-zA-Z_] [a-zA-Z0-9_]* -> type(IDENTIFIER)
    ;
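
To show the effect of the two modes without running ANTLR, here is a small hand-written Java sketch that mimics them for a single line. It is an approximation under stated assumptions: the class TwoModeSketch is hypothetical, the input is one line with no whitespace or newline, and only the two keywords from the question are known.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical simulation of the default mode and NotBeginningOfLine.
final class TwoModeSketch {
    private static final Pattern LINE_NUM = Pattern.compile("L[0-9]+");
    private static final Pattern WORD = Pattern.compile("[a-zA-Z_][a-zA-Z0-9_]*");

    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        boolean beginningOfLine = true; // corresponds to the default mode
        int pos = 0;
        while (pos < line.length()) {
            Matcher m = LINE_NUM.matcher(line).region(pos, line.length());
            String type = "LINE_NUM";
            // LINE_NUM is only a candidate while still at the beginning of the line.
            if (!beginningOfLine || !m.lookingAt()) {
                m = WORD.matcher(line).region(pos, line.length());
                if (!m.lookingAt()) {
                    throw new IllegalArgumentException("unexpected character at " + pos);
                }
                type = classify(m.group());
            }
            tokens.add(type + ":" + m.group());
            pos = m.end();
            beginningOfLine = false; // like pushMode(NotBeginningOfLine)
        }
        return tokens;
    }

    // An exact keyword match wins over IDENTIFIER, as in the grammar's rule order.
    private static String classify(String word) {
        if (word.equals("someKeyword")) return "KEYWORD_A";
        if (word.equals("reservedWord")) return "KEYWORD_B";
        return "IDENTIFIER";
    }

    public static void main(String[] args) {
        System.out.println(tokenize("L10someKeyword"));
        System.out.println(tokenize("L1L2"));
    }
}
```

Once the first token of a line has been produced, an 'L' followed by digits is just the start of an ordinary identifier, which is what the separate IDENTIFIER rule in NotBeginningOfLine expresses.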
Sam Harwell
  • Both methods look good, thanks! Any considerations I should take into account that would push me towards one or the other? – Harrison Paine Apr 01 '14 at 18:57
  • @HarrisonPaine The lexer interpreter isn't able to evaluate semantic predicates, but lexers in combined grammars can't have multiple modes. If it were me, I would use multiple modes, since I always separate my lexers and parsers anyway. – Sam Harwell Apr 01 '14 at 20:35
  • I implemented the second approach and realized that I was actually facing a different problem, that wasn't evident in the simplified problem. The actual language is nested in another format, which provides a listing of all defined identifiers. Using the method here: http://stackoverflow.com/a/6108549/2483451 I can just tokenize each identifier exactly, saving a lot of headaches. Still, thanks for answering the question as I posted it; you definitely helped get me out of a stall. – Harrison Paine Apr 03 '14 at 13:14
  • @280Z28: Is it planned to remove the restriction that disallows multiple lexer modes in combined lexer/parser grammars? Is there any advantage (besides simplicity) to using a combined grammar? – Onur Apr 04 '14 at 07:19
  • @Onur I never use combined grammars, because they allow you to accidentally define new literal tokens in parser rules, quickly leading to hard-to-find bugs. – Sam Harwell Apr 04 '14 at 13:25
  • @280Z28: Ok, but besides being able to create hard-to-find bugs, is there any advantage in using combined grammars? – Onur Apr 04 '14 at 13:27