This question is about a remaining problem with the solution proposed in another Stack Overflow question about beginning-of-line keywords.

I am writing an ANTLR4 lexer and parser for a programming language in which a word is a keyword only if it is the first non-whitespace token on a line. Let me explain with an example. Suppose "bla" is a keyword; then in the following input:

foo bla
    bla foo foo
foo bla bla

the second "bla" (the one that starts line 2) should be recognized as a keyword, but the others should not.

In order to achieve this I have defined the following simple ANTLR4 grammar:

grammar foobla;

// PARSER

main
    : line* EOF
    ;

line
    : indicator text*
    ;

indicator
    : foo
    | bla
    ;

foo: FOO ;
bla: BLA ;
text: TEXT ;

// LEXER

WHITESPACE: [ \t] -> skip ;

fragment NL: [\n\r\f]+[ \t]* ;
fragment NONNL: ~[\n\r\f] ;

// Indicators
FOO: NL 'foo' ;
BLA: NL 'bla' ;

TEXT: NONNL+ ;

This is similar to the answer given in How to detect beginning of line, or: "The name 'getCharPositionInLine' does not exist in the current context".

Now my question: this works fine, except when the "bla" or "foo" keyword appears on the first line of the input program. No newline precedes it there, so the NL-prefixed FOO and BLA rules cannot match. I can think of two ways to solve this, but I don't know how to implement either:

  • Use something like a BOF (beginning of file) token. However, I can't find such a concept in the manual.
  • Use a hook to dynamically add a newline at the beginning of the input before parsing starts, preferably by specifying something in the g4 file itself. I couldn't find this in the manual either.

I don't want to write an extra application or wrapper that adds a newline to the input file just because of this.

Paul Jansen
  • Could you edit your question and add a bit more info to it? Does the lexer also produce NL tokens? Perhaps give a more realistic input example: IMO such simplified examples leave too much room for interpretation. – Bart Kiers Jan 27 '21 at 12:49
  • @BartKiers, I have added the entire grammar. Note that this is just a simple toy language to explain the situation. – Paul Jansen Jan 27 '21 at 23:17

1 Answer

Here's another idea:

In your BLA lexer rule, add a predicate that checks the end of the token stream (to which the BLA token has not yet been added) to see which line the last non-whitespace token was on. If that line differs from the current token's line, you know the BLA token really is a BLA token; otherwise set its type to IDENTIFIER.
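
The line-tracking logic behind this idea can be sketched outside ANTLR. The following is a plain Python simulation, not ANTLR code and not the grammar from the question: the function name `tag_tokens`, the `KEYWORDS` set and the `IDENTIFIER` type name are assumptions made for illustration. It remembers the line of the last emitted token and treats a keyword as a keyword only when its line differs from that one. Because the tracker starts at line 0, a keyword on the first line of the input is handled without a BOF token or a prepended newline.

```python
import re
from typing import List, Tuple

# Assumed keyword set, taken from the toy grammar in the question.
KEYWORDS = {"foo", "bla"}


def tag_tokens(source: str) -> List[Tuple[str, str, int]]:
    """Return (type, text, line) triples for whitespace-separated words.

    A word from KEYWORDS gets its keyword type only if no other token
    was emitted on the same line before it; otherwise it is retyped
    as IDENTIFIER.
    """
    tokens: List[Tuple[str, str, int]] = []
    last_token_line = 0  # no token seen yet, so line 1 also counts as "new"
    for line_no, line in enumerate(source.splitlines(), start=1):
        for match in re.finditer(r"\S+", line):
            word = match.group()
            first_on_line = line_no != last_token_line
            if word in KEYWORDS and first_on_line:
                token_type = word.upper()
            else:
                token_type = "IDENTIFIER"
            tokens.append((token_type, word, line_no))
            last_token_line = line_no
    return tokens


if __name__ == "__main__":
    sample = "foo bla\n    bla foo foo\nfoo bla bla"
    for token in tag_tokens(sample):
        print(token)
```

In an actual ANTLR lexer the same comparison would presumably live in a lexer member that records the line of the last emitted token; how to wire that up depends on the target language and is not shown here.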

Mike Lischke