How to specify beginning-of-line keywords in ANTLR grammars (which also works for the first input line)

Question

This is a follow-up to the solution proposed for another Stack Overflow question about beginning-of-line keywords; one problem with that solution remains.

I am writing an ANTLR4 lexer and parser for a programming language in which a word is a keyword only when it is the first non-whitespace token of a line. Let me explain this with an example. Suppose "bla" is a keyword. Then, in the following example:

foo bla
    bla foo foo
foo bla bla

the second "bla" should be recognized as a keyword but the others shouldn't.

In order to achieve this I have defined the following simple ANTLR4 grammar:

grammar foobla;

// PARSER

main
    : line* EOF
    ;

line
    : indicator text*
    ;

indicator
    : foo
    | bla
    ;

foo: FOO ;
bla: BLA ;
text: TEXT ;

// LEXER

WHITESPACE: [ \t] -> skip ;

fragment NL: [\n\r\f]+[ \t]* ;
fragment NONNL: ~[\n\r\f] ;

// Indicators
FOO: NL 'foo' ;
BLA: NL 'bla' ;

TEXT: NONNL+ ;

This is similar to the answer given in How to detect beginning of line, or: "The name 'getCharPositionInLine' does not exist in the current context".

Now my question. This works fine, except when the "bla" or "foo" keyword is used in the first line of the input program: both FOO and BLA require a preceding NL, and the first line has none. I can think of two ways to solve this, but I don't know how either can be achieved:

1. Use something like a BOF (beginning-of-file) token. However, I can't find such a concept in the manual.
2. Use a hook to dynamically add a newline at the beginning of the input file before parsing starts, preferably by specifying something in the g4 file itself. I couldn't find this in the manual either.

I don't want to write an extra application/wrapper to add a new line to the input file just because of this.

Could you edit your question and add a bit more info to it? Does the lexer also produce NL tokens? Perhaps give a more realistic input example: IMO the problem with such simplified examples is that they leave too much room for interpretation. – Bart Kiers, Jan 27 '21 at 12:49
@BartKiers, I have added the entire grammar. Note that this is just a simple toy language to explain the situation. – Paul Jansen, Jan 27 '21 at 23:17

score 0 · Answer 1 · answered Jan 27 '21 at 08:04

Here's another idea:

In your BLA lexer rule, add a predicate that checks the end of the token stream (to which the BLA token has not yet been added) to see which line the last non-whitespace token was on. If that line differs from the current token's line, you know the BLA token really is a BLA token; otherwise, set its type to IDENTIFIER.
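
The check itself can be sketched outside ANTLR. The following is plain Python, not ANTLR action code, and only illustrates the logic: `classify`, `KEYWORDS` and the whitespace-based word splitting are assumptions made for this sketch, and the token type names are borrowed from the toy grammar in the question. A word becomes a keyword only if the previously emitted non-whitespace token sits on a different line, or if no token has been emitted yet.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical keyword set, taken from the toy grammar in the question.
KEYWORDS = {"foo", "bla"}


def classify(source: str) -> List[Tuple[str, str, int]]:
    """Return (type, text, line) triples for the whitespace-separated words."""
    tokens: List[Tuple[str, str, int]] = []
    # Line of the last emitted non-whitespace token; None means "no token yet".
    last_line: Optional[int] = None
    for line_no, line in enumerate(source.splitlines(), start=1):
        for match in re.finditer(r"\S+", line):
            word = match.group()
            # True when the previous token is on another line, and also when
            # there is no previous token at all (first word of the input).
            starts_line = last_line != line_no
            if word in KEYWORDS and starts_line:
                kind = word.upper()
            else:
                kind = "IDENTIFIER"
            tokens.append((kind, word, line_no))
            last_line = line_no
    return tokens


if __name__ == "__main__":
    for token in classify("foo bla\n    bla foo foo\nfoo bla bla"):
        print(token)
```

Because "no previous token" compares unequal to every line number, the first line of the input needs no special case. In a real lexer the same comparison would be made against the last token the lexer emitted.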

Mike Lischke

Thanks. I need to sort out how to do this exactly. I'll let you know as soon as I have some results. – Paul Jansen Jan 27 '21 at 23:20
Unfortunately this is not going to work because there is no non-whitespace token before the first line... – Paul Jansen Jan 28 '21 at 10:30
Well, if there's no other content (whitespace or not) before the `BLA` token then it must be a keyword, no? :-D – Mike Lischke Jan 28 '21 at 16:07
