
I'm trying to write a context-sensitive lexer rule in ANTLR but can't get it to do what I need. The rule needs to match one of two alternatives based on characters found at the beginning of the rule. Below is a greatly simplified version of the problem.

This example grammar:

lexer grammar X;

options
{
  language = C;
}

RULE :
  SimpleIdent {ctx->someFunction($SimpleIdent);}
  (
    {ctx->test != true}?
     //Nothing
  | {ctx->test == true}?
     SLSpace+ OtherText
  )
  ;

fragment SimpleIdent  : ('a'..'z' | 'A'..'Z' | '_')+;
fragment SLSpace    : ' ';
fragment OtherText :  (~'\n')* '\n';

I would expect the lexer to exit this rule if ctx->test is false, ignoring any characters after SimpleIdent. Unfortunately, ANTLR tests the character after SimpleIdent before it tests the predicate, and thus always takes the second alternative if there is a space there. This is clearly shown in the generated C code:

// X.g:10:3: ({...}?|{...}? ( SLSpace )+ OtherText )
{
    int alt2=2;
    switch ( LA(1) )
    {
    case '\t':
    case ' ':
        {
            alt2=2;
        }
        break;

    default:
        alt2=1;
    }

    switch (alt2)
    {
    case 1:
        // X.g:11:5: {...}?
        {
            if ( !((ctx->test != true)) )
            {
                    //Exception
            }

        }
        break;
    case 2:
        // X.g:13:5: {...}? ( SLSpace )+ OtherText
        {
            if ( !((ctx->test == true)) )
            {
                   //Exception
            }

How can I force ANTLR to take a specific path in the lexer at runtime?

1 Answer

Use a gated semantic predicate instead of a validating semantic predicate [1]. A validating predicate throws an exception if its expression evaluates to false, whereas a gated predicate switches its alternative off during prediction. Also, make the "nothing" alternative the last one to match.
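
As a rough illustration of the difference between the two predicate kinds, the sketch below models the sub-rule decision from the question in plain C. It is hand-written, not ANTLR output: `LexerCtx`, `SubruleResult`, and both function names are made up for this example, and only the `' '` lookahead case is modelled.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the lexer context used in the question. */
typedef struct {
    bool test;
} LexerCtx;

/* Outcome of matching the optional sub-rule that follows SimpleIdent. */
typedef enum {
    MATCH_NOTHING    = 1,  /* empty alternative */
    MATCH_OTHER_TEXT = 2,  /* SLSpace+ OtherText alternative */
    MATCH_ERROR      = -1  /* predicate failed after the path was chosen */
} SubruleResult;

/* Validating predicate: the alternative is picked from lookahead alone,
 * and the predicate is only checked once that choice has been made. */
SubruleResult subrule_validating(const LexerCtx *ctx, int la1)
{
    int alt = (la1 == ' ') ? 2 : 1;

    if (alt == 1) {
        return (ctx->test != true) ? MATCH_NOTHING : MATCH_ERROR;
    }
    return (ctx->test == true) ? MATCH_OTHER_TEXT : MATCH_ERROR;
}

/* Gated predicate: the predicate takes part in picking the alternative,
 * so a false predicate switches the path off instead of raising an error. */
SubruleResult subrule_gated(const LexerCtx *ctx, int la1)
{
    if (la1 == ' ' && ctx->test) {
        return MATCH_OTHER_TEXT;
    }
    return MATCH_NOTHING;
}
```

With `test` false and a space as the next character, `subrule_validating` ends in `MATCH_ERROR`, while `subrule_gated` falls through to `MATCH_NOTHING`, which is the behaviour the question asks for.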

Also, OtherText matches everything SLSpace matches, which makes SLSpace+ OtherText ambiguous. Simply remove SLSpace+ from the alternative, or let OtherText start with something other than a ' '.

I'm not that familiar with the C target, but this Java demo should work just fine for C (after translating the Java code, of course):

grammar T;

rules
 : RULE+ EOF
 ;

RULE
 : SimpleIdent {boolean flag = $SimpleIdent.text.startsWith("a");}
   ( {!flag}?=> OtherText
   |            // Nothing
   )
 ;

Spaces 
 : (' ' | '\t' | '\r' | '\n')+ {skip();}
 ;

fragment SimpleIdent : ('a'..'z' | 'A'..'Z' | '_')+;
fragment OtherText   : (~'\n')* '\n';

If you now parse the input:

abcd efgh ijkl mnop
bbb aaa ccc ddd

you'll get the following parse:

[image: parse tree for the input above]

I.e., whenever a RULE starts with a lowercase "a", it does not match all the way to the end of the line; every other RULE does.
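
To make the resulting token boundaries concrete, here is a minimal hand-written scanner in plain C that imitates the RULE and Spaces rules of the demo grammar. It is only an approximation and not what ANTLR generates: `Token`, `is_ident_char`, and `scan_rules` are names invented for this sketch, and where the real lexer would report an error, the sketch simply stops or emits the identifier alone.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical token record: the slice [start, start + len) of the input. */
typedef struct {
    size_t start;
    size_t len;
} Token;

/* Mirrors the character set of the SimpleIdent fragment. */
int is_ident_char(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
}

/* Scans src and stores up to max RULE tokens in out.
 * Returns the number of tokens stored. Scanning stops early at a character
 * that is neither whitespace nor the start of an identifier. */
size_t scan_rules(const char *src, Token *out, size_t max)
{
    size_t n = strlen(src);
    size_t pos = 0;
    size_t count = 0;

    while (pos < n && count < max) {
        char c = src[pos];

        /* Spaces rule: skip blanks, tabs and line breaks. */
        if (c == ' ' || c == '\t' || c == '\r' || c == '\n') {
            pos++;
            continue;
        }
        if (!is_ident_char(c)) {
            break; /* simplification: a real lexer would report an error */
        }

        size_t start = pos;
        while (pos < n && is_ident_char(src[pos])) {
            pos++;
        }

        /* Mirrors: boolean flag = $SimpleIdent.text.startsWith("a"); */
        int flag = (src[start] == 'a');

        /* Mirrors the gated alternative {!flag}?=> OtherText: taken only
         * when the gate is open and a terminating '\n' is present. */
        if (!flag) {
            const char *nl = strchr(src + pos, '\n');
            if (nl != NULL) {
                pos = (size_t)(nl - src) + 1;
            }
        }

        out[count].start = start;
        out[count].len = pos - start;
        count++;
    }
    return count;
}
```

Run on the two input lines above (each ending in a newline), the sketch yields three tokens: `abcd` on its own, then `efgh ijkl mnop` and `bbb aaa ccc ddd`, each including its trailing newline.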

[1] What is a 'semantic predicate' in ANTLR?

Bart Kiers
  • This works for the simplified case in my question. Unfortunately, in a more complex rule (one with DFA prediction functions), ANTLR converts the gated semantic predicate to a validating semantic predicate. I can't seem to force ANTLR to always take a path based on a condition determined at runtime. –  Aug 08 '12 at 15:54
  • @Adam12, in that case, consider providing a rule that is a better representation of your grammar. – Bart Kiers Aug 08 '12 at 16:52
  • I see what ANTLR is doing now. It hoists the predicates into the prediction function, but checks them again in the rule function. –  Aug 09 '12 at 17:27