Custom ANTLR grammar not working for every input

Question

I am trying to write a grammar for our custom rule engine, which uses ANTLR (for parsing) and Pentaho Kettle (for executing the rules).

Valid inputs for the parser would be of the type:
(<Attribute_name> <Relational_Operator> <Value>) AND/OR (<Attribute_name> <Relational_Operator> <Value>)
e.g. PERSON_PHONE = 123456789

Here is my grammar:

grammar RuleGrammar;
options{
language=Java;
}

prog                : condition;

condition
                                :  LHSOPERAND RELATIONOPERATOR RHSOPERAND
                                ;

LHSOPERAND
                                :  STRINGVALUE
                                ;

RHSOPERAND
                                :  NUMBERVALUE    |
                                   STRINGVALUE
                                ;


RELATIONOPERATOR
                                :   '>'    |
                                     '=>'  |
                                     '<'   |
                                     '<='  |
                                     '='   |
                                     '<>'
                                ;

fragment NUMBERVALUE
                              : '0'..'9'+
                              ;

fragment STRINGVALUE
                              :  ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_')*
                              ;


fragment LOGICALOPERATOR
                              :  'AND' |
                                 'OR'  |
                                 'NOT'
                              ;

The issue I am facing is with comparisons against a string value: PERSON_NAME=1 is accepted by the grammar, but PERSON_NAME=BATMAN is not. I am using ANTLRWorks, and when I debug PERSON_NAME=BATMAN I get a MismatchTokenException for the RHS value.

Can anyone please point out where I am going wrong?

Are you sure this is the version you're executing? I'm not an ANTLR user, but I find it hard to see how this grammar could fail for that input. — 500 - Internal Server Error, Feb 21 '12 at 18:05
@Internal Server Error: If I remove `NUMBERVALUE |` and only keep `StringValue`, then ANTLRWorks throws an error `The following token definitions can never be matched because prior tokens match the same input: RHSOPERAND`. The strange part is if I replace `LHSOPERAND RELATIONOPERATOR RHSOPERAND` with `LHSOPERAND RELATIONOPERATOR LHSOPERAND`, it works fine — name_masked, Feb 21 '12 at 18:16

Bart Kiers · Accepted Answer · 2012-02-21T21:02:34.837

BATMAN is being tokenized as a LHSOPERAND token. You must realize that the lexer does not take into account what the parser "needs" at a particular time. The lexer simply tries to match as much as possible, and in case 2 (or more) rules match the same number of characters (LHSOPERAND and RHSOPERAND in your case), the rule defined first will "win", which is the LHSOPERAND rule.

EDIT

Look at it like this: first the lexer receives the character stream, which it converts into a stream of tokens. After all tokens have been created, the parser receives these tokens and tries to make sense of them. Tokens are not created during parsing (in parser rules), but before it.
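To make the tie-breaking concrete, here is a minimal Java sketch of the behaviour described above. It is a toy model, not ANTLR's actual implementation: the class name ToyLexer, the tokenize method, and the regular expressions standing in for the question's lexer rules are all assumptions made for illustration. It tries every rule at the current position, keeps the longest match, and on a tie keeps the rule that was defined first.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy model of "longest match, first rule wins on a tie".
// Hypothetical helper for illustration only; this is NOT how ANTLR is implemented.
class ToyLexer {
    // Rules in definition order, approximating the question's grammar with the
    // fragment rules inlined ('=>' is kept exactly as the question wrote it).
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("LHSOPERAND", Pattern.compile("[a-zA-Z_]+"));
        RULES.put("RHSOPERAND", Pattern.compile("[0-9]+|[a-zA-Z_]+"));
        RULES.put("RELATIONOPERATOR", Pattern.compile("=>|<=|<>|>|<|="));
    }

    // Converts the whole input into tokens before any "parsing" could happen.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestRule = null;
            int bestLength = 0;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // Strictly greater: when two rules match the same number of
                // characters, the rule defined earlier is kept.
                if (m.lookingAt() && m.end() - pos > bestLength) {
                    bestRule = rule.getKey();
                    bestLength = m.end() - pos;
                }
            }
            if (bestRule == null) {
                throw new IllegalArgumentException("No rule matches at index " + pos);
            }
            tokens.add(bestRule + "(" + input.substring(pos, pos + bestLength) + ")");
            pos += bestLength;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("PERSON_NAME=BATMAN"));
        System.out.println(tokenize("PERSON_NAME=1"));
    }
}
```

In this model BATMAN comes out as LHSOPERAND because both operand rules match all six characters and LHSOPERAND is listed first, while 1 comes out as RHSOPERAND because only that rule matches digits. That mirrors why the parser rule LHSOPERAND RELATIONOPERATOR RHSOPERAND accepts PERSON_NAME=1 but rejects PERSON_NAME=BATMAN.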

A quick demo of how you could do it:

grammar RuleGrammar;

prog
 : condition EOF
 ;

condition
 : logical
 ;

logical
 : relational ((AND | OR) relational)*
 ;

relational
 : STRINGVALUE ((GT | GTEQ | LT | LTEQ | EQ | NEQ) term)?
 ;

term
 : STRINGVALUE
 | NUMBERVALUE
 | '(' condition ')'
 ;

GT          : '>';
GTEQ        : '>=';
LT          : '<';
LTEQ        : '<=';
EQ          : '=';
NEQ         : '<>';
NUMBERVALUE : '0'..'9'+;
AND         : 'AND';
OR          : 'OR';
STRINGVALUE : ('a'..'z' | 'A'..'Z' | '_')+;
SPACE       : ' ' {skip();};

(note that EQ and NEQ aren't really relational operators...)

Parsing input like:

PERSON_NAME = BATMAN OR age <> 42

would now result in the following parse:

[image: parse tree for the input above]

Again, thank you very much. I am trying to understand what you have mentioned above. So, when you say "You must realize that the lexer does not take into account what the parser "needs" at a particular time", do you mean that even though I have mentioned `RHSOPERAND`, for the input `PERSON_NAME = BATMAN`, the FIRST closest token match is actually `LHSOPERAND` and that is the reason it assumes the lexer rule as `LHSOPERAND RELATIONOPERATOR LHSOPERAND` — name_masked, Feb 21 '12 at 19:12
