I need to define a language-parser for the following search criteria:

CRITERIA_1=<values-set-#1> AND/OR CRITERIA_2=<values-set-#2>;

Where <values-set-#1> can have values from 1-50 and <values-set-#2> can be from the following set (5, A, B, C) - case is not important here.

I decided to use ANTLR3 (v3.4) with C# output (CSharp3), and it worked smoothly until now. The problem is that parsing fails when I provide a value that belongs to both value sets (i.e. '5' in this case). For example, if I provide the following string

CRITERIA_1=5;

It returns the following error where the value node was supposed to be:

<unexpected: [@1,11:11='5',<27>,1:11], resync=5>

The grammar definition file is the following:

grammar ZeGrammar;

options {
    language=CSharp3;
    TokenLabelType=CommonToken;
    output=AST;
    ASTLabelType=CommonTree;
    k=3;
}

tokens 
{
    ROOT;
    CRITERIA_1;
    CRITERIA_2;
    OR = 'OR';
    AND = 'AND';
    EOF = ';';
    LPAREN = '(';
    RPAREN = ')';
}

public
start
  : expr EOF -> ^(ROOT expr)
  ;

expr
  : subexpr ((AND|OR)^ subexpr)*
  ;

subexpr
  :   grouppedsubexpr
    | 'CRITERIA_1=' rangeval1_expr -> ^(CRITERIA_1 rangeval1_expr)
    | 'CRITERIA_2=' rangeval2_expr -> ^(CRITERIA_2 rangeval2_expr)
  ;

grouppedsubexpr
  :  LPAREN! expr RPAREN!
  ;

rangeval1_expr
  :   rangeval1_subexpr
    | RANGE1_VALUES
  ;

rangeval1_subexpr
  : LPAREN! rangeval1_expr (OR^ rangeval1_expr)* RPAREN!
  ;

RANGE1_VALUES
  : (('0'..'4')? ('0'..'9') | '5''0')
  ;

rangeval2_expr
  :   rangeval2_subexpr
    | RANGE2_VALUES
  ;

rangeval2_subexpr
  : LPAREN! rangeval2_expr (OR^ rangeval2_expr)* RPAREN!
  ;

RANGE2_VALUES
  : '5' | ('a'|'A') | ('b'|'B') | ('c'|'C')
  ;

If I remove the value '5' from RANGE2_VALUES, it works fine. Can anyone give me a hint as to what I am doing wrong?

1 Answer

You must realize that the lexer does not produce tokens based on what the parser tries to match. In your case, the input "5" will always be tokenized as RANGE1_VALUES and never as RANGE2_VALUES: both rules can match this input, but RANGE1_VALUES is defined first, so it takes precedence over RANGE2_VALUES.
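
To make the precedence issue concrete, here is a minimal Python sketch of a "first matching rule wins" tokenizer. It is only a model of the behaviour described above, not how ANTLR works internally; the regular expressions are simplified stand-ins for the two lexer rules in the question, and `tokenize_value` is a hypothetical helper name.

```python
import re

# Lexer rules in definition order, mirroring the grammar in the question.
# The patterns are simplified stand-ins for the ANTLR lexer rules.
LEXER_RULES = [
    ("RANGE1_VALUES", re.compile(r"50|[0-4]?[0-9]")),
    ("RANGE2_VALUES", re.compile(r"5|[aAbBcC]")),
]


def tokenize_value(text):
    """Return the name of the first rule matching the whole text, or None.

    The parser's expectations play no role here: the token type is
    decided from the characters and the rule order alone.
    """
    for name, pattern in LEXER_RULES:
        if pattern.fullmatch(text):
            return name
    return None


print(tokenize_value("5"))  # RANGE1_VALUES, even after 'CRITERIA_2='
print(tokenize_value("A"))  # RANGE2_VALUES
```

Because "5" is claimed by the first rule, a parser rule that expects RANGE2_VALUES at that position never sees a token of that type.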

A possible fix would be to remove both RANGE1_VALUES and RANGE2_VALUES rules and replace them with the following lexer rules:

D0_4
  :  '0'..'4'
  ;

D5
  :  '5'
  ;

D6_50
  :  '6'..'9'           // 6-9
  |  '1'..'4' '0'..'9'  // 10-49
  |  '50'               // 50
  ;

A_B_C
  :  ('a'|'A') 
  |  ('b'|'B') 
  |  ('c'|'C')
  ;

and then introduce these new parser rules:

range1_values
  :  D0_4
  |  D5
  |  D6_50
  ;

range2_values
  :  A_B_C
  |  D5
  ;

and replace all references to RANGE1_VALUES and RANGE2_VALUES in your parser rules with range1_values and range2_values respectively.
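
The idea behind this fix is that every input maps to exactly one token type, and the overlap between the two value sets is expressed at the parser level instead. A rough Python sketch of that split follows; the regular expressions approximate the lexer rules above, and `token_type` and `accepts` are hypothetical helpers used only for illustration.

```python
import re

# Disjoint "lexer rules": no input can match more than one of them.
TOKEN_RULES = [
    ("D0_4", re.compile(r"[0-4]")),
    ("D5", re.compile(r"5")),
    ("D6_50", re.compile(r"[6-9]|[1-4][0-9]|50")),
    ("A_B_C", re.compile(r"[aAbBcC]")),
]

# "Parser rules": each value set is a union of token types.
# D5 appears in both, which is what lets '5' be valid on either side.
RANGE1_TOKENS = {"D0_4", "D5", "D6_50"}
RANGE2_TOKENS = {"A_B_C", "D5"}


def token_type(text):
    """Return the single token type that matches the whole text, or None."""
    for name, pattern in TOKEN_RULES:
        if pattern.fullmatch(text):
            return name
    return None


def accepts(criteria, text):
    """Check whether text is a valid value for CRITERIA_1 or CRITERIA_2."""
    allowed = RANGE1_TOKENS if criteria == "CRITERIA_1" else RANGE2_TOKENS
    return token_type(text) in allowed


print(accepts("CRITERIA_1", "5"))  # True
print(accepts("CRITERIA_2", "5"))  # True
print(accepts("CRITERIA_2", "7"))  # False
```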

EDIT

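Instead of trying to solve this at the lexer level, you could simply match any integer value and check inside the parser rule whether the value is the correct one (or lies in the correct range) using a semantic predicate. The predicates below are written as Java expressions; with the CSharp3 target they need to be C# expressions instead, e.g. int.Parse($INT.text) <= 50: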

range1_values
  :  INT {Integer.valueOf($INT.text) <= 50}?
  ;

range2_values
  :  A_B_C
  |  INT {Integer.valueOf($INT.text) == 5}?
  ;

INT
  :  '0'..'9'+
  ;

A_B_C
  :  'a'..'c'
  |  'A'..'C'
  ;
  • Argh ... I understand what you mean and it's a good idea. I admit it never crossed my mind ... but the sample posted here is just a small piece of my grammar - and to create small subsets for each [possible] interval would be a nightmare. Especially if this needs to be expanded in the future. – dcg Sep 13 '11 at 20:51
  • @dcg, well, there's nothing else you can do about it: you can't let the text `"X"` be tokenized as a `FOO` token at one point, and at some other point be tokenized as a `BAR` token. – Bart Kiers Sep 14 '11 at 06:34
  • Yes, it works! It's a much better approach than using multiple intervals! Thanks! :-) – dcg Sep 14 '11 at 08:25
  • @Bart - +1 a great answer; pleasing to see a good ANTLR question and answer on SO – Andras Zoltan Sep 14 '11 at 10:16