1

I'm starting to explore ANTLR and I'm trying to match this format: (test123 A0020 )

Where:

  • test123 is an identifier of at most 10 characters (letters and digits)
  • A is a time indicator (for AM or PM): a single letter, either "A" or "P"
  • 0020 is a 4-digit representation of the time
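
The three rules can be restated as plain Java regular expressions, which also makes the overlap visible: a string such as A0020 satisfies both the identifier rule and the time rule. This is only a sketch; the class name `FormatSpec` and its methods are hypothetical, and accepting lower-case letters is an assumption based on the example test123.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of ANTLR: it only restates the format rules as regexes.
class FormatSpec {
    // Identifier: 1 to 10 letters or digits (lower case accepted as an assumption).
    static final Pattern IDENTIFIER = Pattern.compile("[A-Za-z0-9]{1,10}");
    // Time: one mode letter (A or P) followed by exactly four digits.
    static final Pattern TIME = Pattern.compile("[AP][0-9]{4}");

    static boolean isIdentifier(String s) {
        return IDENTIFIER.matcher(s).matches();
    }

    static boolean isTime(String s) {
        return TIME.matcher(s).matches();
    }

    public static void main(String[] args) {
        // A0020 reports true for both checks, which is the ambiguity in question.
        for (String s : new String[] {"test123", "A0020", "P12345", "ABCDEFGHIJK"}) {
            System.out.println(s + " identifier=" + isIdentifier(s) + " time=" + isTime(s));
        }
    }
}
```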

I tried this grammar :

IDENTIFIER
    : ( LETTER | DIGIT )+
    ;

INT
    : DIGIT+
    ;

fragment DIGIT
    : [0-9]
    ;

fragment LETTER
    : [A-Z]
    ;

WS : [ \t\r\n(\s)+]+ -> channel(HIDDEN) ;

formatter
    : '(' information ')'
    ;

information
    : information '/' 'A' INT
    | IDENTIFIER
    ;

How can I resolve the ambiguity and get the time format matched as 'A' INT rather than as IDENTIFIER? Also, how can I add checks such as token length to the identifier? I know that this doesn't work in ANTLR: IDENTIFIER : (DIGIT | LETTER ) {2,10}

UPDATE:

I changed the rules to use semantic checks, but I still have the same ambiguity between the identifier and the time format. Here are the modified rules:

formatter
    : information
    | information '-' time
    ;

time :
    timeMode timeCode;  

timeMode:   
    { getCurrentToken().getText().matches("[A,C]")}? MOD
;

timeCode: {getCurrentToken().getText().matches("[0-9]{4}")}?  INT;

information: {getCurrentToken().getText().length() <= 10 }? IDENTIFIER;

MOD:  'A' | 'C';

The problem shows up in the parse tree: A0023 is matched as timeMode, and the parser complains that timeCode is missing.

ps_messenger
  • Check this [question](http://stackoverflow.com/questions/3056441/what-is-a-semantic-predicate-in-antlr). Although you would have to convert your lexer rules to parser rules. The naive way is to write `IDENTIFIER: (LETTER | DIGIT) (LETTER | DIGIT) ...` ten times. – Mephy Mar 09 '16 at 12:15
  • Why not tokenize `A0023` as a single TIME token? – Bart Kiers Mar 10 '16 at 10:43
  • @BartKiers Because I want to include actions in the semantic rules later on without having to treat 'A0023' as a String (I would have to do extra operations to separate the timeMode and timeCode). I actually have the same problem in another parser for distance-unit recognition (format [M]\d{3} for a distance in meters or [F]\d{4} in feet). – ps_messenger Mar 10 '16 at 12:30
  • I'm assuming the following inputs are all identifiers: `P123`, `P12345`, `P`. Correct? – Bart Kiers Mar 10 '16 at 12:33
  • Correct. 1P23 and 12PP23 are also identifiers. – ps_messenger Mar 10 '16 at 12:35
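
For comparison, the splitting that the single-TIME-token suggestion in the comments would require can be sketched in a few lines of Java. The class `TimeToken` and its fields are hypothetical names used for illustration; the four digits are kept as a string because the question does not say how they should be interpreted.

```java
// Hypothetical value class: splits the text of a single TIME token such as "A0023".
class TimeToken {
    final char mode;    // 'A' or 'P'
    final String code;  // the four digits, kept as text

    private TimeToken(char mode, String code) {
        this.mode = mode;
        this.code = code;
    }

    // Validates the token text and separates the mode letter from the digits.
    static TimeToken parse(String text) {
        if (!text.matches("[AP][0-9]{4}")) {
            throw new IllegalArgumentException("not a time token: " + text);
        }
        return new TimeToken(text.charAt(0), text.substring(1));
    }

    public static void main(String[] args) {
        TimeToken t = TimeToken.parse("A0023");
        System.out.println("mode=" + t.mode + " code=" + t.code);
    }
}
```

The trade-off is the one raised in the comments: the lexer stays simple, but the separation into mode and code happens in host-language code instead of in the grammar.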

3 Answers

1

Here is a way to handle it:

grammar Test;

@lexer::members {
  private boolean isAhead(int maxAmountOfCharacters, String pattern) {
    final Interval ahead = new Interval(this._tokenStartCharIndex, this._tokenStartCharIndex + maxAmountOfCharacters - 1);
    return this._input.getText(ahead).matches(pattern);
  }
}

parse
 : formatter EOF
 ;

formatter
 : information ( '-' time )?
 ;

time
 : timeMode timeCode
 ;

timeMode
 : TIME_MODE
 ;

timeCode
 : {getCurrentToken().getType() == IDENTIFIER_OR_INTEGER && getCurrentToken().getText().matches("\\d{4}")}?
   IDENTIFIER_OR_INTEGER
 ;

information
 : {getCurrentToken().getType() == IDENTIFIER_OR_INTEGER && getCurrentToken().getText().matches("\\w*[a-zA-Z]\\w*")}?
   IDENTIFIER_OR_INTEGER
 ;

IDENTIFIER_OR_INTEGER
 : {!isAhead(6, "[AP]\\d{4}(\\D|$)")}? [a-zA-Z0-9]+
 ;

TIME_MODE
 : [AP]
 ;

SPACES
 : [ \t\r\n] -> skip
 ;

A small test class:

import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.CommonTokenStream;

public class Main {

    private static void indent(String lispTree) {

        int indentation = -1;

        for (final char c : lispTree.toCharArray()) {
            if (c == '(') {
                indentation++;
                for (int i = 0; i < indentation; i++) {
                    System.out.print(i == 0 ? "\n  " : "  ");
                }
            }
            else if (c == ')') {
                indentation--;
            }
            System.out.print(c);
        }
    }

    public static void main(String[] args) throws Exception {
        TestLexer lexer = new TestLexer(new ANTLRInputStream("1P23 - A0023"));
        TestParser parser = new TestParser(new CommonTokenStream(lexer));
        indent(parser.parse().toStringTree(parser));
    }
}

will print:

(parse 
  (formatter 
    (information 1P23) - 
    (time 
      (timeMode A) 
      (timeCode 0023))) <EOF>)

for the input "1P23 - A0023".
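
The lexer predicate in the grammar is what keeps A0023 from being swallowed as one IDENTIFIER_OR_INTEGER token: it refuses that rule whenever the next characters look like a time. A minimal sketch of the same lookahead on a plain String, outside ANTLR, may make the behavior easier to see. The class `LookaheadDemo` is hypothetical, and it takes the input and start index as parameters instead of reading them from the lexer.

```java
// Hypothetical stand-alone version of the grammar's isAhead helper.
class LookaheadDemo {
    // Same pattern as in the lexer rule: mode letter, four digits, then a non-digit or end of input.
    static final String TIME_AHEAD = "[AP]\\d{4}(\\D|$)";

    // Looks at up to maxChars characters starting at 'start' and tests them against the pattern.
    static boolean isAhead(String input, int start, int maxChars, String pattern) {
        int end = Math.min(input.length(), start + maxChars);
        return input.substring(start, end).matches(pattern);
    }

    public static void main(String[] args) {
        String input = "1P23 - A0023";
        // At index 0 the text ahead is "1P23 -", so the identifier rule is allowed.
        System.out.println(isAhead(input, 0, 6, TIME_AHEAD));
        // At index 7 the text ahead is "A0023", so the identifier rule is refused
        // and the lexer falls back to TIME_MODE for the single letter.
        System.out.println(isAhead(input, 7, 6, TIME_AHEAD));
    }
}
```

Note that a sixth digit defeats the pattern, so an input like A00234 is still treated as an ordinary identifier.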

EDIT

ANTLR can also display the parse tree in a UI component. If you do this instead:

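import java.util.Arrays;

// TreeViewer lives in org.antlr.v4.gui in the tested version (4.5.2-1).
import org.antlr.v4.gui.TreeViewer;
import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.CommonTokenStream;

public class Main {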

    public static void main(String[] args) throws Exception {
        TestLexer lexer = new TestLexer(new ANTLRInputStream("1P23 - A0023"));
        TestParser parser = new TestParser(new CommonTokenStream(lexer));
        new TreeViewer(Arrays.asList(TestParser.ruleNames), parser.parse()).open();
    }
}

a dialog displaying the parse tree will appear.

Tested with ANTLR version 4.5.2-1

Bart Kiers
  • Awesome, thanks. I tried it and it's working. PS: for some reason the parse tree generated with the ANTLR Eclipse plugin shows a different production (timeMode is missing and timeCode has A0023). – ps_messenger Mar 10 '16 at 14:31
  • @JCCNoobie I tested with the latest ANTLR version 4.5.2-1 – Bart Kiers Mar 10 '16 at 15:01
  • @JCCNoobie also see my EDIT. – Bart Kiers Mar 10 '16 at 15:11
  • This is so helpful thanks! The Parse Tree view from the ANTLR4 eclipse plugin doesn't show the same production so I will just use the TreeViewer instead. Thanks! – ps_messenger Mar 10 '16 at 16:21
  • I have a problem with the small test class. Every time I run it with the parser I get just the string (parse ), even though I'm passing a full string to parse and it's being processed by the parser. Any idea what the problem is? – ps_messenger Jun 01 '16 at 12:18
0

Using semantic predicates (check this amazing QA), you can define parser rules for your specific model, with logic checks that decide whether the input can be parsed as that rule. Note that the predicates below are attached to parser rules, not lexer rules.

information
    : information '/' meridien time
    | text
    ;
meridien
    : am
    | pm
    ;
am: {_input.LT(1).getText().equals("A")}? IDENTIFIER;
pm: {_input.LT(1).getText().equals("P")}? IDENTIFIER;
time: {_input.LT(1).getText().length() == 4}? INT;
text: {_input.LT(1).getText().length() <= 10}? IDENTIFIER;
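
The expressions inside the predicates are ordinary Java, so they can be tried outside ANTLR first. A minimal sketch, with a hypothetical class `PredicateChecks` whose methods take the token text as a plain String; it compares strings with `equals` and calls `length()` as a method, which is what Java requires for these checks.

```java
// Hypothetical helpers mirroring the four predicate checks on a token's text.
class PredicateChecks {
    static boolean isAm(String tokenText) {
        // equals compares contents; == would compare object identity.
        return "A".equals(tokenText);
    }

    static boolean isPm(String tokenText) {
        return "P".equals(tokenText);
    }

    static boolean isTime(String tokenText) {
        return tokenText.length() == 4;
    }

    static boolean isText(String tokenText) {
        return tokenText.length() <= 10;
    }

    public static void main(String[] args) {
        // A string built at run time is a different object than the literal "A".
        String runtimeText = new String("A");
        System.out.println(runtimeText == "A");   // identity comparison
        System.out.println(isAm(runtimeText));    // content comparison
        System.out.println(isTime("0020") + " " + isText("test123"));
    }
}
```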
Mephy
  • I don't have Antlr right now to test this, so this may be off a little. Also, I'm using Java in the predicates, you may have to adjust to your host language if it's not Java. – Mephy Mar 09 '16 at 12:24
  • Thanks. So if I understand correctly, if I only want a lexer for my grammar, I can't have specific rules checking length, and if I need to apply a regex check, it also has to be in the semantic predicates. – ps_messenger Mar 09 '16 at 12:59
  • @JCCNoobie You can do it in the lexer, it's just not as elegant. For example, you may just repeat yourself, like `R: [A-Z] | [A-Z] [A-Z] | [A-Z] [A-Z] [A-Z];` for up to three letters. – Mephy Mar 09 '16 at 13:14
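
Writing the repeated alternatives by hand gets tedious for a limit of 10, so the rule text could be generated. A small sketch, assuming a hypothetical helper `BoundedRule.boundedRule` that only builds the grammar text as a string and does not call ANTLR in any way:

```java
// Hypothetical generator for a lexer rule that repeats an element between min and max times.
class BoundedRule {
    static String boundedRule(String name, String element, int min, int max) {
        StringBuilder sb = new StringBuilder(name).append(": ");
        for (int n = min; n <= max; n++) {
            if (n > min) {
                sb.append(" | ");            // separate the alternatives
            }
            for (int i = 0; i < n; i++) {
                if (i > 0) {
                    sb.append(' ');
                }
                sb.append(element);          // n copies of the element
            }
        }
        return sb.append(';').toString();
    }

    public static void main(String[] args) {
        // Reproduces the three-letter example from the comment.
        System.out.println(boundedRule("R", "[A-Z]", 1, 3));
    }
}
```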
0
compileUnit
    :   alfaNum time
    ;

alfaNum : (ALFA | MOD | NUM)+;
time : MOD NUM+;

MOD:  'A' | 'P';
ALFA: [a-zA-Z];
NUM:  [0-9];

WS
    :   ' ' -> channel(HIDDEN)
    ;

You need to avoid the ambiguity by including MOD in the alfaNum rule.
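
Because every lexer rule here matches a single character, the token stream is easy to simulate by hand. The sketch below is a hypothetical stand-in (`CharTokens.tokenTypes`), not generated ANTLR code; it assumes the usual tie-break where the rule defined first wins, so A and P always become MOD and never ALFA.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of the one-character-per-token lexer in this grammar.
class CharTokens {
    static List<String> tokenTypes(String input) {
        List<String> types = new ArrayList<>();
        for (char c : input.toCharArray()) {
            if (c == 'A' || c == 'P') {
                types.add("MOD");    // MOD is defined before ALFA, so it wins for A and P
            } else if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) {
                types.add("ALFA");
            } else if (c >= '0' && c <= '9') {
                types.add("NUM");
            } else if (c == ' ') {
                continue;            // WS goes to the hidden channel, so the parser never sees it
            } else {
                throw new IllegalArgumentException("unexpected character: " + c);
            }
        }
        return types;
    }

    public static void main(String[] args) {
        System.out.println(tokenTypes("TEST1 A0020"));
    }
}
```

Since the space leaves no visible token behind, nothing in the stream marks where one word ends and the next begins, which is why alfaNum has to accept MOD and why word boundaries are lost to the parser.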

Divisadero
  • Note that "A" is an ambiguous lexeme: it could be either `MOD` or `ALFA`, and the lexer doesn't have the context to decide which one. – Mephy Mar 09 '16 at 13:13
  • It has, because alfaNum includes MOD. It is all right this way; I have tested and used it before. – Divisadero Mar 09 '16 at 13:23
  • I tried this solution. It resolves the ambiguity, since it matches each character separately. The problem is that if I have a list of information strings (like TEST1 TEST2 TEST3 A0020), they are all matched as one long alfaNum, so I can't, for example, store the name of each information item separately. http://s18.postimg.org/71vfzbepl/production.png – ps_messenger Mar 09 '16 at 14:31
  • I have to admit that I focused only on the core problem, but I don't understand how three words can match: the alfaNum rule won't match a space, and compileUnit (a very bad name :D) won't match more than two words. – Divisadero Mar 09 '16 at 14:42
  • Yeah, unfortunately you are right. I have to get back to this problem. Meanwhile, you can insert WS between alfaNum and time (removing the HIDDEN channel, of course). I will find a solution and let you know. – Divisadero Mar 09 '16 at 15:37