How to create an ANTLR4 grammar that will parse a date

Question

I want to parse a few date formats using the following ANTLR4 grammar.

grammar Variables;
//varTable : tableNameFormat dateFormat? ;
//tableNameFormat: (ID SEPERATOR);
dateFormat : YEAR UNDERSCORE MONTH UNDERSCORE TODAY
       | YEAR
       ;
YEAR : DIGIT DIGIT DIGIT DIGIT;                         // 4-digits YYYY
MONTH : DIGIT DIGIT;                                    // 2-digits MM
TODAY : DIGIT DIGIT ;                                     // 2-digits DD
UNDERSCORE: ('_' | '-' );
fragment
DIGIT : [0-9] ;
ID : [a-zA-Z][a-zA-Z0-9]? ;
WS  : [ \t\r\n]+ -> skip ;

This grammar should parse "2016-01-01" easily, but it gives an input mismatch error. Please help.

score 3 · Answer 1 · answered Feb 26 '16 at 14:17

It works in my case. I get a correct parse tree for the input 2016-01-01:

grammar date;

dateFormat : year UNDERSCORE month UNDERSCORE today
       | year
       ;

year : DIGIT DIGIT DIGIT DIGIT
     ;

month : DIGIT DIGIT
      ;

today : DIGIT DIGIT 
      ;

UNDERSCORE: ('_' | '-' );
DIGIT : [0-9] ;

But for the month I would use something like ('0' [1-9] | '1' [0-2]), because there are only 12 months.
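
To illustrate the range restriction outside the grammar, here is a minimal Java sketch of the same idea as a regular expression. The class and method names (`MonthCheck`, `isValidMonth`) are hypothetical and only serve the example; it assumes the month always arrives as exactly two characters.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the grammar above.
final class MonthCheck {
    // Two digits restricted to 01..12, mirroring ('0' [1-9] | '1' [0-2]).
    private static final Pattern MONTH = Pattern.compile("0[1-9]|1[0-2]");

    static boolean isValidMonth(String text) {
        // matches() requires the whole string to fit the pattern.
        return MONTH.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidMonth("01")); // true
        System.out.println(isValidMonth("12")); // true
        System.out.println(isValidMonth("13")); // false
    }
}
```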

xNappy

Yes, because you used parser rules for year, month, and day. Try writing them as lexer rules and it will fail. I am not saying this approach is bad, only that there is a difference between parser and lexer rules. – Divisadero Feb 26 '16 at 15:16
Done your way, I get the same result, but I do not want **year** as a parser rule; I want to use **YEAR** as a lexer rule. – raj garg Feb 26 '16 at 15:29
The problem is that you cannot. It works for YEAR because it is 4 digits long, but MONTH and DAY are ambiguous as lexer rules. 2012-01-20 will be tokenized as YEAR - MONTH - MONTH, and that is invalid. – Divisadero Feb 26 '16 at 15:32

score 2 · Accepted Answer · answered Feb 26 '16 at 12:48

For a task like this, a regex is a much better solution. But if this is a study project, here it is...

It is important to realize that the order of lexer rules is crucial. The lexer tries these rules against the input and, when several rules match the same text, the first one defined is used. The rules should therefore be written from most specific to least specific to avoid conflicts. For example, if your grammar has variable names and some keywords, the keyword rules should come first; otherwise the keywords will be tokenized as variable names.

There are many ways to solve this, but the best would be a single lexer rule such as DATE : NUM NUM NUM NUM '-' NUM NUM '-' NUM NUM; The MONTH and TODAY rules as you have them won't work, because they are ambiguous. How can the lexer tell whether a two-digit input is a month or a day?
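
To make the ambiguity concrete, here is a rough Java sketch of a toy tokenizer. It is not ANTLR; it only imitates the conflict resolution described here, under the assumption that the longest match wins and that a tie goes to the rule defined first. All names (`LexerOrderDemo`, `tokenize`, `separateRules`, `withDateRule`) are made up for the illustration, and the regexes stand in for the lexer rules.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical toy tokenizer, not ANTLR itself.
final class LexerOrderDemo {
    // Longest match wins; on a tie, the rule added to the map first wins.
    static List<String> tokenize(String input, Map<String, Pattern> rules) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestRule = null;
            int bestLength = 0;
            for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // Strictly greater: an earlier rule keeps the win on equal length.
                if (m.lookingAt() && m.end() - pos > bestLength) {
                    bestRule = rule.getKey();
                    bestLength = m.end() - pos;
                }
            }
            if (bestRule == null) {
                throw new IllegalArgumentException("No rule matches at index " + pos);
            }
            tokens.add(bestRule);
            pos += bestLength;
        }
        return tokens;
    }

    // The lexer rules from the question, in their original order.
    static Map<String, Pattern> separateRules() {
        Map<String, Pattern> rules = new LinkedHashMap<>();
        rules.put("YEAR", Pattern.compile("\\d{4}"));
        rules.put("MONTH", Pattern.compile("\\d{2}"));
        rules.put("TODAY", Pattern.compile("\\d{2}"));
        rules.put("UNDERSCORE", Pattern.compile("[_-]"));
        return rules;
    }

    // The same rules plus a single DATE rule covering the whole date.
    static Map<String, Pattern> withDateRule() {
        Map<String, Pattern> rules = new LinkedHashMap<>();
        rules.put("DATE", Pattern.compile("\\d{4}-\\d{2}-\\d{2}"));
        rules.putAll(separateRules());
        return rules;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("2016-01-01", separateRules()));
        // [YEAR, UNDERSCORE, MONTH, UNDERSCORE, MONTH]
        System.out.println(tokenize("2016-01-01", withDateRule()));
        // [DATE]
    }
}
```

With the separate rules, the last two digits come out as MONTH rather than TODAY, so a parser rule expecting TODAY at the end reports a mismatch. With one DATE rule the whole input becomes a single token.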

Divisadero

Got it. But how can a regex be used when I want to parse an input and determine which date format was given? For example, if the input is '20160101' the output should be 'YYYYMMDD', and if the input is '2016-01-01' the output should be 'YYYY-MM-DD'. – raj garg Feb 26 '16 at 15:20
I am sorry, but I do not understand. What is the difference in usage between ANTLR and regex here? – Divisadero Feb 26 '16 at 15:24
Sorry for giving the wrong use case. I chose ANTLR because I thought that during parsing, using a visitor, I could tell that the year format is **%Y**, the month format is **%m**, and the day format is **%d**. How would I do that with a regex? – raj garg Feb 26 '16 at 15:35
Now I think I understand. You need to distinguish among 3 cases (for example): 20120101, 2012-01-01 and 2012_01_01. You can of course do that with ANTLR; it is just my opinion that it is a bit overkill. What about trying one regex, checking whether it matches, then checking which separator character is present and splitting on it? That seems easier to me (only one method) compared to ANTLR (a few classes). – Divisadero Feb 26 '16 at 15:42
OK, yeah, regex will be better for sure then. So I can parse the input as follows, but how do I check for the separator character? `Pattern pattern= Pattern.compile("\\d{4}['-'|'_'|'']\\d{2}['-'|'_'|'']\\d{2}"); Matcher matcher = pattern.matcher(someInput); if(matcher.find()) { //how to check what is the splitting character ?? }` – raj garg Feb 26 '16 at 15:50
It won't be clean code, for sure. But since you now know that the input matches one of three possible patterns, all you need to do is check whether the input contains '_', '-', or neither. The rest is easy. By the way, why do you need to know the input format? – Divisadero Feb 26 '16 at 15:54
Actually, I want to generate a shell script command from it. Suppose the input is **rajgarg_2016_01_01**; then it should be translated into **var1='rajgarg'$(date +"%Y%m%d")** – raj garg Feb 26 '16 at 16:09
Once you have the regex, it is easy. Check the length of the input. If it is 8 characters, there is no separator; just use substring to get the parts. If it is 10 characters, use substring as well and take the 5th character as the separator. That is all. Not very readable code, but a well-named method will suffice :D – Divisadero Feb 26 '16 at 16:18
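
The approach from these comments (match one regex, then use the input length and the 5th character to find the separator) could be sketched roughly as follows in Java. The class and method names (`DateFormatDetector`, `detect`) are hypothetical, and the pattern is an assumption: it accepts only the three cases mentioned in the comments (no separator, '-', or '_', with the same separator used twice).

```java
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical helper; names and pattern are assumptions for illustration.
final class DateFormatDetector {
    // Accepts exactly YYYYMMDD, YYYY-MM-DD or YYYY_MM_DD.
    private static final Pattern DATE =
            Pattern.compile("\\d{8}|\\d{4}-\\d{2}-\\d{2}|\\d{4}_\\d{2}_\\d{2}");

    // Returns a strftime-style format such as "%Y-%m-%d",
    // or an empty Optional if the input is not one of the three shapes.
    static Optional<String> detect(String input) {
        if (!DATE.matcher(input).matches()) {
            return Optional.empty();
        }
        // 8 characters: no separator. 10 characters: the 5th character is the separator.
        String separator = input.length() == 8 ? "" : String.valueOf(input.charAt(4));
        return Optional.of("%Y" + separator + "%m" + separator + "%d");
    }

    public static void main(String[] args) {
        System.out.println(detect("20160101"));   // Optional[%Y%m%d]
        System.out.println(detect("2016-01-01")); // Optional[%Y-%m-%d]
        System.out.println(detect("2016_01_01")); // Optional[%Y_%m_%d]
        System.out.println(detect("2016-01_01")); // Optional.empty
    }
}
```

The returned string has the shape expected by `date +"..."`, so it could be dropped into the generated shell command.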

score 0 · Answer 3 · answered Jan 25 '19 at 08:08

I have never worked with ANTLR before, but when I searched GitHub to see whether someone had already done what I wanted, I found this library.

It is a library for parsing dates from a String.

https://github.com/masasdani/nangka

Add this project as a dependency of your project:

    <dependency>
        <groupId>com.masasdani</groupId>
        <artifactId>nangka</artifactId>
        <version>0.0.6</version>
    </dependency>

Sample usage:

    String exprEn = "a month later, 20-11-90";
    Nangka nangka = new Nangka();
    DateUnit dateUnit = nangka.parse(exprEn);
    for (Date date : dateUnit.getRelatedDates()) {
        System.out.println(date);
    }

Hope this helps someone like me who is searching.
