Can an element contain attributes when parsed by an ANTLR-generated parser? If so, how?

Question

I am following this tutorial and have successfully replicated its behavior, except that I am using ANTLR 4.7 instead of the 4.5 used in the tutorial.

I am trying to build a DSL for an expense tracker.

I was wondering whether each element can have attributes.

E.g. this is what it looks like now:

This is the code for the todo.g4 as seen in https://github.com/simkimsia/learn-antlr-web-js/blob/master/todo.g4

grammar todo;

elements
    : (element|emptyLine)* EOF
    ;

element
    : '*' ( ' ' | '\t' )* CONTENT NL+
    ;

emptyLine
    : NL
    ;

NL
    : '\r' | '\n' 
    ;

CONTENT
    : [a-zA-Z0-9_][a-zA-Z0-9_ \t]*
    ;

That is, each element would also have two attributes, such as amount and payee. To keep it simple, I will use the same sentence structure throughout so that parsing is easier.

The format will be pay [payee] [amount]

An example is pay Acme Corp 123,789.45

So the payee is Acme Corp and the amount is 12378945, expressed as an integer number of cents.

Another example is pay Banana Inc 700

So the payee is Banana Inc and the amount is 70000, again expressed as an integer number of cents.
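The conversion from the written amount to cents can be sketched in plain Java, independently of ANTLR. This is only a minimal sketch: the class name AmountToCents and the method toCents are made up for illustration, and it assumes the amount text uses ',' as a thousands separator and '.' as the decimal point, as in the two examples above.

```java
import java.math.BigDecimal;

final class AmountToCents {
    // Converts an amount such as "123,789.45" or "700" to a whole number of cents.
    // Hypothetical helper, not part of ANTLR or of the tutorial.
    static long toCents(String amountText) {
        // Drop the thousands separators, then parse the remaining decimal number.
        BigDecimal amount = new BigDecimal(amountText.replace(",", ""));
        // Shift the decimal point two places; longValueExact throws
        // ArithmeticException if a fraction of a cent would be left over.
        return amount.movePointRight(2).longValueExact();
    }

    public static void main(String[] args) {
        System.out.println(toCents("123,789.45")); // 12378945
        System.out.println(toCents("700"));        // 70000
    }
}
```

Using BigDecimal rather than double avoids rounding surprises when the text has two decimal places.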

I am guessing I need to change the todo.g4 and then regenerate the parser.

Can an element have other attributes? If so, how do I get started?

UPDATE

These are my attempts, most recent first:

I just figured out how to use grun and TestRig. Thanks @Raven for that tip.

Latest attempt: my latest expense.g4 (the only difference from the earlier attempt is the NUMBER rule that matches the payment amount)

grammar expense;

payments: (payment NL)* ;  
payment: PAY receiver amount=NUMBER ;  
receiver: surname=ID (lastname=ID)? ;  

PAY: 'pay' ;
NUMBER: ([0-9]+(','[0-9]+)*)('.'[0-9]*)?;
ID: [a-zA-Z0-9_]+ ;
NL: '\n' | '\r\n' ;  
WS: [\t ]+ -> skip ;

Earlier attempt: This is my expense.g4

grammar expense;

payments: (payment NL)* ;  
payment: PAY receiver amount=NUMBER ;  
receiver: surname=ID (lastname=ID)? ;  

PAY: 'pay' ;
NUMBER: [0-9]+ (',' [0-9]+)+ ('.' [0-9]+)? ;  
ID: [a-zA-Z0-9_]+ ;
NL: '\n' | '\r\n' ;  
WS: [\t ]+ -> skip ;
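For illustration, the difference between the two NUMBER rules can be checked outside ANTLR with java.util.regex. This is only an approximation under stated assumptions: the patterns are hand translations of the two lexer rules, they test whole strings rather than tokenizing a stream, and the class and method names are invented for this sketch.

```java
import java.util.regex.Pattern;

final class NumberRuleCheck {
    // Earlier attempt: at least one ",digits" group is required.
    static final Pattern EARLIER = Pattern.compile("[0-9]+(,[0-9]+)+(\\.[0-9]+)?");
    // Latest attempt: the ",digits" groups are optional.
    static final Pattern LATEST = Pattern.compile("[0-9]+(,[0-9]+)*(\\.[0-9]*)?");

    static boolean matchesEarlier(String s) {
        return EARLIER.matcher(s).matches();
    }

    static boolean matchesLatest(String s) {
        return LATEST.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Amounts taken from the examples in this thread.
        for (String s : new String[] {"123,789.45", "700", "456789.00", "98"}) {
            System.out.println(s + " earlier=" + matchesEarlier(s)
                    + " latest=" + matchesLatest(s));
        }
    }
}
```

Under the earlier rule an amount without a comma, such as 700, is not a NUMBER at all. Because ID also accepts digits, the lexer would turn it into an ID token, and the payment rule would then fail to find its NUMBER.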

Earlier attempt: https://github.com/simkimsia/learn-antlr-web-js/commit/728813ac275a3f2ad16d7f51ce15fcc27d40045b#commitcomment-25127606

Earlier attempt: https://github.com/simkimsia/learn-antlr-web-js/commit/0c32aec6ffb4b4275db86d54e9788058a2ce8759#commitcomment-25125695

I don't understand all the code, but have an eagle eye for typos. Line 56 : `var tokens = new new antlr4.CommonTokenStream(expenseLexer);` --> two times new, probable cause of error. — BernardK, Oct 23 '17 at 07:11
@BernardK Thanks. I have removed the extra new and used the payments function as suggested by Raven. I still see an empty array when I try to console.log — Kim Stacks, Oct 23 '17 at 07:24

score 2 · Accepted Answer · answered Oct 24 '17 at 17:19

Situation as of October 24, 2017 at 19:00 UTC+1.

Your grammar works perfectly. I made a full test in Java.

File Expense.g4 :

grammar Expense;

payments
@init {System.out.println("Expense last update 1853");}
    : (payment NL)*
    ;

payment
    : PAY receiver amount=NUMBER
      {System.out.println("Payement found " + $amount.text + " to " + $receiver.text);}
    ;

receiver
    : surname=ID (lastname=ID)?
    ; 

PAY    : 'pay' ;
NUMBER : ([0-9]+(','[0-9]+)*)('.'[0-9]*)? ;
ID     : [a-zA-Z0-9_]+ ;
NL     : '\n' | '\r\n' ;  
WS     : [\t ]+ -> channel(HIDDEN) ; // keep the spaces (without spaces ==> paydeltaco98)

File ExpenseMyListener.java :

public class ExpenseMyListener extends ExpenseBaseListener {
    ExpenseParser parser;
    public ExpenseMyListener(ExpenseParser parser) { this.parser = parser; }

    public void exitPayments(ExpenseParser.PaymentsContext ctx) {
        System.out.println(">>> in ExpenseMyListener for paymentsss");
        System.out.println(">>> there are " + ctx.payment().size() + " elements in the list of payments");
        for (int i = 0; i < ctx.payment().size(); i++) {
            System.out.println(ctx.payment(i).getText());
        }
    }

    public void exitPayment(ExpenseParser.PaymentContext ctx) {
        System.out.println(">>> in ExpenseMyListener for payment");
        System.out.println(parser.getTokenStream().getText(ctx));
    }
}

File test_expense.java :

import org.antlr.v4.runtime.ANTLRFileStream;
import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.*;
import java.io.IOException;

public class test_expense {
    public static void main(String[] args) throws IOException {
        ANTLRInputStream input = new ANTLRFileStream(args[0]);
        ExpenseLexer lexer = new ExpenseLexer(input);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        ExpenseParser parser = new ExpenseParser(tokens);
        ParseTree tree = parser.payments();
        System.out.println("---parsing ended");
        ParseTreeWalker walker = new ParseTreeWalker();
        ExpenseMyListener my_listener = new ExpenseMyListener(parser);
        System.out.println(">>>> about to walk");
        walker.walk(my_listener, tree);
    }
}

Input file top.text :

pay Acme Corp 123,456
pay Banana Inc 456789.00
pay charlie pte 123,456.89
pay delta co 98

Execution :

$ export CLASSPATH=".:/usr/local/lib/antlr-4.6-complete.jar"
$ alias
alias a4='java -jar /usr/local/lib/antlr-4.6-complete.jar'
alias grun='java org.antlr.v4.gui.TestRig'
$ a4 Expense.g4 
$ javac Ex*.java
$ javac test_expense.java 
$ grun Expense payments -tokens -diagnostics top.text
[@0,0:2='pay',<'pay'>,1:0]
[@1,3:3=' ',<WS>,channel=1,1:3]
[@2,4:7='Acme',<ID>,1:4]
[@3,8:8=' ',<WS>,channel=1,1:8]
[@4,9:12='Corp',<ID>,1:9]
...
[@32,90:89='<EOF>',<EOF>,5:0]
Expense last update 1853
Payement found 123,456 to Acme Corp
Payement found 456789.00 to Banana Inc
Payement found 123,456.89 to charlie pte
Payement found 98 to delta co

$ java test_expense top.text 
Expense last update 1853
Payement found 123,456 to Acme Corp
Payement found 456789.00 to Banana Inc
Payement found 123,456.89 to charlie pte
Payement found 98 to delta co
---parsing ended
>>>> about to walk
>>> in ExpenseMyListener for payment
pay Acme Corp 123,456
>>> in ExpenseMyListener for payment
pay Banana Inc 456789.00
>>> in ExpenseMyListener for payment
pay charlie pte 123,456.89
>>> in ExpenseMyListener for payment
pay delta co 98
>>> in ExpenseMyListener for paymentsss
>>> there are 4 elements in the list of payments
payAcmeCorp123,456
payBananaInc456789.00
paycharliepte123,456.89
paydeltaco98

Thanks for this. I would prefer that you update your answer, because I want to reward it with a bounty of 100. By the way, I am really new to ANTLR and DSLs in general. My long-term aim is to build my own expense tracker using my own DSL. Once I can do the parsing, what should I do next? Should I store the DSL text as it is in a database, or should I store the individual tokens in individual fields in a database? — Kim Stacks, Oct 25 '17 at 01:20
No experience with DSL, I will soon read Tomassetti's book. Somewhere you have to write an application. It could be in the `exitPayment` listener rule that you store `receiver` and `amount` to a database, not the tokens, which are mainly used by the parser. — BernardK, Oct 25 '17 at 07:15
Thank you. I am in direct contact with Tomassetti and have his book. Cheers — Kim Stacks, Oct 26 '17 at 01:27

score 1 · Answer 2 · answered Oct 22 '17 at 11:37

I'm not entirely sure what exactly you want but for the provided examples this grammar should do the job:

payments: (payment NL)* ;  
payment: PAY receiver amount=NUMBER ;  
receiver: surname=ID (lastname=ID)? ;  

PAY: 'pay' ;
NUMBER: [0-9]+ (',' [0-9]+)+ ('.' [0-9]+)? ;  
ID: [a-zA-Z0-9_]+ ;
NL: '\n' | '\r\n' ;  
WS: [\t ]+ -> skip ;

If this is what you were asking for I will add some more explanation if needed...

answered Oct 22 '17 at 11:37

Raven


How do I test your answer? Just replace everything inside todo.g4 with this? – Kim Stacks Oct 22 '17 at 14:01
Yep... except the `grammar todo` header – Raven Oct 22 '17 at 16:44
I did as you suggested. How do I "see" that the parsing is successful? Currently, I have this result https://github.com/simkimsia/learn-antlr-web-js/commit/728813ac275a3f2ad16d7f51ce15fcc27d40045b#commitcomment-25127606 – Kim Stacks Oct 23 '17 at 07:25
Use the TestRig to display the parse tree – Raven Oct 23 '17 at 18:46
Thanks. Your tip on TestRig made me realise what BernardK was writing about as well. It helped. – Kim Stacks Oct 24 '17 at 12:16
I am giving you 100 points because your answer helped me a lot. Thank you very much. – Kim Stacks Oct 26 '17 at 01:26

BernardK · Answer 3 · 2017-10-22T15:58:25.020

I am guessing I need to change the todo.g4 and then regenerate the parser.

Yes, regenerate the parser after each change. For me that is:

$ a4 Question.g4
$ javac Q*.java
$ grun Question elements -tokens -diagnostics t.text

where

$ alias
alias a4='java -jar /usr/local/lib/antlr-4.6-complete.jar'
alias grun='java org.antlr.v4.gui.TestRig'

The more you describe specific contents, the more you may face ambiguity problems. For example, you have two rules :

payment   : 'pay' [payee] [amount]
free_text : ... any character ...

Consider the following content :

* pay Federico Tomassetti 10 € for the tutorial

The beginning of the line, * pay Federico Tomassetti 10, is ambiguous and can be matched by both rules, but the line will finally be parsed as free text, because the remainder, € for the tutorial, does not satisfy payment.

If later you change the payment rule to accept more info after the amount :

payment   : 'pay' [payee] [amount] payment_info

the above content will be matched by payment (when a decision is ambiguous, ANTLR picks the first alternative). The good news is that ANTLR 4 is very good at disambiguating: it looks ahead as far as necessary, up to the whole file.

For ambiguous tokens and precedence rules, read the posts of the last three weeks; a lot has been said.
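The outcome described above (a line counts as a payment only if all of it fits the payment shape, otherwise it falls back to free text) can be imitated with a whole-line regular expression. To be clear, this is not how ANTLR decides: ANTLR predicts alternatives from the token stream, while this sketch simply tests the complete line. The class LineClassifier, the method classify, and the regex are assumptions made for illustration.

```java
import java.util.regex.Pattern;

final class LineClassifier {
    // Rough regex stand-in for: 'pay' receiver NUMBER, where the receiver is
    // one or two words and NUMBER is digits with an optional ",digits" and ".digits".
    static final Pattern PAYMENT = Pattern.compile(
            "pay\\s+[A-Za-z][A-Za-z0-9_]*(\\s+[A-Za-z][A-Za-z0-9_]*)?"
            + "\\s+[0-9]+(,[0-9]+)?(\\.[0-9]+)?");

    // Returns "payment" only when the whole line fits the payment shape;
    // anything else is treated as free text.
    static String classify(String content) {
        return PAYMENT.matcher(content.trim()).matches() ? "payment" : "free_text";
    }

    public static void main(String[] args) {
        System.out.println(classify("pay Acme Corp 123,789.45"));
        System.out.println(classify("pay Banana Inc 700"));
        // \u20ac is the euro sign, escaped to keep the source file ASCII-only.
        System.out.println(classify("pay Federico Tomassetti 10 \u20ac for the tutorial"));
        System.out.println(classify("write a tutorial"));
    }
}
```

The third line starts like a payment but carries extra text after the amount, so it ends up as free text, just as in the explanation above.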

Mixing Raven's grammar with yours, this is one possible solution :

File Question.g4

grammar Question;

elements
@init {System.out.println("Question last update 1432");}
    : ( element | emptyLine )* EOF
    ;

element
    : '*' content NL
    ;

content
    : payment   //{System.out.println("Payement found " + $payment.text);}
    | free_text {System.out.println("Free text found " + $free_text.text);}
    ;

payment
    : PAY receiver amount=NUMBER
      {System.out.println("Payement found " + $amount.text + " to " + $receiver.text);}
    ;

receiver
    : surname=WORD ( lastname=WORD )?
    ;  

free_text
    : ( WORD | PAY | NUMBER )+
    ;

emptyLine
    : NL
    ;

PAY    : 'pay' ;
WORD   : LETTER ( LETTER | DIGIT | '_' )* ;
NUMBER : DIGIT+ ( ',' DIGIT+ )? ( '.' DIGIT+ )? ;  

NL  : [\r\n]
    | '\r\n' 
    ;
//WS  : [ \t]+ -> skip ; // $payment.text => payAcmeCorp123,789.45
WS  : [ \t]+ -> channel(HIDDEN) ; // spaces are needed to nicely display $payment.text

fragment DIGIT  : [0-9] ;
fragment LETTER : [a-zA-Z] ;

File t.text

* play with ANTLR 4
* write a tutorial
* pay Acme Corp 123,789.45
* pay Banana Inc 700
* pay Federico Tomassetti 10 € for the tutorial

Execution :

$ grun Question elements -tokens -diagnostics t.text
line 5:29 token recognition error at: '€'
[@0,0:0='*',<'*'>,1:0]
[@1,1:1=' ',<WS>,channel=1,1:1]
[@2,2:5='play',<WORD>,1:2]
[@3,6:6=' ',<WS>,channel=1,1:6]
[@4,7:10='with',<WORD>,1:7]
[@5,11:11=' ',<WS>,channel=1,1:11]
[@6,12:16='ANTLR',<WORD>,1:12]
[@7,17:17=' ',<WS>,channel=1,1:17]
[@8,18:18='4',<NUMBER>,1:18]
[@9,19:19='\n',<NL>,1:19]
[@10,20:20='*',<'*'>,2:0]
[@11,21:21=' ',<WS>,channel=1,2:1]
[@12,22:26='write',<WORD>,2:2]
[@13,27:27=' ',<WS>,channel=1,2:7]
[@14,28:28='a',<WORD>,2:8]
[@15,29:29=' ',<WS>,channel=1,2:9]
[@16,30:37='tutorial',<WORD>,2:10]
[@17,38:38='\n',<NL>,2:18]
...
[@56,136:135='<EOF>',<EOF>,7:0]
Question last update 1432
Free text found play with ANTLR 4
Free text found write a tutorial
line 3:26 reportAttemptingFullContext d=2 (content), input='pay Acme Corp 123,789.45
'
...
Payement found 700 to Banana Inc
Free text found pay Federico Tomassetti 10  for the tutorial

As you can see, the € symbol is not recognized. You may need a CONTENT rule similar to FIELDTEXT here, and then you get into trouble ...

Federico's Mega tutorial is a good start. For nitty-gritty details, see The Definitive ANTLR 4 Reference or the online doc from www.antlr.org.

I am using the JavaScript runtime of ANTLR 4.7 and I followed Tomassetti's tutorial quite strictly, so I use the Parse button to kick-start the parsing. Your answer appears to use the Java runtime. Is there a way I can use the JavaScript runtime instead? — Kim Stacks, Oct 22 '17 at 14:05
Sorry, I have never used other targets than Java. You need to remove/replace the `{System.out.println...`. I have added a link to Tomassetti's Mega Tutorial which uses several different target languages. — BernardK, Oct 22 '17 at 16:04
I now understand much more clearly what you were doing at the start of your answer. — Kim Stacks, Oct 24 '17 at 12:20
