Since you are interested in parsing, here is a quickly made example, just to give you a taste. I have learned Lex/Yacc, Flex/Bison, ANTLR v3 and ANTLR v4, and I strongly recommend ANTLR4, which is very powerful.
The following grammar parses only the example input you provided.
File Question.g4:
grammar Question;
/* Simple grammar example to parse the following code :
file alldataset; append next; xyz;
if file.first? do line + "\n";
if !file.last? do line.indent(2);
end;
end;
file file2; xyz;
*/
start
@init {System.out.println("Question last update 1048");}
    :   file* EOF
    ;

file
    :   FILE ID ';' statement_p*
    ;

statement_p
    :   statement
        {System.out.println("Statement found : " + $statement.text);}
    ;

statement
    :   'append' ID ';'
    |   if_statement
    |   other_statement
    |   'end' ';'
    ;

if_statement
    :   'if' expression 'do' expression ';'
    ;

other_statement
    :   ID ';'
    ;

expression
    :   receiver=( ID | FILE ) '.' method_call # Send
    |   expression '+' expression              # Addition
    |   '!' expression                         # Negation
    |   atom                                   # An_atom
    ;

method_call
    :   method_name=ID arguments?
    ;

arguments
    :   '(' ( argument ( ',' argument )* )? ')'
    ;

argument
    :   ID | NUMBER
    ;

atom
    :   ID
    |   FILE
    |   STRING
    ;

FILE : 'file' ;
ID : LETTER ( LETTER | DIGIT | '_' )* ( '?' | '!' )? ;
NUMBER : DIGIT+ ( ',' DIGIT+ )? ( '.' DIGIT+ )? ;
STRING : '"' .*? '"' ;
NL : ( [\r\n] | '\r\n' ) -> skip ;
WS : [ \t]+ -> channel(HIDDEN) ;
fragment DIGIT : [0-9] ;
fragment LETTER : [a-zA-Z] ;
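A side note on the lexer rules above: an ANTLR lexer takes the longest match at each position, and when two rules match the same text, the one declared first wins. Literals used in parser rules ('append', 'if', 'do', 'end') become implicit tokens that are tried before the named rules. This is why `file` is a FILE token while `file2` and `first?` are ID tokens. The following is only a toy simulation of that behaviour with `java.util.regex`, not ANTLR's implementation; the class name `LexerPrioritySketch` and the grouping of keywords and punctuation into single rules are my own simplifications.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy longest-match tokenizer; an illustration only, not how ANTLR is implemented. */
class LexerPrioritySketch {
    // Rules in declaration order: on a tie in match length, the earlier rule wins.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        // Assumption: all parser-rule literals are lumped into one KEYWORD rule.
        RULES.put("KEYWORD", Pattern.compile("append|if|do|end"));
        RULES.put("FILE",    Pattern.compile("file"));
        RULES.put("ID",      Pattern.compile("[a-zA-Z][a-zA-Z0-9_]*[?!]?"));
        RULES.put("NUMBER",  Pattern.compile("[0-9]+(,[0-9]+)?(\\.[0-9]+)?"));
        RULES.put("PUNCT",   Pattern.compile("[;.()+!,]"));
        RULES.put("WS",      Pattern.compile("[ \\t\\r\\n]+"));
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestRule = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // Strictly longer matches replace the current best; ties keep the earlier rule.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestRule = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestRule == null) {
                throw new IllegalArgumentException("No rule matches at offset " + pos);
            }
            if (!bestRule.equals("WS")) {   // whitespace is dropped from the result
                tokens.add(bestRule + ":" + input.substring(pos, bestEnd));
            }
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("file file2; xyz;"));
        System.out.println(tokenize("if file.first? do"));
    }
}
```

The second call classifies `file` as FILE and `first?` as a single ID, which agrees with the token listing produced by grun further down.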
File input.txt:
file alldataset; append next; xyz;
if file.first? do line + "\n";
if !file.last? do line.indent(2);
end;
end;
file file2; xyz;
Execution:
$ export CLASSPATH=".:/usr/local/lib/antlr-4.6-complete.jar"
$ alias
alias a4='java -jar /usr/local/lib/antlr-4.6-complete.jar'
alias grun='java org.antlr.v4.gui.TestRig'
$ a4 Question.g4
$ javac Q*.java
$ grun Question start -tokens -diagnostics input.txt
[@0,0:0=' ',<WS>,channel=1,1:0]
[@1,1:4='file',<'file'>,1:1]
[@2,5:5=' ',<WS>,channel=1,1:5]
[@3,6:15='alldataset',<ID>,1:6]
[@4,16:16=';',<';'>,1:16]
[@5,17:17=' ',<WS>,channel=1,1:17]
[@6,18:23='append',<'append'>,1:18]
[@7,24:24=' ',<WS>,channel=1,1:24]
[@8,25:28='next',<ID>,1:25]
[@9,29:29=';',<';'>,1:29]
[@10,30:30=' ',<WS>,channel=1,1:30]
[@11,31:33='xyz',<ID>,1:31]
[@12,34:34=';',<';'>,1:34]
[@13,36:36=' ',<WS>,channel=1,2:0]
[@14,37:38='if',<'if'>,2:1]
[@15,39:39=' ',<WS>,channel=1,2:3]
[@16,40:43='file',<'file'>,2:4]
[@17,44:44='.',<'.'>,2:8]
[@18,45:50='first?',<ID>,2:9]
[@19,51:51=' ',<WS>,channel=1,2:15]
[@20,52:53='do',<'do'>,2:16]
[@21,54:54=' ',<WS>,channel=1,2:18]
[@22,55:58='line',<ID>,2:19]
[@23,59:59=' ',<WS>,channel=1,2:23]
[@24,60:60='+',<'+'>,2:24]
[@25,61:61=' ',<WS>,channel=1,2:25]
[@26,62:65='"\n"',<STRING>,2:26]
[@27,66:66=';',<';'>,2:30]
...
[@59,133:132='<EOF>',<EOF>,7:0]
Question last update 1048
Statement found : append next;
Statement found : xyz;
Statement found : if file.first? do line + "\n";
Statement found : if !file.last? do line.indent(2);
Statement found : end;
Statement found : end;
Statement found : xyz;
One advantage of ANTLR4 over previous versions or other parser generators is that the processing code is no longer scattered among the parser rules, but gathered in a separate listener. This is where you do the actual processing, such as producing a new reformatted file. A complete example would be too long to show here. Today you can write the listener in C++, C#, Python and other languages. As I don't know Java, I use a setup based on JRuby; see my forum answer.
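To give a rough idea of the processing a listener could delegate to, here is a small sketch in plain Java that re-emits each statement on its own line, indented under its `file` header. The class `StatementFormatter` and its methods are hypothetical names of mine, and the layout (two-space indent, one statement per line) is an arbitrary choice, not something the original language prescribes. The listener wiring is shown only as a comment, because it needs the classes ANTLR generates from Question.g4.

```java
/** Re-emits statements one per line, indented under their file header (hypothetical helper). */
class StatementFormatter {
    private static final String INDENT = "  ";
    private final StringBuilder out = new StringBuilder();

    /** Meant to be called when the parser enters a `file` rule. */
    void fileHeader(String name) {
        out.append("file ").append(name).append(";\n");
    }

    /** Meant to be called once per recognized statement. */
    void statement(String text) {
        out.append(INDENT).append(text.trim()).append('\n');
    }

    String result() {
        return out.toString();
    }

    // Hypothetical wiring, assuming the QuestionBaseListener and QuestionParser
    // classes generated by ANTLR from Question.g4 (not compiled here):
    //
    //   class ReformatListener extends QuestionBaseListener {
    //       @Override public void enterFile(QuestionParser.FileContext ctx) {
    //           fmt.fileHeader(ctx.ID().getText());
    //       }
    //       @Override public void exitStatement(QuestionParser.StatementContext ctx) {
    //           // tokens: the token stream given to the parser; getText(ctx) keeps
    //           // the hidden whitespace, whereas ctx.getText() would drop it.
    //           fmt.statement(tokens.getText(ctx));
    //       }
    //   }

    public static void main(String[] args) {
        // Simulates the callbacks with the statements reported in the run above.
        StatementFormatter fmt = new StatementFormatter();
        fmt.fileHeader("alldataset");
        fmt.statement("append next;");
        fmt.statement("xyz;");
        fmt.statement("if file.first? do line + \"\\n\";");
        fmt.statement("if !file.last? do line.indent(2);");
        fmt.statement("end;");
        fmt.statement("end;");
        fmt.fileHeader("file2");
        fmt.statement("xyz;");
        System.out.print(fmt.result());
    }
}
```

With a listener like this, the `System.out.println` action embedded in the `statement_p` rule would no longer be needed, and the grammar would stay free of target-language code.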