0

I have been searching the net for the past few hours, trying to learn a simple example of using ANTLR.But I am having a hard time understanding the examples. Does any body have simple example that would output this in Java:

if my input is printf("Hello World");

the output should be :

Hello World

and if my input is

inx =1;

it should give an error message.

I'm trying to create a c++ compiler(starting with the lexical up until the semantic part only) using java,and I would really like to know what I should do.

marchemike
  • 3,179
  • 13
  • 53
  • 96
  • What is the grammar of your input? Is that input of yours all one token?? If so thats easy! If you want to parse it like C, then its far from "simple." – Austin Henley Mar 20 '12 at 15:49
  • For the record, C++ is very hard to parse correctly. It is context sensitive. – Austin Henley Mar 20 '12 at 15:55
  • And you mention your output... So you are writing an interpreter, not a compiler? – Austin Henley Mar 20 '12 at 15:58
  • @AustinHenley is that so?Then what could you suggest would be an alternative programming language that I could use to make my grammar much easier?Would Python be a good programming language? – marchemike Mar 20 '12 at 16:02
  • @AustinHenley I'm actually planning to create an interpreter that if my input passes all the analysis parts, I'll just port it to a ready made output generator, like in C:% gcc -c – marchemike Mar 20 '12 at 16:05
  • My first language was a subset of Pascal called PL/0. It was designed to be easy to implement. BASIC is also very easy. – Austin Henley Mar 20 '12 at 16:08
  • @AustinHenley or should I just use the grammars listed in the ANTLR website and embed that to my program?but that would mean I won't learn anything. – marchemike Mar 20 '12 at 16:09
  • @MichaelEvangelista Check out my answer below. Also dont confuse parsing with compiling. – Austin Henley Mar 20 '12 at 17:30
  • See: http://stackoverflow.com/questions/278480/antlr-tutorials – Bart Kiers Mar 20 '12 at 17:43

2 Answers2

5

Here is a grammar that almost does what you want:

grammar PrintLang;

sentence 
    :    statement
    ;

statement 
    :   functionCall '(' argument ')' ';'
    { 
      if ($functionCall.funName.equals("printf")) {
        System.out.println($argument.arg);
      }
    }
    ;

functionCall returns [String funName]
    :    ID 
    { $funName = $ID.text; }
    ;

argument returns [String arg]
    :   STRING
    { $arg = $STRING.text; }
    ;

ID  :   ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
    ;

WS  :   ( ' '
        | '\t'
        | '\r'
        | '\n'
        ) {$channel=HIDDEN;}
    ;

STRING
    :  '"' ( ESC_SEQ | ~('\\'|'"') )* '"'
    ;

fragment
HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;

fragment
ESC_SEQ
    :   '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
    |   UNICODE_ESC
    |   OCTAL_ESC
    ;

fragment
OCTAL_ESC
    :   '\\' ('0'..'3') ('0'..'7') ('0'..'7')
    |   '\\' ('0'..'7') ('0'..'7')
    |   '\\' ('0'..'7')
    ;

fragment
UNICODE_ESC
    :   '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
    ;

I generated this in AntlrWorks. All of the token rules were generated for me.

Here is the java file to test it.

import org.antlr.runtime.*;


public class PrintIt {
  public static void main(String args[]) {
    String inputString = "printf(\"HelloWorld\");";

    // Create an input character stream from standard in
    ANTLRStringStream input = new ANTLRStringStream(inputString); 
    // Create an ExprLexer that feeds from that stream 
    PrintLangLexer lexer = new PrintLangLexer(input);
    // Create a stream of tokens fed by the lexer 
    CommonTokenStream tokens = new CommonTokenStream(lexer); 
    // Create a parser that feeds off the token stream 
    PrintLangParser plParser = new PrintLangParser(tokens);
    try {
        plParser.sentence();
    } catch (Exception e) {
        e.printStackTrace();
    }
  }
}

You'll note that this java code is almost a verbatim copy/paste from the Antlr website example (I don't believe I even changed the comments, which is why the comment refers to Standard in, but the code actually uses a String). And here is the command line I used to do it.

bash$ java -cp ./antlr-3.4-complete.jar org.antlr.Tool PrintLang.g
bash$ javac -cp ./:./antlr-3.4-complete.jar PrintIt.java 
bash$ java -cp antlr-3.4-complete.jar:. PrintIt
"HelloWorld"

Oops, I forgot that the string I wanted to print isn't the matched token ("HelloWorld", including the quotes), it's the string within the quotes.

Also, you'll note that I hardcoded the lookup of printf as a string comparison. In reality, you'll want an environment that contains the symbols accessible at a given scope (related, see antlr's "scope" construct. More difficult, though sometimes useful: create an environment that you pass to each parsing rule).

Most important: find Bart Kiers answers by searching SO for more antlr questions. He posts excellent examples.

ccoakley
  • 3,235
  • 15
  • 12
2

From ANTLR here is the trivial example of parsing (and evaluating) an expression.

grammar Expr;

@header {
package test;
import java.util.HashMap;
}

@lexer::header {package test;}

@members {
/** Map variable name to Integer object holding value */
HashMap memory = new HashMap();
}

prog:   stat+ ;

stat:   expr NEWLINE {System.out.println($expr.value);}
    |   ID '=' expr NEWLINE
        {memory.put($ID.text, new Integer($expr.value));}
    |   NEWLINE
    ;

expr returns [int value]
    :   e=multExpr {$value = $e.value;}
        (   '+' e=multExpr {$value += $e.value;}
        |   '-' e=multExpr {$value -= $e.value;}
        )*
    ;

multExpr returns [int value]
    :   e=atom {$value = $e.value;} ('*' e=atom {$value *= $e.value;})*
    ; 

atom returns [int value]
    :   INT {$value = Integer.parseInt($INT.text);}
    |   ID
        {
        Integer v = (Integer)memory.get($ID.text);
        if ( v!=null ) $value = v.intValue();
        else System.err.println("undefined variable "+$ID.text);
        }
    |   '(' e=expr ')' {$value = $e.value;}
    ;

    ID  :   ('a'..'z'|'A'..'Z')+ ;
    INT :   '0'..'9'+ ;
    NEWLINE:'\r'? '\n' ;
    WS  :   (' '|'\t')+ {skip();} ;

But like I mentioned in my comments, C++ is very hard to parse correctly. There are many ambiguities and requires * amount of look ahead (which ANTLR does provide). So doing this in any efficient form is complicated. That is why I recommend implementing something like PL/0 which was designed for students to write their first compiler for. Tiny BASIC is also a good start. Both of these can be implemented without using a tool like ANTLR by doing recursive descent. I have implemented both in under 1000 lines together (in C++ and C# respectively).

ANTLR is a great tool though, especially once you get your head wrapped around recursive descent you might want to upgrade to a more powerful parser. I recommend both of Terrence Parr's books, ANTLR Reference and Language Implementation Patterns. The ANTLR book will tell you everything (plus some) that you want to know about ANTLR. The second book will teach you all about parsers and compilers, from recursive descent to black-magic backtracking.

More resources from a similar SO question can be found here. And if you're into Lisp or Scheme, you can check out JScheme, it is written in Java (less than 1000 lines I believe).

Community
  • 1
  • 1
Austin Henley
  • 4,625
  • 13
  • 45
  • 80