5

i want to parse something like this in my lexer:

( begin expression )

where expressions are also surrounded by brackets. it isn't important what is in the expression, i just want to have all what's between the (begin and the matching ) as a token. an example would be:

(begin 
    (define x (+ 1 2)))

so the text of the token should be

(define x (+ 1 2)))

something like

PROGRAM : LPAREN BEGIN .* RPAREN;

does (obviously) not work because as soon as he sees a ")", he thinks the rule is over, but i need the matching bracket for this.

how can i do that?

zhujik
  • 6,514
  • 2
  • 36
  • 43
  • If you want to recognize everything between (...) including other nested (...), then you can't do it with a pure, standard regexp based lexer (such as ANTLR has, I think), because regexes can't count --> can't determine when the parentheses are balanced. You need to implement a lexeme recognizers that can count parentheses. – Ira Baxter Aug 03 '11 at 01:04

1 Answers1

5

Inside lexer rules, you can invoke rules recursively. So, that's one way to solve this. Another approach would be to keep track of the number of open- and close parenthesis and let a gated semantic predicate loop as long as your counter is more than zero.

A demo:

T.g

grammar T;

parse
  :  BeginToken {System.out.println("parsed :: " + $BeginToken.text);} EOF
  ;

BeginToken 
@init{int open = 1;}
  :  '(' 'begin' ( {open > 0}?=>              // keep reapeating `( ... )*` as long as open > 0
                     ( ~('(' | ')')           // match anything other than parenthesis
                     | '('          {open++;} // match a '(' in increase the var `open`
                     | ')'          {open--;} // match a ')' in decrease the var `open`
                     )
                 )*
  ;

Main.java

import org.antlr.runtime.*;

public class Main {
  public static void main(String[] args) throws Exception {
    String input = "(begin (define x (+ (- 1 3) 2)))";
    TLexer lexer = new TLexer(new ANTLRStringStream(input));
    TParser parser = new TParser(new CommonTokenStream(lexer));
    parser.parse();
  }
}
java -cp antlr-3.3-complete.jar org.antlr.Tool T.g
javac -cp antlr-3.3-complete.jar *.java
java -cp .:antlr-3.3-complete.jar Main

parsed :: (begin (define x (+ (- 1 3) 2)))

Note that you'll need to beware of string literals inside your source that might include parenthesis:

BeginToken
@init{int open = 1;}
  :  '(' 'begin' ( {open > 0}?=>              // ...
                     ( ~('(' | ')' | '"')     // ...
                     | '('          {open++;} // ...
                     | ')'          {open--;} // ...
                     |  '"' ...               // TODO: define a string literal here
                     )
                 )*
  ;

or comments that may contain parenthesis.

The suggestion with the predicate uses some language specific code (Java, in this case). An advantage of calling a lexer rule recursively is that you don't have custom code in your lexer:

BeginToken 
  :  '(' Spaces? 'begin' Spaces? NestedParens Spaces? ')'
  ;

fragment NestedParens
  :  '(' ( ~('(' | ')') | NestedParens )* ')'
  ;

fragment Spaces
  :  (' ' | '\t')+
  ;
Community
  • 1
  • 1
Bart Kiers
  • 166,582
  • 36
  • 299
  • 288
  • @Ira, it _is_ a single lexeme. Rules that start with an uppercase are tokens in ANTLR. My naming the rule `Root` was perhaps a poor choice, which lead you to believe I pushed the problem to the parser... – Bart Kiers Aug 03 '11 at 06:38
  • Aha. Yeah, as you might guess, I'm not exactly an ANTLR expert. So ...this is the answer I expected you would provide. I'll never doubt you again! – Ira Baxter Aug 03 '11 at 06:51
  • @Bart: i just noticed another problem i have with this: the lexer won't recognize statements like `(bar ...)` correctly anymore, instead he throws away the '(ba' and just returns the `r` as an identifier altough i want to have LPAREN IDENTIFIER. how can i fix this? should i open another question for that and post the grammar again? – zhujik Sep 05 '11 at 18:49
  • 1
    Yeah, that's something the lexer doesn't like. In case of input that "looks" like `(b...`, the lexer has problems backing up to `LPAREN IDENTIFIER` instead of `BeginToken`. Is it really something you want to handle in your lexer? It'd be easier to match a `beginToken` in the parser than a `BeginToken` in the lexer. And yeah, it'd probably be easier to create a new question (in case you really want to do this in the lexer). – Bart Kiers Sep 05 '11 at 21:12