ANTLR - Possible to process tokens with variable, specified length in a grammar?

Question

Example String

023abc7defghij

Header

Characters 0, 1 = Size of following chunks

Chunks

First character = length of the following string

Following characters = String with the specified length

Example result

So in the example above, this would mean:

02 -> 2 following chunks

3 -> 3 character String will follow

abc -> the three character string

7 -> 7 character String will follow

defghij -> the seven character string

Question

Can I write a grammar that describes this form of string? I would need to interpret the length information and then build tokens of the specified length, so that I can fill my objects with the lengths and the strings.

I hope I have described this comprehensibly. I could not find any information describing or solving my problem.

score 2 · Accepted Answer · edited May 23 '17 at 12:12

I'm assuming your actual problem is a bit more complicated, because if "023abc7defghij" is your actual input, I wouldn't use a parser generator like ANTLR, but just stick with some simple string-operations.
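
For comparison, the plain string-operations route could be sketched roughly as follows. This is only a minimal sketch under the format described in the question (a two-digit chunk count, then chunks made of a one-digit length followed by that many characters); the class name ChunkReader and the method readChunks are made up for illustration, and the error handling is an assumption about what you would want:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of ANTLR: reads the format by hand.
public class ChunkReader {

  // Returns the chunk strings; throws IllegalArgumentException on malformed input.
  public static List<String> readChunks(String input) {
    if (input.length() < 2) {
      throw new IllegalArgumentException("missing 2-digit header");
    }
    // Characters 0 and 1 hold the number of chunks that follow.
    int count = Integer.parseInt(input.substring(0, 2));
    List<String> chunks = new ArrayList<>();
    int pos = 2;
    for (int i = 0; i < count; i++) {
      if (pos >= input.length()) {
        throw new IllegalArgumentException("expected " + count + " chunks, found " + i);
      }
      // The first character of a chunk is the length of its string.
      int len = Character.digit(input.charAt(pos), 10);
      if (len < 0) {
        throw new IllegalArgumentException("expected a length digit at index " + pos);
      }
      pos++;
      if (pos + len > input.length()) {
        throw new IllegalArgumentException("chunk " + i + " is shorter than " + len);
      }
      chunks.add(input.substring(pos, pos + len));
      pos += len;
    }
    if (pos != input.length()) {
      throw new IllegalArgumentException("unexpected trailing input at index " + pos);
    }
    return chunks;
  }

  public static void main(String[] args) {
    for (String chunk : readChunks("023abc7defghij")) {
      System.out.println("chunk=" + chunk);
    }
  }
}
```

Unlike the grammar further down, this sketch does use the header count and rejects input whose number of chunks does not match it.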

That said, here's a possible solution:

Since your chunks are not known up front, you cannot create any tokens other than a single Digit and an Other token that would be any char other than a digit. Note that you don't really need the header information: you simply parse "3" and then get the next 3 chars, then parse the "7" and get the next 7 chars, ... all the way up to the end of the file.

A grammar for such a language could look like this:

grammar T;

parse
  :  file EOF
  ;

file
  :  header chunk*
  ;

header
  :  Digit Digit
  ;

chunk
  :  Digit any*
  ;

any
  :  Digit
  |  Other
  ;

Digit
  :  '0'..'9'
  ;

Other
  :  .
  ;

But now the chunk rule is ambiguous: it does not know when to stop consuming characters. This can be resolved using a gated semantic predicate that will cause the * from any* to stop consuming when a certain condition has been met (when a counter int n has been counted down to zero, in this case).

The grammar above including this predicate and some println-statements would look like this:

grammar T;

parse
  :  file EOF
  ;

file
  :  header {System.out.println("header=" + $header.text);}
     (chunk {System.out.println("chunk=" + $chunk.text);})*
  ;

header
  :  Digit Digit
  ;

chunk
  :  Digit {int n = Integer.valueOf($Digit.text);} ({n > 0}?=> any {n--;})*
  ;

any
  :  Digit
  |  Other
  ;

Digit
  :  '0'..'9'
  ;

Other
  :  .
  ;

which can be tested with the class:

import org.antlr.runtime.*;

public class Main {
  public static void main(String[] args) throws Exception {
    String source = "023abc7defghij";
    TLexer lexer = new TLexer(new ANTLRStringStream(source));
    TParser parser = new TParser(new CommonTokenStream(lexer));
    parser.parse();
  }
}

If you now generate a lexer and parser, compile all .java files and run the Main class:

java -cp antlr-3.3.jar org.antlr.Tool T.g
javac -cp antlr-3.3.jar *.java
java -cp .:antlr-3.3.jar Main

you would see the following being printed to your console:

header=02
chunk=3abc
chunk=7defghij

Thanks for this great explanation Bart! I thought about using ANTLR for this, but now, I stick with string operations, as you suggest. My real problem is a lot more complicated than my stripped down example. Thanks again for helping - your example was very instructive. — Kai Mechel, Sep 20 '11 at 05:11
For future readers: note that the syntax has changed in antlr4+ (see e.g. https://github.com/antlr/antlr4/blob/master/doc/predicates.md#using-context-dependent-predicates) — Liam Williams, Nov 20 '17 at 20:35