
I know '+', '?' and '*'. But what if I want something to repeat exactly, say, 5 times? For example, what if an identifier must be a string of hexadecimal digits of length 5?

To be more specific, I'm thinking about defining a general lexer rule of unlimited length and then, at parsing time, counting how many times it repeated; if the count equals 5, rename it as another type of token. But how can I do this? Or is there an easier way?

safarisoul

2 Answers


at parsing time, counting how many times it repeated; if the count equals 5, rename it as another type of token. But how can I do this? Or is there an easier way?

Yes, you can do that with a disambiguating semantic predicate (explanation):

grammar T;

parse
 : (short_num | long_num)+ EOF
 ;

short_num
 : {input.LT(1).getText().length() == 5}? NUM
 ;

long_num
 : {input.LT(1).getText().length() == 8}? NUM
 ;

NUM
 : '0'..'9'+
 ;

SP
 : ' ' {skip();}
 ;

which will parse the input 12345 12345678 as a short_num (12345) followed by a long_num (12345678).
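To make the decision logic concrete, here is a rough sketch in plain Java of what the predicates in the grammar above amount to. It is not ANTLR-generated code: the class PredicateParserSketch and its methods lex and parse are made-up names for illustration. Every token has the same type, and the alternative is picked by inspecting the text of the lookahead token.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hand-written stand-in for the generated parser; not ANTLR API.
class PredicateParserSketch {

    // Lexer side: every token is a NUM, only its text differs.
    // Spaces are dropped, like the SP rule with skip().
    static List<String> lex(String input) {
        List<String> nums = new ArrayList<>();
        for (String piece : input.trim().split(" +")) {
            if (!piece.matches("[0-9]+")) {
                throw new IllegalArgumentException("not a NUM: '" + piece + "'");
            }
            nums.add(piece);
        }
        return nums;
    }

    // parse : (short_num | long_num)+ EOF
    // Returns the name of the rule chosen for each NUM token.
    static List<String> parse(String input) {
        List<String> matchedRules = new ArrayList<>();
        for (String lookahead : lex(input)) {
            // The predicate looks at the lookahead token's text
            // before an alternative is chosen.
            if (lookahead.length() == 5) {
                matchedRules.add("short_num");
            } else if (lookahead.length() == 8) {
                matchedRules.add("long_num");
            } else {
                // Neither predicate holds, so no alternative applies.
                throw new IllegalStateException(
                    "no viable alternative at '" + lookahead + "'");
            }
        }
        return matchedRules;
    }

    public static void main(String[] args) {
        System.out.println(parse("12345 12345678")); // [short_num, long_num]
    }
}
```

A NUM of any other length is a syntax error in this sketch, because the grammar offers no alternative for it.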

But you can also change the type of the token in the lexer based on some property of the matched text, like this:

grammar T;

parse
 : (SHORT | LONG)+ EOF
 ;

NUM
 : '0'..'9'+
   {
     if(getText().length() == 5) $type = SHORT;
     if(getText().length() == 8) $type = LONG;
     // when the length is other than 5 or 8, the type of the token will stay NUM
   }
 ;

SP
 : ' ' {skip();}
 ;

fragment SHORT : ;
fragment LONG : ;

which will cause the same input to be tokenized as a SHORT (12345) followed by a LONG (12345678), both matched directly by the parse rule.
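The lexer-side variant can be sketched in plain Java as well. Again, this is only an illustration under assumptions, not what ANTLR generates: LengthTypedLexer, Token, typeFor and tokenize are hypothetical names. The point it shows is that the text is matched first by one general rule, and the token type is chosen afterwards from the length of the matched text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hand-written stand-in for the generated lexer; not ANTLR API.
class LengthTypedLexer {
    enum Type { NUM, SHORT, LONG }

    static final class Token {
        final Type type;
        final String text;
        Token(Type type, String text) { this.type = type; this.text = text; }
        @Override public String toString() { return type + "(" + text + ")"; }
    }

    // Mirrors the action in the NUM rule: pick the type from the matched length.
    static Type typeFor(String text) {
        if (text.length() == 5) return Type.SHORT;
        if (text.length() == 8) return Type.LONG;
        return Type.NUM; // any other length keeps the generic type
    }

    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == ' ') { i++; continue; } // like SP: skipped, no token emitted
            if (c < '0' || c > '9') {
                throw new IllegalArgumentException(
                    "unexpected character '" + c + "' at index " + i);
            }
            int start = i;
            // Match '0'..'9'+ greedily, then decide the type afterwards.
            while (i < input.length()
                    && input.charAt(i) >= '0' && input.charAt(i) <= '9') {
                i++;
            }
            String text = input.substring(start, i);
            tokens.add(new Token(typeFor(text), text));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [SHORT(12345), LONG(12345678), NUM(42)]
        System.out.println(tokenize("12345 12345678 42"));
    }
}
```

Unlike the predicate approach, nothing fails at the lexer level for other lengths: such a token simply stays a NUM, and it is up to the parser whether a NUM is acceptable at that point.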

Bart Kiers

ANTLR has no counted-repetition operator, so you need to write the element out 5 times, for example:

ZIPCODE: '0'..'9' '0'..'9' '0'..'9' '0'..'9' '0'..'9'; 

Alternatively, you can use a validating semantic predicate:

DIGIT: '0'..'9';
zipcode
@init { int N = 0; }
  :  (DIGIT { N++; } )+ { N == 5 }?
  ;

See: What is a 'semantic predicate' in ANTLR?
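As a rough illustration of the count-then-validate idea, here is a plain Java sketch, assuming the intent is exactly five digits. It is not ANTLR-generated code, and ZipcodeRuleSketch and zipcode are made-up names. The digits are consumed first, and the check on the counter runs only afterwards, so a wrong count shows up as an error rather than as a fallback to some other rule.

```java
// Hypothetical stand-in for the zipcode rule; not ANTLR API.
class ZipcodeRuleSketch {

    // zipcode : (DIGIT { n++; })+ { n == 5 }?
    // Returns the matched digits, or throws if the check on the count fails.
    static String zipcode(String input) {
        int n = 0; // plays the role of the counter declared in @init
        while (n < input.length()
                && input.charAt(n) >= '0' && input.charAt(n) <= '9') {
            n++; // one DIGIT consumed
        }
        if (n == 0) {
            throw new IllegalArgumentException("expected at least one DIGIT");
        }
        // The validating check runs after the digits were already consumed.
        if (n != 5) {
            throw new IllegalArgumentException(
                "predicate failed: matched " + n + " digits, expected 5");
        }
        return input.substring(0, n);
    }

    public static void main(String[] args) {
        System.out.println(zipcode("90210")); // 90210
    }
}
```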

Diego
  • Hi, this works in a parser grammar. Is it possible to do it in a lexer grammar? – safarisoul Mar 07 '12 at 01:48
  • Okay, I've got it working in a lexer grammar now. But I can only have one such rule. Is it possible to name a token of length 5 SHORT and, at the same time, name a token of length 8 LONG? ANTLR complains: "The following token definitions can never be matched". – safarisoul Mar 07 '12 at 05:35
  • I mean, this way, every time {}? evaluates to false, those characters are ignored, but I want the lexer to check for other potential matches. – safarisoul Mar 07 '12 at 05:40