Simple Antlr3 Token parsing

Question

While I'm somewhat comforted by the number of questions regarding ANTLR grammars (it's not just me trying to shave this yak-shaped thing), I haven't found a question/answer that comes close to helping with my issue.

I'm using ANTLR 3.3 with a combined lexer/parser grammar.

I'm using gUnit to help prove the grammar, and some jUnit tests; this is where the fun begins.

I have a simple config file I want to parse:

identifier foobar {
port=8080
stub plusone.google.com {
        status-code = 206
        header = []
        body = []
  }
 }

I'm having trouble parsing the "identifier" (foobar in this example). Valid names I want to allow are:

foobar
foo-bar
foo_bar
foobar2
foo-bar2
foo_bar2
3foobar
_foo-bar3

and so on. A valid name can therefore use the characters 'a'..'z', 'A'..'Z', '0'..'9', '_' and '-'.

The grammar I've arrived at is this (note this isn't the full grammar, just the portion pertinent to this question):

fragment HYPHEN : '-' ;

fragment UNDERSCORE : '_' ;

fragment DIGIT  : '0'..'9' ;

fragment LETTER : 'a'..'z' |'A'..'Z' ;

fragment NUMBER : DIGIT+ ;

fragment WORD : LETTER+  ;

IDENTIFIER : DIGIT | LETTER (LETTER | DIGIT | HYPHEN | UNDERSCORE)*;

And the corresponding gUnit test:

IDENTIFIER:
"foobar" OK
"foo_bar" OK
"foo-bar" OK
"foobar1" OK
"foobar12" OK
"foo-bar2" OK
"foo_bar2" OK
"foo-bar-2" OK
"foo-bar_2" OK
"5foobar" OK
"f_2-a" OK
"aA0_" OK
// no "funny chars"
"foo@bar" FAIL
// not with whitespace
"foo bar" FAIL

When I run the gUnit tests, only "5foobar" fails. I've managed to parse the difficult stuff, and yet the seemingly simple task of parsing an identifier has beaten me.

Can anyone point me to where I'm going wrong? How can I match without being greedy?

Many thanks in advance.

-- UPDATE --

I changed the grammar as per Bart's answer, to this:

IDENTIFIER : ('0'..'9'| 'a'..'z'|'A'..'Z' | '_'|'-') ('_'|'-'|'a'..'z'|'A'..'Z'|'0'..'9')* ;

This fixed the failing gUnit tests, but broke an unrelated jUnit test that tests the "port" parameter. The following grammar deals with the "port=8080" element of the config snippet above:

configurationStatement[MiddlemanConfiguration config]
    :   PORT EQ port=NUMBER { config.setConfigurationPort(Integer.parseInt(port.getText())); }
    |   def=proxyDefinition { config.add(def); }
    ;

The message I get is:

mismatched input '8080' expecting NUMBER

Where NUMBER is defined as NUMBER : ('0'..'9')+ ;

Moving the rule for NUMBER above the IDENTIFIER rule fixed this issue.
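A simplified way to picture why the order matters: when two lexer rules can match the same text, the rule that appears first in the grammar wins. The Java sketch below models that with regular expressions. It is not how ANTLR 3 actually builds its lexer, and the names `RuleOrderDemo`, `firstToken` and `rules` are invented for the illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RuleOrderDemo {

    /** Returns the name of the rule that wins for the token at the start of the input. */
    static String firstToken(Map<String, Pattern> rules, String input) {
        String winner = null;
        int best = 0;
        for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
            Matcher m = rule.getValue().matcher(input);
            // lookingAt() anchors the match at the start of the input.
            // Only a strictly longer match replaces the current winner,
            // so on a tie the rule declared earlier is kept.
            if (m.lookingAt() && m.end() > best) {
                best = m.end();
                winner = rule.getKey();
            }
        }
        return winner;
    }

    /** Builds the two rules in either declaration order (regex stand-ins for the lexer rules). */
    static Map<String, Pattern> rules(boolean numberFirst) {
        Pattern number = Pattern.compile("[0-9]+");
        Pattern identifier = Pattern.compile("[0-9a-zA-Z_-][0-9a-zA-Z_-]*");
        Map<String, Pattern> rules = new LinkedHashMap<>();
        if (numberFirst) {
            rules.put("NUMBER", number);
            rules.put("IDENTIFIER", identifier);
        } else {
            rules.put("IDENTIFIER", identifier);
            rules.put("NUMBER", number);
        }
        return rules;
    }

    public static void main(String[] args) {
        System.out.println(firstToken(rules(false), "8080"));
        System.out.println(firstToken(rules(true), "8080"));
    }
}
```

In this model, "8080" is claimed by IDENTIFIER when IDENTIFIER is declared first, and by NUMBER once NUMBER is moved above it, which matches the behaviour I saw in the jUnit test.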

Bart Kiers · Accepted Answer · 2012-07-09T15:21:59.633


IDENTIFIER : DIGIT | LETTER (LETTER | DIGIT | HYPHEN | UNDERSCORE)*;

is equivalent to:

IDENTIFIER 
 : DIGIT 
 | LETTER (LETTER | DIGIT | HYPHEN | UNDERSCORE)*
 ;

So, an IDENTIFIER is either a single DIGIT, or starts with a LETTER followed by (LETTER | DIGIT | HYPHEN | UNDERSCORE)*.
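To see the effect of that grouping outside ANTLR, here is a minimal Java sketch that transliterates the rule into a `java.util.regex` pattern. The class and method names (`AlternationDemo`, `matchesOriginal`) are made up for illustration, and a regex is only an approximation of what the ANTLR lexer does, but the low precedence of `|` works the same way:

```java
import java.util.regex.Pattern;

final class AlternationDemo {

    // Regex transliteration of: DIGIT | LETTER (LETTER | DIGIT | HYPHEN | UNDERSCORE)*
    // The '|' splits the whole rule, so the first branch is one digit and nothing else.
    static final Pattern ORIGINAL = Pattern.compile("[0-9]|[a-zA-Z][a-zA-Z0-9_-]*");

    /** True if the whole string matches the original rule. */
    static boolean matchesOriginal(String s) {
        return ORIGINAL.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"5", "foobar", "foo-bar2", "5foobar"}) {
            System.out.println(s + " -> " + matchesOriginal(s));
        }
    }
}
```

Under this transliteration `5` and `foobar` match while `5foobar` does not, which mirrors the failing gUnit case.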

You probably meant:

IDENTIFIER 
 : (DIGIT | LETTER | UNDERSCORE) (LETTER | DIGIT | HYPHEN | UNDERSCORE)*
 ;

However, that also allows 3---3 as a valid IDENTIFIER. Is that correct?
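If you want to check quickly which names the rule with the parenthesised first character admits, a regex transliteration in Java is one option. This is only a sketch: the regex stands in for the lexer rule, and `CorrectedIdentifierDemo` and `matchesCorrected` are hypothetical names, not part of ANTLR or of your grammar.

```java
import java.util.regex.Pattern;

final class CorrectedIdentifierDemo {

    // Regex transliteration of:
    // (DIGIT | LETTER | UNDERSCORE) (LETTER | DIGIT | HYPHEN | UNDERSCORE)*
    static final Pattern CORRECTED = Pattern.compile("[0-9a-zA-Z_][a-zA-Z0-9_-]*");

    /** True if the whole string matches the corrected rule. */
    static boolean matchesCorrected(String s) {
        return CORRECTED.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"5foobar", "_foo-bar3", "3---3", "foo@bar", "foo bar"};
        for (String s : samples) {
            System.out.println(s + " -> " + matchesCorrected(s));
        }
    }
}
```

With this pattern `5foobar`, `_foo-bar3` and `3---3` are all accepted, while `foo@bar` and `foo bar` are rejected.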

edited Jul 09 '12 at 15:21

answered Jul 09 '12 at 14:44

Bart Kiers



He indicated that `_foo-bar3` is a valid identifier, so the underscore would need to be added as an allowable first character. Also, while ANTLR 4 will not have any performance problems separating the rule into fragments, ANTLR 3 will perform much better if you simply use `'-'` instead of `HYPHEN`, `'0'..'9'` instead of `DIGIT`, etc. – Sam Harwell Jul 09 '12 at 14:50
@280Z28, ah yes, you're right. I only looked at the one gUnit test case the OP mentioned. I added `UNDERSCORE` at the start of the `IDENTIFIER` rule. – Bart Kiers Jul 09 '12 at 15:23
Hi @Bart @280Z28, I've changed the grammar to: `IDENTIFIER : ('0'..'9'| 'a'..'z'|'A'..'Z' | '_'|'-') ('_'|'-'|'a'..'z'|'A'..'Z'|'0'..'9')* ;` and it now matches correctly. However, it has broken "port=8080"; those tests are now failing. I'll edit the question to reflect this. – user1512122 Jul 09 '12 at 16:02
