ANTLR lexer exclude string

Question

Hi everyone,

I'm trying to build a lexer to parse a domain-specific language. I have a set of reserved tokens (fragment RESERVED) and an escape character. The lexer should split the input whenever a reserved token appears that is not escaped.

a simplified example:

SEP: ';';
AND: '&&';

fragment ESCAPE: '/';    
fragment RESERVED: SEP | AND | ESCAPE;

SINGLETOKEN : (~(RESERVED) | (ESCAPE RESERVED))+;

problem:

This works fine as long as RESERVED only contains single-character tokens, because the negation operator ~ only works on single characters.

Unfortunately, I need it to work with string tokens as well, i.e. tokens with more than one character (see AND in the example). Is there a simple way to do this? I need to solve the problem without inlining Java or C code, since the grammar has to compile to different target languages and I don't want to maintain separate copies.
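To make the intended behaviour concrete, here is a minimal sketch in plain Python (not ANTLR, so it does not address the portability requirement) of the splitting rule described above, using the tokens from the simplified example. The names split_reserved and match_reserved are illustrative only, and the handling of a lone, unescaping '/' (emitted as its own token) is an assumption.

```python
# Reserved tokens from the simplified example; multi-character tokens are
# listed first so that the longest alternative is tried first.
RESERVED = ["&&", ";", "/"]
ESCAPE = "/"


def match_reserved(text, pos):
    """Return the reserved token that starts at pos, or None."""
    for tok in RESERVED:
        if text.startswith(tok, pos):
            return tok
    return None


def split_reserved(text):
    """Split text at every unescaped reserved token.

    An ESCAPE followed by a reserved token stays inside the current chunk.
    """
    tokens, chunk, pos = [], [], 0
    while pos < len(text):
        if text[pos] == ESCAPE:
            escaped = match_reserved(text, pos + 1)
            if escaped is not None:
                # ESCAPE RESERVED: keep both parts in the current chunk
                chunk.append(ESCAPE + escaped)
                pos += 1 + len(escaped)
                continue
        tok = match_reserved(text, pos)
        if tok is not None:
            # Unescaped reserved token: flush the chunk, emit the token
            if chunk:
                tokens.append("".join(chunk))
                chunk = []
            tokens.append(tok)
            pos += len(tok)
        else:
            chunk.append(text[pos])
            pos += 1
    if chunk:
        tokens.append("".join(chunk))
    return tokens
```

With this sketch, split_reserved("a;b&&c") gives ['a', ';', 'b', '&&', 'c'], while a single '&' or an escaped '/;' stays inside its chunk. This is the behaviour that ~ cannot express for the two-character AND.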

I hope someone can help me

sample input from the whole script

create;false;false;1.key = bla; trig;true;false;(1.key1 ~ .*thisIsRegex || 2.oldKey1 €) && (1.bla=2.blub || 1.blub=bla);

After the lexer, this should look like the following (| marks the token separators; whitespace is not important):
|create|;|false|;|false|;|1.|key| = |bla|;| trig|;|true|;|false|;|(|1.|key1| ~| .*thisIsRegex| || |2.|oldKey1| €|)| && |(|1.|bla|=|2.|blub| || |1.|blub|=|bla|)|;|

The whole script can be found at http://pastebin.com/Cz520VW4 (note: this link expires in a month). It does not work for the regex part yet.

possible but horrible solution

I found a possible solution, but it's really hacky and makes the script more error-prone, so I would prefer to find something cleaner.

What I'm currently doing is writing the negation (~RESERVED) by hand.

SEP: ';';
AND: '&&';

fragment ESCAPE: '/';    
fragment RESERVED: SEP | AND | ESCAPE;

NOT_RESERVED
   :  '&' ~('&' | SEP | ESCAPE)
   // '&' followed by a character that is neither '&' nor another reserved character
   |  ~('&' | SEP | ESCAPE) ~(SEP | ESCAPE)
   // a character other than '&', SEP or ESCAPE, followed by any character except SEP or ESCAPE
   ;
SINGELTON : (NOT_RESERVED | (ESCAPE RESERVED))+;

The real script has more than 5 multi-character tokens, and there might be more later with more than 2 characters, so solving the problem this way will become quite complicated.

Added a link to the whole script and an example input. Sorry if it's messy; it's my first ever post on Stack Overflow, and I need some time to get used to the editor and all the features. — user1549692, Aug 15 '12 at 21:44

score 2 · Answer 1 · edited May 23 '17 at 12:20

It currently does not work for the regex part yet ...

That is because you've declared regex literals to be just about anything. What if a regex literal starts with a reserved token, like this: 1.key1 ~ falser?

In short, I don't recommend you implement your lexer the way you're trying to do it now. Instead, do as almost every programming language implements regex-literals: let them be encapsulated by delimiters (or quoted strings):

1.key1 ~ /falser/

or:

1.key1 ~ "falser"
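To illustrate why delimiters help, here is a small Python sketch (not part of the grammar) that reads a double-quoted literal. Because the closing quote marks the end, reserved tokens such as ; or && can appear inside the literal without ending it. The function name read_quoted and the use of a backslash to escape characters inside the quotes are assumptions made for this example only.

```python
import re

# A quoted literal: an opening quote, then any run of characters that are
# neither a quote nor a backslash, or backslash-escaped characters, then
# the closing quote. Group 1 captures the body without the quotes.
QUOTED = re.compile(r'"((?:[^"\\]|\\.)*)"')


def read_quoted(text, pos=0):
    """Return (body, end) for the quoted literal starting at pos, or None."""
    m = QUOTED.match(text, pos)
    if m is None:
        return None
    return m.group(1), m.end()
```

For example, read_quoted('"a;b&&c"') returns ('a;b&&c', 8): the reserved tokens inside the quotes do not split the literal.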

You could add a flag in your lexer that is flipped whenever a ~ is encountered, and depending on that flag, create a regex literal. Here's a small demo of how to do that:

grammar TriggerDescription;

options {
  output=AST;
}

tokens {

 // separator
 SEP = ';';

 // md identifier
 OLDMD   =        '1.';
 NEWMD   =        '2.';

 // boolean
 TRUE    =        'true';
 FALSE   =        'false';

//atoms
 EX     =       '€';
 EQ     =       '=';
 SMEQ   =       '<=';
 GREQ   =       '>=';
 GR     =       '>';
 SM     =       '<';

 // literals
 AND    =        '&&';
 OR     =        '||';
 NOT    =        '!';
 OPENP  =        '(';
 CLOSEP =        ')';

 // token identifier
 TRIGGER = 'TRIGGER';
 REPFLAG = 'REPFLAG';
 LOCALFLAG = 'LOCALFLAG';
 TRIGGERID = 'TRIGGERID';
 EVALUATOR = 'EVALUATOR';
 ROOT = 'ROOT';
}

@lexer::members {
  private boolean regexExpected = false;
}

parse
        : trigger+ EOF -> ^(ROOT trigger+)
          //(t=. {System.out.printf("\%-15s '\%s'\n", tokenNames[$t.type], $t.text);})* EOF
        ;

trigger
        : triggerid SEP repflag SEP exeflag SEP mdEval SEP -> ^(TRIGGER triggerid repflag exeflag mdEval)
        ;

triggerid
        : rs = WORD     -> ^(TRIGGERID[$rs])
        ;

repflag
        : rs = TRUE     -> ^(REPFLAG[$rs])
        | rs = FALSE    -> ^(REPFLAG[$rs])
        ;

exeflag
        : rs = TRUE     -> ^(LOCALFLAG[$rs])
        | rs = FALSE    -> ^(LOCALFLAG[$rs])
        ;

mdEval
        : orExp         -> ^(EVALUATOR orExp)
        ;

orExp
        :  andExp (OR^ andExp)* // Make `||` root
        ;

andExp
        :  notExp (AND^ notExp)* // Make `&&` root
        ;

notExp
        :  (NOT^)*atom // Make `!` root
        ;

atom
        :       key EX^
        |       key MT^ REGEX
        |       key EQ^ (key | WORD)
        |       key GREQ^ (key | WORD)
        |       key SMEQ^ (key | WORD)
        |       key GR^ (key | WORD)
        |       key SM^ (key | WORD)
        |       OPENP orExp CLOSEP -> orExp // removing the parenthesis
        ;      

key     :       OLDMD rs = WORD -> ^(OLDMD[$rs])
        |       NEWMD rs = WORD -> ^(NEWMD[$rs])
        ;





/*------------------------------------------------------------------
 * LEXER RULES
 *------------------------------------------------------------------*/

// chars used for words might need to be extended
fragment CHAR  
        :       'a'..'z' | 'A'..'Z' | '0'..'9'
        ;

// chars used in regex
fragment REG_CHAR
        :       '|' | '[' | '\\' | '^' | '$' | '.' | '?' | '*' | '+' | '(' | ')'
        ;

fragment RESERVED
        :       SEP | ESCAPE | EQ
        ;

// whitespace, tabs, etc.
fragment WS
        :       '\t' | ' ' | '\r' | '\n'| '\u000C'
        ;

fragment ESCAPE
        :       '/'
        ;      

MT
        :       '~' {regexExpected = true;}
        ;

REGEX
@after{regexExpected = false;}
        :       {regexExpected}?=> WS* ~WS+
        ;

LINE_COMMENT
        :       '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
        ;

WHITESPACE
        :       WS+    { $channel = HIDDEN; }
        ;

COMMENT
        :       '/*' .* '*/' {$channel=HIDDEN;}
        ;

WORD    :       CHAR+
        ;

which will create the following AST for the example input you posted:

[image: AST for the example input]

However, realize this is just a quick demo. The way I defined REGEX now is that it will consume any non-space chars it sees. In other words, to end a REGEX, you'd have to place a space directly after it. Adjust my example to suit your own needs.
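The flag mechanism can also be sketched outside ANTLR. The following Python is a rough imitation of the MT and REGEX rules above, not generated code: after a ~ the next token is taken as optional whitespace followed by a run of non-whitespace characters. The function name lex and the catch-all CHAR token type are assumptions for the sketch. Unlike the grammar, which resets the flag in @after only when REGEX matched, the sketch resets it as soon as the REGEX attempt is made.

```python
import re

REGEX = re.compile(r"[ \t\r\n\f]*[^ \t\r\n\f]+")  # WS* ~WS+
WHITESPACE = re.compile(r"[ \t\r\n\f]+")          # WS+
WORD = re.compile(r"[a-zA-Z0-9]+")                # CHAR+


def lex(text):
    """Return a list of (type, text) pairs for the given input."""
    tokens, pos, regex_expected = [], 0, False
    while pos < len(text):
        if regex_expected:
            # Only try REGEX when the flag is set (the gated predicate),
            # and reset the flag right away.
            regex_expected = False
            m = REGEX.match(text, pos)
            if m:
                tokens.append(("REGEX", m.group()))
                pos = m.end()
                continue
        m = WHITESPACE.match(text, pos)
        if m:
            pos = m.end()  # hidden channel in the grammar; dropped here
            continue
        if text[pos] == "~":
            tokens.append(("MT", "~"))
            regex_expected = True  # the next token should be a REGEX
            pos += 1
            continue
        m = WORD.match(text, pos)
        if m:
            tokens.append(("WORD", m.group()))
            pos = m.end()
            continue
        # Everything else becomes a one-character token in this sketch
        tokens.append(("CHAR", text[pos]))
        pos += 1
    return tokens
```

Running lex("k ~ a.*)") yields a REGEX token with the text ' a.*)'. The closing parenthesis is swallowed because no space separates it from the regex, which is the limitation described above.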

Good luck!

PS. By the way, the odd { ... }?=> syntax in the REGEX rule is called a "gated semantic predicate". Learn more about it here: What is a 'semantic predicate' in ANTLR?

Thanks for the help. I rethought the whole thing and designed it in a different way (using semantic predicates). If anyone is interested, I can post the final result. — user1549692, Aug 24 '12 at 11:56