Parse arbitrary delimiter character using Antlr4

Question

I am trying to create a grammar in Antlr4 that accepts regular expressions delimited by an arbitrary character (similar to Perl). How can I achieve this?

To be clear: my problem is not the regular expression itself (which I do not handle in Antlr at all, but in the visitor), but the delimiter characters. I can easily define the following rules in the lexer:

REGEXP: '/' (ESC_SEQ | ~('\\' | '/'))+ '/' ;
fragment ESC_SEQ: '\\' . ;

This uses the forward slash as the delimiter (as is common in Perl). However, I also want to be able to write a regular expression as m~regexp~ (which is also possible in Perl).

If I had to solve this using a regular expression itself, I would use a backreference like this:

m(.)(.+?)\1

(that is, an "m", followed by an arbitrary character, followed by the expression, followed by the same arbitrary character). But backreferences do not seem to be available in Antlr4.
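For reference, the backreference idea described above can be tried out directly in plain Java with java.util.regex. This is only an illustration of the pattern from the question, not something ANTLR can consume; the class name BackrefDemo and the helper body are made up for this sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: the backreference pattern from the question, run with java.util.regex.
class BackrefDemo {

    // 'm', any single delimiter character (group 1), a lazily matched body (group 2),
    // then the same delimiter again via the backreference \1.
    static final Pattern DELIMITED = Pattern.compile("m(.)(.+?)\\1");

    // Returns the body between the delimiters, or null if the whole input does not match.
    static String body(String input) {
        Matcher m = DELIMITED.matcher(input);
        return m.matches() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(body("m~foo~"));  // foo
        System.out.println(body("m#a~b#"));  // a~b
        System.out.println(body("m~foo#"));  // null (closing character differs)
    }
}
```

Note that this pattern knows nothing about escaped delimiters or bracket pairs, which is part of why a lexer-level solution is still needed.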

It would be even better if I could use pairs of brackets, i.e. m(regexp) or m{regexp}. But since the number of possible bracket types is quite small, this could be solved by simply enumerating all the variants.

Can this be solved with Antlr4?

score 4 · Accepted Answer · edited May 23 '17 at 11:58

You could do something like this:

lexer grammar TLexer;

REGEX
 : REGEX_DELIMITER ( {getText().charAt(0) != _input.LA(1)}? REGEX_ATOM )+ {getText().charAt(0) == _input.LA(1)}? .
 | '{' REGEX_ATOM+ '}'
 | '(' REGEX_ATOM+ ')'
 ;

ANY
 : .
 ;

fragment REGEX_DELIMITER
 : [/~@#]
 ;

fragment REGEX_ATOM
 : '\\' .
 | ~[\\]
 ;

If you run the following class:

import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.Token;

public class Main {

  public static void main(String[] args) throws Exception {

    TLexer lexer = new TLexer(new ANTLRInputStream("/foo/ /bar\\ ~\\~~ {mu} (bla("));

    for (Token t : lexer.getAllTokens()) {
      System.out.printf("%-20s %s\n", TLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText().replace("\n", "\\n"));
    }
  }
}

you will see the following output:

REGEX                /foo/
ANY                   
ANY                  /
ANY                  b
ANY                  a
ANY                  r
ANY                  \
ANY                   
REGEX                ~\~~
ANY                   
REGEX                {mu}
ANY                   
ANY                  (
ANY                  b
ANY                  l
ANY                  a
ANY                  (

The {...}? construct is called a semantic predicate.

The ( {getText().charAt(0) != _input.LA(1)}? REGEX_ATOM )+ part tells the lexer to continue matching characters as long as the character matched by REGEX_DELIMITER is not next in the character stream. And {getText().charAt(0) == _input.LA(1)}? . makes sure the token actually ends with a closing delimiter that equals its first character (the one matched by REGEX_DELIMITER).
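To make the effect of the two predicates easier to follow, here is a rough plain-Java sketch of the same scanning logic for the first REGEX alternative. It is not how ANTLR evaluates the rule internally, just a hand-written loop that accepts the same strings; the class name DelimiterScan and the method scan are hypothetical.

```java
// Plain-Java sketch of what the predicated REGEX alternative accepts; not ANTLR code.
class DelimiterScan {

    // Returns the index just past the closing delimiter of a literal starting at
    // 'start', or -1 if no non-empty, properly closed literal starts there.
    static int scan(String input, int start) {
        if (start >= input.length()) return -1;
        char delimiter = input.charAt(start);           // role of getText().charAt(0)
        if ("/~@#".indexOf(delimiter) < 0) return -1;   // REGEX_DELIMITER : [/~@#]
        int i = start + 1;
        int atoms = 0;
        // First predicate: keep consuming atoms while the delimiter is not ahead.
        while (i < input.length() && input.charAt(i) != delimiter) {
            if (input.charAt(i) == '\\') {
                if (i + 1 >= input.length()) return -1; // dangling backslash
                i += 2;                                 // REGEX_ATOM : '\\' .
            } else {
                i += 1;                                 // REGEX_ATOM : ~[\\]
            }
            atoms++;
        }
        // Second predicate: the delimiter must be next, after at least one atom.
        if (atoms == 0 || i >= input.length()) return -1;
        return i + 1;
    }

    public static void main(String[] args) {
        String input = "/foo/ /bar\\ ~\\~~ {mu} (bla(";
        System.out.println(input.substring(0, scan(input, 0)));    // /foo/
        System.out.println(scan(input, 6));                        // -1 (no closing slash)
        System.out.println(input.substring(12, scan(input, 12)));  // ~\~~
    }
}
```

The three calls correspond to the /foo/, /bar\ and ~\~~ pieces of the test input above: the first and third are recognised as delimited literals, the second is not, which matches the REGEX and ANY tokens in the lexer output.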

Tested with ANTLR 4.5.3

EDIT

To support a delimiter that is preceded by m and some optional spaces, you could try something like this (untested!):

lexer grammar TLexer;

  @lexer::members {
    boolean delimiterAhead(String start) {
      return start.replaceAll("^m[ \t]*", "").charAt(0) == _input.LA(1);
    }
  }

  REGEX
   : '/' ( '\\' . | ~[/\\] )+ '/'
   | 'm' SPACES? REGEX_DELIMITER ( {!delimiterAhead(getText())}? ( '\\' . | ~[\\] ) )+ {delimiterAhead(getText())}? .
   | 'm' SPACES? '{' ( '\\' . | ~[}\\] )+ '}'
   | 'm' SPACES? '(' ( '\\' . | ~[)\\] )+ ')'
   ;

  ANY
   : .
   ;

  fragment REGEX_DELIMITER
   : [~@#]
   ;

  fragment SPACES
   : [ \t]+
   ;
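Since the grammar above is untested, it may help to check the helper in isolation. The sketch below copies the expression from the members block into a standalone class and passes the lookahead character in as a parameter instead of reading _input.LA(1); the class name DelimiterAheadDemo and the lookahead parameter are assumptions made for this example.

```java
// Standalone check of the delimiterAhead expression; the lookahead is passed in
// explicitly here, whereas the lexer version reads it from _input.LA(1).
class DelimiterAheadDemo {

    // 'start' is the text matched so far, e.g. "m ~fo". The leading "m" and any
    // spaces or tabs are stripped, so charAt(0) is the opening delimiter.
    // Assumes the delimiter has already been matched, otherwise charAt(0) throws.
    static boolean delimiterAhead(String start, int lookahead) {
        return start.replaceAll("^m[ \t]*", "").charAt(0) == lookahead;
    }

    public static void main(String[] args) {
        System.out.println(delimiterAhead("m ~fo", 'o'));   // false: keep consuming
        System.out.println(delimiterAhead("m ~foo", '~'));  // true: closing delimiter ahead
        System.out.println(delimiterAhead("m\t#x", -1));    // false: -1 stands for end of input
    }
}
```

Because the lookahead is compared as an int, an end-of-input value of -1 never equals a delimiter, so an unterminated literal simply fails the closing predicate.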

This looks quite promising. However, there is still a small problem: If I want to mimic Perl's syntax, then a regex with a delimiter other than a slash _must_ be preceded by the character "m" (and optional spaces). So the following should be valid: `m ~foo~` But then I can no longer use `getText().charAt(0)` because the delimiter is not at position 0. Because of the optional spaces, it is also not at position 1. In fact, I cannot predict the position at all, so how could this be rewritten? — chschroe, Jul 07 '16 at 00:53
I had the idea to use an action to store the delimiter in a local variable, but local variables are not allowed in lexer rules. — chschroe, Jul 07 '16 at 00:55
Great trick! I did some minor modifications, but followed the basic idea. One last question remaining: By using a custom function, I seem to be bound to a single target language (Java in this case). Which would be the right approach if I wanted to support multiple target languages (e.g. JavaScript and Java)? — chschroe, Jul 07 '16 at 16:56
If you're retargetting for another language, there's no other way than rewriting the code in the `@members` block and in the predicate blocks: `{...}?` — Bart Kiers, Jul 07 '16 at 20:03
