Antlr4 actions and predicates in lexer fragments

Question

I am trying to create lexer rules for a dynamically determined batch separator in Antlr4. This supports two use cases:

- Different database systems define their own batch separators (e.g. 'go', ';', '/').
- I also want to allow user-defined batch separators, which may be up to 2 characters long and could potentially be anything, but for this example let's assume they are ASCII characters.

So, for the purposes of this example, batch separators are any character string that is on its own line by itself and matches the currently known batch separator. There are several more complicated alternatives, but I want to keep this simple because the question is about actions and semantic predicates in lexer fragments, not about batch separators.

So assume that I have already defined a lexer rule called ALPHA that matches any upper- or lower-case letter. Also, assume I am only trying to match '\r'?'\n'<some-upto-two-char-string>'\r'?'\n', i.e. a batch separator on its own line with no other whitespace to contend with.

I define the following lexer rules:

    BATCH_SEPARATOR:
     NEWLINE ALPHA (ALPHA)? NEWLINE
    ;
    NEWLINE: '\r'?'\n';

This rule works well for most situations, except it does not account for dynamically matching the input batch separator candidate to the valid value of batch separator. So, although it will successfully lex 'go', ';', etc. it will incorrectly lex 'IN', 'AS', etc. as batch separators when they appear on their own line as part of a SELECT or CREATE FUNCTION statement.
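
The over-matching can be reproduced outside the lexer. Below is a minimal sketch in plain Java that mirrors the static rule with a regular expression; it assumes ALPHA is equivalent to `[A-Za-z]`, and the class and method names are hypothetical, not part of the grammar.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the regex mirrors NEWLINE ALPHA (ALPHA)? NEWLINE,
// assuming ALPHA is [A-Za-z].
final class StaticSeparatorRuleDemo {
    // '\r'? '\n' ALPHA ALPHA? '\r'? '\n'
    static final Pattern RULE = Pattern.compile("\\r?\\n[A-Za-z][A-Za-z]?\\r?\\n");

    // True when the whole input has the shape the static rule accepts.
    static boolean looksLikeSeparator(String input) {
        return RULE.matcher(input).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"\r\nGO\r\n", "\r\nIN\r\n", "\r\nAS\r\n"}) {
            System.out.println(s.trim() + " -> " + looksLikeSeparator(s));
        }
    }
}
```

The shape alone cannot tell 'GO' from 'IN' or 'AS', which is why the rule needs to consult the currently configured separator.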

So now I take the next step of checking for the actual character match and define a method called isValidBatchSeparator() in the @lexer::members section. This method essentially compares the known batch separator (as established by the application) to _input.LA(-1) and _input.LA(1), and looks something like the following (simplified):

    private char[] _batchSeparator; // assume it is already set and has at least one element

    public boolean isValidBatchSeparator() {
        // LA() returns an int; all comparisons ignore case.
        boolean first = Character.toLowerCase(_batchSeparator[0]) == Character.toLowerCase(_input.LA(-1));
        if (_batchSeparator.length > 1) {
            // Two-character separator: also check the second character.
            return first
                && Character.toLowerCase(_batchSeparator[1]) == Character.toLowerCase(_input.LA(1));
        }
        // One-character separator: only the first character matters.
        return first;
    }
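
For reference, here is a minimal sketch of how the `LA` offsets line up around the separator text. `LookaheadDemo` is a hypothetical stand-in written in plain Java, not ANTLR's own stream class; it only models the convention that `LA(1)` is the next character to be consumed and `LA(-1)` is the most recently consumed one. Where exactly the generated lexer is positioned when the predicate fires is an assumption here (taken to be right after both letters have been consumed).

```java
// Hypothetical stand-in for a character stream; only consume() and LA() are modeled.
final class LookaheadDemo {
    static final int EOF = -1;
    private final String data;
    private int index = 0; // index of the next character to be consumed

    LookaheadDemo(String data) {
        this.data = data;
    }

    // Builds a stream over 'data' and consumes the first 'count' characters.
    static LookaheadDemo afterConsuming(String data, int count) {
        LookaheadDemo stream = new LookaheadDemo(data);
        for (int i = 0; i < count; i++) {
            stream.consume();
        }
        return stream;
    }

    void consume() {
        index++;
    }

    // LA(1) is the next unconsumed character, LA(-1) the last consumed one.
    int LA(int i) {
        if (i == 0) {
            return 0; // offset 0 is undefined
        }
        int p = i > 0 ? index + i - 1 : index + i;
        return (p < 0 || p >= data.length()) ? EOF : data.charAt(p);
    }

    public static void main(String[] args) {
        // Position: "\r\n" and "GO" consumed, trailing "\r\n" still ahead.
        LookaheadDemo s = afterConsuming("\r\nGO\r\n", 4);
        System.out.println("LA(-2) is 'G': " + (s.LA(-2) == 'G'));
        System.out.println("LA(-1) is 'O': " + (s.LA(-1) == 'O'));
        System.out.println("LA(1) is CR:   " + (s.LA(1) == '\r'));
    }
}
```

Under that assumption, the two letters of the separator sit at offsets -2 and -1, and offset 1 is already the start of the trailing line break.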

So now I write my lexer rules as:

    BATCH_SEPARATOR:
      NEWLINE BATCH_SEP_INNER NEWLINE
    ;
    fragment BATCH_SEP_INNER:
      ALPHA (ALPHA)? {isValidBatchSeparator()}?
    ;

This appears to transpile correctly.

I am able to step through the code and verify that the semantic predicate is indeed being entered and that the method returns the correct value. However, an input like '\r\nGO\r\n' doesn't get lexed as a BATCH_SEPARATOR. Instead, the string falls through from the BATCH_SEPARATOR rule and is caught by an IDENTIFIER rule, defined later in the grammar, that generically matches a run of characters. So apparently, semantic predicates applied to fragments do not behave the same as semantic predicates applied to non-fragment lexer rules.

So I take away fragment from the lexer rule definition and make BATCH_SEP_INNER a first-class citizen, but again my lexer semantic predicate fails me: even though the semantic predicate code appears to kick in and return correct values, I still see '\r\nGO\r\n' lexed as IDENTIFIER (not even BATCH_SEP_INNER).

I tried some other things, like applying the semantic predicate to BATCH_SEPARATOR instead of BATCH_SEP_INNER. The problem here is that _input.LA(-1) and _input.LA(1) now correspond to '\r' and '\n', and there is no clean way to get to the ASCII characters that actually represent the batch separator, especially in more complicated situations where there is also whitespace or there are several NEWLINEs before the separator.

Thus, applying this semantic predicate to BATCH_SEPARATOR would always fail to match and my input string won't get lexed correctly.
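
One way to avoid counting lookahead offsets, shown as a minimal sketch only, is to compare the text of the whole candidate match rather than individual `LA` characters. The helper below is plain Java with hypothetical names. Feeding it the lexer's `getText()` from a predicate at the end of BATCH_SEPARATOR is an assumption about the wiring and is not exercised here.

```java
// Hypothetical helper: decides whether a candidate match is the configured separator.
final class SeparatorText {
    // Strips surrounding line breaks and blanks, then compares ignoring case.
    static boolean matches(String candidateText, String separator) {
        return candidateText.trim().equalsIgnoreCase(separator);
    }

    public static void main(String[] args) {
        System.out.println(matches("\r\nGO\r\n", "go"));
        System.out.println(matches("\r\nIN\r\n", "go"));
    }
}
```

Because the comparison trims the candidate first, extra NEWLINEs or blanks around the separator do not change the result.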

I also tried splitting isValidBatchSeparator() into two steps: an action stores the output of the method in a variable, and a semantic predicate applied to BATCH_SEPARATOR then uses that variable. Something like this:

    BATCH_SEPARATOR:
      (NEWLINE BATCH_SEP_INNER NEWLINE) {_isValidBatchSeparator}?
    ;

    fragment BATCH_SEP_INNER:
      {_isValidBatchSeparator = isValidBatchSeparator();}
      (ALPHA (ALPHA)?)
    ;

If I do that, I get a warning that an action has been defined in a fragment and hence won't ever run. Obviously, making it non-fragment breaks a lot more things than it fixes, because as a non-fragment rule BATCH_SEP_INNER matches any one- or two-letter string and breaks many, many things.

So as a last resort I try something clever:

    BATCH_SEPARATOR:
      (NEWLINE BATCH_SEP_INNER NEWLINE) {_isValidBatchSeparator}?
    ;

    BATCH_SEP_INNER:
      {_isValidBatchSeparator = isValidBatchSeparator();}
      (ALPHA (ALPHA)?)
      {_isValidBatchSeparator}?
    ;

In the last step, I'm intending to make BATCH_SEP_INNER run the action, but disable the lexer rule after having done that, if the batch separator is not valid. However, the transpiled Lexer actually skips the entire action. I can see corresponding code getting generated, but the code path is never traversed.

So now I'm out of ideas and looking to this forum for help :)

Things to clarify: _although it will successfully lex 'go', ';', etc._ -> does ALPHA match a semicolon? _it will incorrectly lex 'IN', 'AS'_ -> why? Did you run with the `-tokens` option to see how the lexer interprets the input? _'\r\nGO\r\n' lexed as IDENTIFIER_ -> at first glance it seems you have a lexer ambiguity problem, see [disambiguate](https://stackoverflow.com/questions/46267980/how-does-the-antlr-lexer-disambiguate-its-rules-or-why-does-my-parser-produce) and [also here](https://stackoverflow.com/questions/46606030/antlr4-unexpectedly-stops-parsing-expression/46615203#46615203). - BernardK, Oct 25 '17 at 07:03
... and please provide a few lines of input and how you want them to be interpreted. Wouldn't it be easier to have a parser rule `batch_separator : {_isValidBatchSeparator}? NL ID NL;`? - BernardK, Oct 25 '17 at 08:24
@BernardK, I know I haven't done a good job of describing my problem. I have inherited an old Antlr2 grammar that needs to be converted to Antlr4, so I assumed that lexer rules in the prior version should map to lexer rules in the new grammar. That is true in general, but as you point out, there is probably benefit in exploring doing this through parser rules. Thanks for the pointer. - Cod.ie, Oct 26 '17 at 02:10
