
I'm trying to create a new rule in the R grammar for Raw Strings.

Quote from the R NEWS:

There is a new syntax for specifying raw character constants similar to the one used in C++: r"(...)" with ... any character sequence not containing the sequence )". This makes it easier to write strings that contain backslashes or both single and double quotes. For more details see ?Quotes.

Examples:

## A Windows path written as a raw string constant:
r"(c:\Program files\R)"

## More raw strings:
r"{(\1\2)}"
r"(use both "double" and 'single' quotes)"
r"---(\1--)-)---"

However, I'm unsure whether a grammar file alone is enough to implement this rule. So far I have tried something like the following, based on older suggestions for similar grammars:

Parser:

|   RAW_STRING_LITERAL #e42

Lexer:

RAW_STRING_LITERAL
        : ('R' | 'r') '"' ( '\\' [btnfr"'\\] | ~[\r\n"]|LETTER )* '"' ; 

Any hints or suggestions are appreciated.

R ANTLR Grammar:

https://github.com/antlr/grammars-v4/blob/master/r/R.g4

Original R Grammar in Bison:

https://svn.r-project.org/R/trunk/src/main/gram.y

Marcel

1 Answer


To match the start and end delimiters against each other, you will have to use target-specific code. In Java that could look like this:

@lexer::members {
  boolean closeDelimiterAhead() {
    // Get the part between `r"` and `(`
    String delimiter = getText().substring(2, getText().indexOf('('));

    // Construct the end of the raw string
    String stopFor = ")" + delimiter + "\"";

    for (int n = 1; n <= stopFor.length(); n++) {
      if (this._input.LA(n) != stopFor.charAt(n - 1)) {
        // No end ahead yet
        return false;
      }
    }

    return true;
  }
}

RAW_STRING
 : [rR] '"' ~[(]* '(' ( {!closeDelimiterAhead()}? . )* ')' ~["]* '"'
 ;

which tokenizes r"---( )--" )----" )---" as a single RAW_STRING.
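To check the delimiter logic without generating a lexer, the same idea can be sketched in plain Java. This is only an illustration under stated assumptions: the class `RawStringScanner` and the method `endOfRawString` are hypothetical names that are not part of ANTLR or the R grammar, and like the rule above the sketch only handles the `(` ... `)` pair.

```java
// Hypothetical helper, not part of ANTLR: mirrors the predicate's delimiter logic.
final class RawStringScanner {
    private RawStringScanner() {}

    /**
     * Returns the index just past the closing quote of the raw string that
     * starts at {@code start}, or -1 if no complete raw string starts there.
     */
    static int endOfRawString(String input, int start) {
        // A raw string must begin with r" or R"
        if (start < 0 || start + 1 >= input.length()) {
            return -1;
        }
        char prefix = input.charAt(start);
        if ((prefix != 'r' && prefix != 'R') || input.charAt(start + 1) != '"') {
            return -1;
        }

        // Everything between the opening quote and the first '(' is the delimiter
        int open = input.indexOf('(', start + 2);
        if (open < 0) {
            return -1;
        }
        String delimiter = input.substring(start + 2, open);

        // Same closing sequence as built in closeDelimiterAhead()
        String stopFor = ")" + delimiter + "\"";

        // The first occurrence of the closing sequence ends the raw string
        int close = input.indexOf(stopFor, open + 1);
        return close < 0 ? -1 : close + stopFor.length();
    }
}
```

Calling `RawStringScanner.endOfRawString(text, 0)` on the example above should return the length of the whole input, because the shorter and longer dash runs do not match the opening delimiter.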

EDIT

And since the delimiters can only consist of hyphens (plus one pair of parentheses, braces, or brackets) and not just any arbitrary character, this should do it as well, without a predicate:

RAW_STRING
 : [rR] '"' INNER_RAW_STRING '"'
 ;

fragment INNER_RAW_STRING
 : '-' INNER_RAW_STRING '-'
 | '(' .*? ')'
 | '{' .*? '}'
 | '[' .*? ']'
 ;
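The dash-balancing requirement behind this fragment can also be sketched in plain Java, which may help when writing test cases for the grammar. This is a rough sketch, not the lexer's actual algorithm: `DashRawString` and `content` are hypothetical names, and the method assumes its argument is exactly one raw string literal with nothing after the closing quote.

```java
// Hypothetical helper, not part of ANTLR or the R grammar.
final class DashRawString {
    private DashRawString() {}

    /**
     * Returns the text between the delimiters of a single raw string literal,
     * or null if the literal is malformed (for example, unequal dash counts).
     */
    static String content(String literal) {
        if (literal.length() < 2) {
            return null;
        }
        char prefix = literal.charAt(0);
        if ((prefix != 'r' && prefix != 'R') || literal.charAt(1) != '"') {
            return null;
        }

        // Count the dashes between the opening quote and the opening bracket
        int i = 2;
        while (i < literal.length() && literal.charAt(i) == '-') {
            i++;
        }
        int dashes = i - 2;
        if (i >= literal.length()) {
            return null;
        }

        // Pick the closing bracket that pairs with the opening one
        char close;
        switch (literal.charAt(i)) {
            case '(': close = ')'; break;
            case '{': close = '}'; break;
            case '[': close = ']'; break;
            default: return null;
        }

        // Closing sequence: bracket, the same number of dashes, then the quote
        StringBuilder closing = new StringBuilder().append(close);
        for (int n = 0; n < dashes; n++) {
            closing.append('-');
        }
        closing.append('"');

        int end = literal.indexOf(closing.toString(), i + 1);
        if (end < 0 || end + closing.length() != literal.length()) {
            return null;
        }
        return literal.substring(i + 1, end);
    }
}
```

For instance, `DashRawString.content` applied to the literal `r"---(\1--)-)---"` should yield `\1--)-`, while `r"-(x)"` is rejected because the closing side has no dash.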
Bart Kiers
  • Hello Bart. Thanks for the fast answer. I will try that tomorrow. Just two questions. You additionally added SPACES. But is it used anywhere for the raw string? Can the target-specific code also be externalized in a listener? – Marcel Nov 02 '20 at 18:55
  • @Marcel "But is it used anywhere for the raw string?" No, it was just for my own test. "Can the target-specific code also be externalized in a listener?" No, unfortunately not. – Bart Kiers Nov 02 '20 at 21:05
  • Hello Bart. Works fine so far! Thanks for your help. Just one further question. Is this rule not context sensitive? It seems that I don't have to declare 'r"' and 'R"' at the beginning of the rule according to this explanation: https://stackoverflow.com/questions/5126779/parsing-context-sensitive-language – Marcel Nov 03 '20 at 10:39
  • Are the delimiters always a fixed (possibly repeated) character, like `-`, `----` etc.? If that is the case, then there is probably a way to match it without the predicate (which is what you're asking, I believe: a way to match the token without the predicate) – Bart Kiers Nov 03 '20 at 12:14
  • From the documentation: Raw character constants are also available using a syntax similar to the one used in C++: r"(...)" with ... any character sequence, except that it must not contain the closing sequence )". The delimiter pairs [] and {} can also be used, and R can be used in place of r. For additional flexibility, a number of dashes can be placed between the opening quote and the opening delimiter, as long as the same number of dashes appear between the closing delimiter and the closing quote. – Marcel Nov 03 '20 at 14:21
  • Yeah, then it should be possible without a predicate. Will have a look at this in a couple of hours. – Bart Kiers Nov 03 '20 at 16:44
  • @Marcel checkout my **EDIT** – Bart Kiers Nov 03 '20 at 16:57
  • Great, thanks. Should this delimiter rule also be added to the fragment, as noted in the docs: | '[' .*? ']' ? – Marcel Nov 03 '20 at 17:59
  • Oh yeah, missed that one. – Bart Kiers Nov 03 '20 at 18:49