ANTLR grammar: Allow whitespace matching only in template string

Question

I want to parse template strings:

`Some text ${variable.name} and so on ... ${otherVariable.function(parameter)} ...`

Here is my grammar:

varname: VAR ;
variable: varname funParameter? ('.' variable)* ;
templateString: '`' (TemplateStringLiteral* '${' variable '}' TemplateStringLiteral*)+ '`' ;
funParameter: '(' variable? (',' variable)*  ')' ;

WS      : [ \t\r\n\u000C]+ -> skip ;
TemplateStringLiteral: ('\\`' | ~'`') ;
VAR : [$]?[a-zA-Z0-9_]+|[$] ;

When the input is parsed, the template string no longer contains any whitespace because of the WS -> skip rule. When I put TemplateStringLiteral before WS, I get the error:

extraneous input ' ' expecting {'`'}

How can I have whitespace parsed, rather than skipped, only inside the template string?

Related: https://stackoverflow.com/questions/53504903/parse-string-antlr — sepp2k, Apr 17 '19 at 13:19
Martin, I think your question isn't actually about whitespace, since your parsing problems have nothing to do with it (see my answer). I think you could improve the question by focusing it more on the symptoms (the string is not recognized even though the rules seem fine at first glance). Phrased that way, it could help future visitors with similar problems. If you want, I could try to edit your question. Let me know :) — AplusKminus, Apr 18 '19 at 19:46
Would you mind accepting the answer if it solved your problem? Or provide more details if it didn't? — AplusKminus, Apr 28 '19 at 20:31

AplusKminus · Accepted Answer · 2019-04-18T19:41:30.977

What is currently happening

When I run your example through your current grammar and print the generated tokens, the lexer produces this:

[@0,0:0='`',<'`'>,1:0]
[@1,1:4='Some',<VAR>,1:1]
[@2,6:9='text',<VAR>,1:6]
[@3,11:12='${',<'${'>,1:11]
[@4,13:20='variable',<VAR>,1:13]
[@5,21:21='.',<'.'>,1:21]
[@6,22:25='name',<VAR>,1:22]
[@7,26:26='}',<'}'>,1:26]
... shortened ...
[@26,85:84='<EOF>',<EOF>,2:0]

This tells you that Some, which you intended to be matched by TemplateStringLiteral*, was actually lexed as VAR. Why is this happening?

As mentioned in this answer, ANTLR uses the longest possible match to create a token. Since your TemplateStringLiteral rule only matches a single character, while your VAR rule matches arbitrarily many, the lexer picks the latter to match Some.
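The longest-match behaviour can be reproduced with a small Python model. This is only a sketch, not ANTLR itself: the regular expressions are hand translations of the rules in the question, and the rule order (implicit literal tokens first, then WS, TemplateStringLiteral, VAR) is an assumption chosen to agree with the token dump above.

```python
import re

# Rules in priority order. Assumption: the implicit literal tokens of the
# combined grammar rank ahead of the named lexer rules, which follow in the
# order they were written in the question.
RULES = [
    ("'`'", r"`"),
    ("'${'", r"\$\{"),
    ("'}'", r"\}"),
    ("'.'", r"\."),
    ("'('", r"\("),
    ("')'", r"\)"),
    ("','", r","),
    ("WS", r"[ \t\r\n\f]+"),
    ("TemplateStringLiteral", r"\\`|[^`]"),
    ("VAR", r"\$?[a-zA-Z0-9_]+|\$"),
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in RULES]


def tokenize(text):
    """Longest match wins; on a tie, the rule listed first wins."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in COMPILED:
            match = pattern.match(text, pos)
            length = match.end() - pos if match else 0
            if length > best_len:  # strict '>' keeps the earlier rule on ties
                best_name, best_len = name, length
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name != "WS":  # models 'WS -> skip'
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


for token in tokenize("`Some text ${variable.name}`"):
    print(token)
```

In this model Some comes out as a VAR token because VAR consumes four characters where TemplateStringLiteral consumes one, and the spaces disappear because WS is listed before TemplateStringLiteral and wins the one-character tie.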

What you could try (Spoiler: won't work)

You could try to modify the rule like this:

TemplateStringLiteral: ('\\`' | ~'`')+ ;

so that it captures more than one character and is therefore preferred. There are two reasons why this does not work:

1. How would the lexer match anything to the VAR rule, ever?
2. The TemplateStringLiteral rule now also matches ${, which prevents the start of a template chunk from being recognized.
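Both problems show up when the match lengths are compared directly. The following is a rough Python check, with the ANTLR rules hand-translated to regular expressions (an approximation, not the lexer ANTLR generates); match_len and the sample text are made up for the illustration.

```python
import re

# The modified rule (with '+') next to the rules it competes with.
TSL_PLUS = re.compile(r"(?:\\`|[^`])+")
VAR = re.compile(r"\$?[a-zA-Z0-9_]+|\$")
TEMPLATE_START = re.compile(r"\$\{")


def match_len(pattern, text, pos):
    """Number of characters the pattern would consume starting at pos."""
    match = pattern.match(text, pos)
    return match.end() - pos if match else 0


body = "Some text ${variable.name} and so on"
start = body.index("${")

# At the start of the text: the '+' rule against VAR.
print(match_len(TSL_PLUS, body, 0), match_len(VAR, body, 0))
# At '${': the '+' rule against the template start literal.
print(match_len(TSL_PLUS, body, start), match_len(TEMPLATE_START, body, start))
```

In both positions the modified TemplateStringLiteral consumes everything up to the next backtick, so under the longest-match rule neither VAR nor '${' can ever win.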

How to achieve what you actually want

There might be another solution, but this one works:

File MartinCup.g4:

parser grammar MartinCup;

options { tokenVocab=MartinCupLexer; }

templateString
    : BackTick TemplateStringLiteral* (template TemplateStringLiteral*)+ BackTick
    ;

template
    : TemplateStart variable TemplateEnd
    ;

variable
    : varname funParameter? (Dot variable)*
    ;

varname
    : VAR
    ;

funParameter
    : OpenPar variable? (Comma variable)* ClosedPar
    ;

File MartinCupLexer.g4:

lexer grammar MartinCupLexer;

BackTick : '`' ;

TemplateStart
    : '${' -> pushMode(templateMode)
    ;

TemplateStringLiteral
    : '\\`'
    | ~'`'
    ;

mode templateMode;

VAR
    : [$]?[a-zA-Z0-9_]+
    | [$]
    ;

OpenPar : '(' ;
ClosedPar : ')' ;
Comma : ',' ;
Dot : '.' ;

TemplateEnd
    : '}' -> popMode
    ;

This grammar uses lexer modes to differentiate between the inside and the outside of the curly braces. The VAR rule is now only active after ${ has been encountered and only stays active until } is read. It thereby does not catch non-template text like Some.
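To make the mode switching concrete, here is a minimal Python model of the lexer grammar above. It is a sketch under assumptions: the regular expressions are hand translations of the lexer rules, "DEFAULT_MODE" is just the label used here for the rules before the mode line, and the mode stack is an ordinary list.

```python
import re

# (token name, regex, mode action) per mode, in the order of MartinCupLexer.g4.
MODES = {
    "DEFAULT_MODE": [
        ("BackTick", r"`", None),
        ("TemplateStart", r"\$\{", "push"),
        ("TemplateStringLiteral", r"\\`|[^`]", None),
    ],
    "templateMode": [
        ("VAR", r"\$?[a-zA-Z0-9_]+|\$", None),
        ("OpenPar", r"\(", None),
        ("ClosedPar", r"\)", None),
        ("Comma", r",", None),
        ("Dot", r"\.", None),
        ("TemplateEnd", r"\}", "pop"),
    ],
}


def tokenize(text):
    """Longest match within the current mode; the earlier rule wins ties."""
    stack = ["DEFAULT_MODE"]
    tokens = []
    pos = 0
    while pos < len(text):
        best = None
        for name, pattern, action in MODES[stack[-1]]:
            match = re.compile(pattern).match(text, pos)
            if match and (best is None or match.end() > best[1]):
                best = (name, match.end(), action)
        if best is None:
            raise ValueError(f"no rule in {stack[-1]} matches {text[pos]!r}")
        name, end, action = best
        tokens.append((name, text[pos:end]))
        if action == "push":    # models '-> pushMode(templateMode)'
            stack.append("templateMode")
        elif action == "pop":   # models '-> popMode'
            stack.pop()
        pos = end
    return tokens


for token in tokenize("`a ${v.f(p)}`"):
    print(token)
```

Outside the braces every character, including the space, becomes its own TemplateStringLiteral token. VAR tokens only appear between TemplateStart and TemplateEnd.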

Notice that the use of lexer modes requires a split grammar (separate files for parser and lexer grammars). Since no lexer rules are allowed in a parser grammar, I had to introduce tokens for the parentheses, comma, dot and backticks.

About the whitespaces

I assume you want to keep whitespace inside the "normal text", but not allow whitespace inside the templates. Therefore I simply removed the WS rule. You can always re-add it if you like.
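If whitespace inside the templates should be tolerated after all, one option is to re-add the question's WS : [ \t\r\n\u000C]+ -> skip ; rule below the mode templateMode; line, so that it only applies between ${ and }. The effect can be sketched in Python as follows. This is a model under assumptions, not ANTLR: it only covers the text between the braces, and lex_template_body with its with_ws_rule flag is a hypothetical helper for the illustration.

```python
import re

# templateMode rules from the lexer grammar, hand-translated to regexes.
TEMPLATE_MODE_RULES = [
    ("VAR", r"\$?[a-zA-Z0-9_]+|\$"),
    ("OpenPar", r"\("),
    ("ClosedPar", r"\)"),
    ("Comma", r","),
    ("Dot", r"\."),
]
# Assumption: the question's WS rule, re-added under 'mode templateMode;'.
WS_RULE = ("WS", r"[ \t\r\n\f]+")


def lex_template_body(body, with_ws_rule):
    """Tokenize the text between '${' and '}' using templateMode rules only."""
    rules = TEMPLATE_MODE_RULES + ([WS_RULE] if with_ws_rule else [])
    tokens = []
    pos = 0
    while pos < len(body):
        best_name, best_end = None, pos
        for name, pattern in rules:
            match = re.compile(pattern).match(body, pos)
            if match and match.end() > best_end:
                best_name, best_end = name, match.end()
        if best_name is None:
            # ANTLR would report a token recognition error here.
            raise ValueError(f"token recognition error at: {body[pos]!r}")
        if best_name != "WS":  # models '-> skip'
            tokens.append((best_name, body[pos:best_end]))
        pos = best_end
    return tokens


print(lex_template_body("fn( a, b )", with_ws_rule=True))
```

Because the rule lives only in templateMode, whitespace in the surrounding text is still delivered as TemplateStringLiteral tokens. Without the rule, the same input fails at the first space inside the braces.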

I tested your alternative grammar, where you put TemplateStringLiteral above WS, but contrary to your observation, this gives me:

line 1:1 extraneous input 'Some' expecting {'${', TemplateStringLiteral}

The reason for this is the same as above: Some is lexed as VAR.

Note: You can still use string literals in parser grammars for lexer rules that are defined using a single string literal in the lexer grammar. That is, you can still write `','` instead of Comma in the parser grammar even though you have a separate lexer grammar. — sepp2k, Apr 18 '19 at 20:04
