
We are trying to parse queries in the following form:

Taiwan OR China
Taiwan OR "Republic of China"

Binary operators such as OR, AND, and NOT are used to construct these queries, and quotes mark a term that contains multiple words. Our goal is to extract the individual names:

  • Taiwan and China in the first case
  • Taiwan and Republic of China in the second case

(The problem is more complex but this is a first milestone)

Starting with the basics, we have the following grammar for the first use case:

grammar Query;
parse : expr EOF ;
expr : name binop name ;
binop : 'AND' | 'OR' | 'NOT' ;
name
  :  WORD
  ;
WORD              : ('a' .. 'z' | 'A' .. 'Z')+ ;
WS : [ \t\r\n]+ -> skip ;

When trying to expand this to capture quotes and handle spaces for terms within quotes we struggled a bit.

We tried something like this:

grammar Query;
parse : expr EOF ;
expr : name binop name ;
binop : 'AND' | 'OR' | 'NOT' ;
name
  :  WORD
  | '"' NAME_WITH_SPACES '"'
  ;
WORD              : ('a' .. 'z' | 'A' .. 'Z')+ ;
NAME_WITH_SPACES  : ('a' .. 'z' | 'A' .. 'Z' | ' ')+ ;
WS : [ \t\r\n]+ -> skip ;

This fails on both inputs. For the first query, the output is:

line 1:0 mismatched input 'TAIWAN OR CHINA' expecting {'"', WORD}

and for the second query:

line 1:0 extraneous input 'TAIWAN OR ' expecting {'"', WORD}
line 1:29 mismatched input '<EOF>' expecting {'AND', 'OR', 'NOT'}

We appreciate there might be friction when attempting to contain spaces within quotes, while at the same time skipping them outside quotes.

Any ideas would be welcome - being new to this, it's hard to tell how to accommodate these conflicting requirements around whitespace.

2 Answers


No, this:

name
  :  WORD
  | '"' NAME_WITH_SPACES '"'
  ;

...

NAME_WITH_SPACES  : ('a' .. 'z' | 'A' .. 'Z' | ' ')+ ;

is not the same as:

name
  : WORD
  | NAME_WITH_SPACES
  ;

...

NAME_WITH_SPACES  : '"' ('a' .. 'z' | 'A' .. 'Z' | ' ')+ '"' ;

In the first case, input like Taiwan OR "Republic of China" is tokenised as follows:

  • Taiwan OR (type: NAME_WITH_SPACES)
  • "
  • Republic of China (type: NAME_WITH_SPACES)
  • "

because ANTLR's lexer rules try to match as many characters as possible, regardless of what the parser expects at that point. So if you let the quotes be included in the NAME_WITH_SPACES lexer rule:

NAME_WITH_SPACES  : '"' ('a' .. 'z' | 'A' .. 'Z' | ' ')+ '"' ;

then the input Taiwan OR "Republic of China" is tokenised as this:

  • Taiwan (type: WORD)
  • OR (type: OR)
  • "Republic of China" (type: NAME_WITH_SPACES)

and spaces outside quoted tokens are properly skipped.
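The difference between the two tokenisations can be reproduced outside ANTLR with a small simulation. The sketch below is not ANTLR itself: `tokenize` and the two rule tables are hypothetical helpers that only mimic the "longest match wins, ties go to the rule listed first" behaviour, with the literal tokens from the parser rules listed ahead of the lexer rules.

```python
import re

def tokenize(text, rules):
    """Toy lexer: the longest match wins; ties go to the rule listed first."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strict '>' keeps the earlier rule when two matches have equal length.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name != "WS":  # mimic '-> skip'
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

LITERALS = [("AND", r"AND"), ("OR", r"OR"), ("NOT", r"NOT")]

# Quotes matched in the parser rule: '"' NAME_WITH_SPACES '"'
QUOTES_IN_PARSER = LITERALS + [
    ("QUOTE", r'"'),
    ("WORD", r"[a-zA-Z]+"),
    ("NAME_WITH_SPACES", r"[a-zA-Z ]+"),
    ("WS", r"[ \t\r\n]+"),
]

# Quotes matched inside the NAME_WITH_SPACES lexer rule
QUOTES_IN_LEXER = LITERALS + [
    ("WORD", r"[a-zA-Z]+"),
    ("NAME_WITH_SPACES", r'"[a-zA-Z ]+"'),
    ("WS", r"[ \t\r\n]+"),
]

query = 'Taiwan OR "Republic of China"'
print(tokenize(query, QUOTES_IN_PARSER))
print(tokenize(query, QUOTES_IN_LEXER))
```

With the first table, `Taiwan OR ` (including the trailing space) comes out as a single NAME_WITH_SPACES token, because that rule can consume letters and spaces up to the first quote. With the second table the same input yields WORD, OR, and one quoted NAME_WITH_SPACES token.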

Note that you can write it like this:

WORD              : [a-zA-Z]+ ;
NAME_WITH_SPACES  : '"' [a-zA-Z ]+ '"' ;

Also see this related Q&A: Practical difference between parser rules and lexer rules in ANTLR?

Bart Kiers
  • Great, that is a very useful thing to know and clear reasoning for this behaviour then! Thank you, Bart! (And thanks for the doc reference, something I'll get more in depth with) – Cosmin Marginean Sep 05 '19 at 18:17

Further on this, we tried the following:

grammar Query;
parse : expr EOF ;
expr : name binop name ;
binop : 'AND' | 'OR' | 'NOT' ;
name
  : WORD
  | NAME_WITH_SPACES
  ;
WORD              : ('a' .. 'z' | 'A' .. 'Z')+ ;
NAME_WITH_SPACES  : '"' ('a' .. 'z' | 'A' .. 'Z' | ' ')+ '"' ;
WS : [ \t\r\n]+ -> skip ;

This seems to work reasonably well, even though to me it seems semantically identical to our first attempt, which didn't work:

grammar Query;
parse : expr EOF ;
expr : name binop name ;
binop : 'AND' | 'OR' | 'NOT' ;
name
  :  WORD
  | '"' NAME_WITH_SPACES '"'
  ;
WORD              : ('a' .. 'z' | 'A' .. 'Z')+ ;
NAME_WITH_SPACES  : ('a' .. 'z' | 'A' .. 'Z' | ' ')+ ;
WS : [ \t\r\n]+ -> skip ;
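With the variant that works (quotes inside the NAME_WITH_SPACES lexer rule), the original goal of extracting the names comes down to collecting the WORD and NAME_WITH_SPACES tokens and stripping the surrounding quotes from the latter. The sketch below illustrates that step on a plain list of (type, text) pairs; `extract_names` and the token list are hypothetical stand-ins, and in an actual ANTLR setup this logic would more likely sit in a listener or visitor on the name rule.

```python
def extract_names(tokens):
    """Collect names from (token_type, text) pairs, dropping operators."""
    names = []
    for token_type, text in tokens:
        if token_type == "WORD":
            names.append(text)
        elif token_type == "NAME_WITH_SPACES":
            # The lexer rule includes the quotes, so strip one from each end.
            names.append(text[1:-1])
    return names

# Token stream assumed for: Taiwan OR "Republic of China"
tokens = [
    ("WORD", "Taiwan"),
    ("OR", "OR"),
    ("NAME_WITH_SPACES", '"Republic of China"'),
]
print(extract_names(tokens))  # ['Taiwan', 'Republic of China']
```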