We are trying to parse queries of the following form:
Taiwan OR China
Taiwan OR "Republic of China"
Queries are built from binary operators such as OR, AND, and NOT, and double quotes mark a term that contains multiple words. Our goal is to extract the individual names:
- Taiwan and China in the first case
- Taiwan and Republic of China in the second case
(The full problem is more complex, but this is a first milestone.)
Starting with the basics, we have the following grammar for the first use case:
grammar Query;
parse : expr EOF ;
expr : name binop name ;
binop : 'AND' | 'OR' | 'NOT' ;
name
: WORD
;
WORD : ('a' .. 'z' | 'A' .. 'Z')+ ;
WS : [ \t\r\n]+ -> skip ;
We struggled when extending this to recognize quotes and to keep the spaces inside quoted terms.
We tried something like this:
grammar Query;
parse : expr EOF ;
expr : name binop name ;
binop : 'AND' | 'OR' | 'NOT' ;
name
: WORD
| '"' NAME_WITH_SPACES '"'
;
WORD : ('a' .. 'z' | 'A' .. 'Z')+ ;
NAME_WITH_SPACES : ('a' .. 'z' | 'A' .. 'Z' | ' ')+ ;
WS : [ \t\r\n]+ -> skip ;
This fails on both example queries. For the first query, the output is:
line 1:0 mismatched input 'TAIWAN OR CHINA' expecting {'"', WORD}
and for the second query:
line 1:0 extraneous input 'TAIWAN OR ' expecting {'"', WORD}
line 1:29 mismatched input '<EOF>' expecting {'AND', 'OR', 'NOT'}
We realize there is a tension between keeping spaces inside quotes and skipping them outside quotes.
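One direction that might reconcile the two, as an untested sketch assuming ANTLR 4: let the lexer match a whole quoted phrase as a single token, so the spaces inside it are consumed before the WS rule gets a chance to skip them. The token name QUOTED and the explicit AND/OR/NOT token rules are illustrative choices of ours, not taken from an existing grammar.

```antlr
grammar Query;

parse  : expr EOF ;
expr   : name binop name ;
binop  : AND | OR | NOT ;
name   : WORD | QUOTED ;

// Keyword tokens are listed before WORD so they win when both match the same text.
AND    : 'AND' ;
OR     : 'OR' ;
NOT    : 'NOT' ;

// Assumption: a quoted term is one token, from opening to closing quote,
// with no embedded quotes or line breaks. Spaces inside belong to the token.
QUOTED : '"' ~["\r\n]* '"' ;

WORD   : [a-zA-Z]+ ;
WS     : [ \t\r\n]+ -> skip ;
```

With this shape, the text of a QUOTED token would still include the surrounding quotes, so they would need to be stripped when collecting the names. It also seems to explain the errors above: NAME_WITH_SPACES can match across spaces anywhere in the input, and because the lexer prefers the longest match, it swallows text such as `Taiwan OR China` as one token.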
Any ideas would be welcome. Being new to this, we find it hard to tell how to accommodate these conflicting requirements around whitespace.