I'm using lark, an excellent python parsing library.
It provides an Earley and LALR(1) parser and is defined through a custom EBNF
format. (EBNF stands for Extended Backus–Naur form).
Lowercase definitions are rules, uppercase definitions are terminals. Lark also provides a weight for uppercase definitions to prioritize the matching.
I'm trying to define a grammar but I'm stuck with a behavior I can't seem to balance.
I have some rules with unnamed literals (the strings or characters between double-quotes):
directives: directive+
directive: "@" NAME arguments ?
directive_definition: description? "directive" "@" NAME arguments? "on" directive_locations
directive_locations: "SCALAR" | "OBJECT" | "ENUM"
arguments: "(" argument+ ")"
argument: NAME ":" value
union_type_definition: description? "union" NAME directives? union_member_types?
union_member_types: "=" NAME ("|" NAME)*
description: STRING | LONG_STRING
STRING: /("(?!"").*?(?<!\\)(\\\\)*?"|'(?!'').*?(?<!\\)(\\\\)*?')/i
LONG_STRING: /(""".*?(?<!\\)(\\\\)*?"""|'''.*?(?<!\\)(\\\\)*?''')/is
NAME.2: /[_A-Za-z][_0-9A-Za-z]*/
It works well for 99% of use case. But if, in my parsed language, I use a directive
which is called directive
, everything breaks:
union Foo @something(test: 42) = Bar | Baz # This works
union Foo @directive(test: 42) = Bar | Baz # This fails
Here, the directive
string is matched on the unnamed literal in the directive_definition
rule when it should match the NAME.2
terminal.
How can I balance / adjust this so there is no ambiguity possible for the LALR(1) parser ?