Lark parser can't parse characters, even though they are defined in regex of rule

Question

I'm trying to write an SMTP parser and took the rules for quoted strings from the RFC. This is my grammar, with all the working parts taken out to focus on the part that fails:

quoted_string  : /[\x22]/ qcontentsmtp* /[\x22]/
qcontentsmtp   : qtextsmtp | quoted_pairsmtp
quoted_pairsmtp  : /[\x5C\x5C]/ /[\x20-\x7E]/
qtextsmtp      : /[\x20-\x21|\x23-\x5B|\x5D-\x7E]/

command : [ quoted_string ]

The only start rule for the parser is the command rule.

When I input "quoted_string", I would expect it to be parsed as such:

command -> quoted_string -> qcontentsmtp -> qtextsmtp

As you can see, qtextsmtp covers alphanumeric characters, coded as a regex, as shown in the RFC. However, when I try to parse it, I get this message:

input = '"quoted_string"'
....
####### Parsing Failed
No terminal defined for 'q' at line 1 col 2

"quoted_string"
 ^

When I input just "", it works as expected.

When I replace the regex in the qtextsmtp rule with "a" and use '"a"' as the input, it also works.

I defined all the rules as functions in my transformer, very basic, like so:

from lark import Transformer

class StringsTransformer(Transformer):
    # externals
    def quoted_string(self, args):
        return "".join(args)

    # internals
    def qcontentsmtp(self, args):
        return "".join(args)

    def quoted_pairsmtp(self, args):
        return "".join(args)

    def qtextsmtp(self, args):
        return "".join(args)

But I don't even get to those rules because, as I said, it won't even parse.

I'm not quite sure why the regex doesn't work. I use this type of rule in other parts of the grammar and it works just fine there; only this one fails.

What does the forward slash mean in your grammar? Another example where I've seen it is `/.*?/`. In yours it appears in `/[\x22]/ qcontentsmtp* /[\x22]/`. – Charlie Parker, Jul 07 '22 at 16:44

score 1 · Answer 1 · answered Nov 09 '19 at 22:04

I'd recommend using string literals in terminals if you can; even though they won't match the RFC identically, they certainly work in the existing lark parser implementation. (Your example fails for me too, but using the below works. Not sure I understand the underpinnings as to why.)

DOUBLE_QUOTED_STRING  : /"[^"]*"/

reference from the lark src.
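
To make the trade-off concrete, here is a minimal sketch that uses only Python's re module (no Lark involved; the variable names are illustrative, not from the question). It suggests that this pattern accepts the input from the question but, unlike the RFC's quoted-pair rule, rejects a string containing a backslash-escaped quote:

```python
import re

# The regex from the DOUBLE_QUOTED_STRING terminal above, compiled directly.
double_quoted = re.compile(r'"[^"]*"')

# The input from the question: a quoted string without escapes.
plain = double_quoted.fullmatch('"quoted_string"')

# A quoted string with an escaped quote inside, i.e. the characters "a\"b".
# [^"]* stops at the first inner quote, so the whole string cannot match.
with_escape = double_quoted.fullmatch(r'"a\"b"')

print(plain is not None)    # does the plain string match?
print(with_escape is None)  # is the escaped variant rejected?
```

So this terminal is a reasonable shortcut only if escaped quotes inside the string do not need to be supported.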

How are you defining your grammar? You may need to escape your \ backslashes, if you are defining it inline in your code (vs reading from a file).
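
As a quick illustration of the escaping point (a sketch in plain Python; the variable names are made up for this example): in a normal string literal Python itself decodes `\x22` before the grammar text ever reaches Lark, whereas a raw string or a doubled backslash keeps the escape intact.

```python
# Normal literal: Python turns \x22 into a double quote immediately.
inline = "/[\x22]/"

# Raw literal: the backslash and the characters x22 are preserved as written.
raw = r"/[\x22]/"

# Doubling the backslash in a normal literal gives the same text as the raw form.
doubled = "/[\\x22]/"

print(inline)           # the escape is already decoded here
print(raw)              # the escape is still present here
print(raw == doubled)   # both spellings produce identical text
```

When the grammar is read from a file, no such decoding by Python takes place, which is why the distinction only matters for grammars defined inline.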

answered Nov 09 '19 at 22:04

j6m8


Thanks for the answer. As I said in my question, I would really like for my lark parser to match the RFC as much as possible. I already had to change some parts and would really like to keep that to a minimum. As it turns out however, doing that in this case is not necessarily possible. Please check out [the above answer](https://stackoverflow.com/a/58784184/4687402) – Benjamin Basmaci Nov 28 '19 at 10:55
what does the forward slash mean in your grammar? another example where I've seen it is in the following `/.*?/`. – Charlie Parker Jul 07 '22 at 16:44
This represents a [regular expression](https://en.wikipedia.org/wiki/Regular_expression), which is a general grammar for string-matching (specifically, for string-matching of regular languages... But that's maybe in the weeds here!) – j6m8 Jul 08 '22 at 17:11

user4815162342 · Accepted Answer · 2019-11-10T17:54:23.660

It seems that Lark's regexp handling is confused by [ and ] being written as the hex escapes \x5b and \x5d, so the letter q simply doesn't match the regexp. After replacing \x5b with \[ and \x5d with \], the grammar parses the provided input, as shown by the following program:

import lark

grammar = r"""
quoted_string  : /[\x22]/ qcontentsmtp* /[\x22]/
qcontentsmtp   : qtextsmtp | quoted_pairsmtp
quoted_pairsmtp  : /[\x5C\x5C]/ /[\x20-\x7E]/
qtextsmtp      : /[\x20-\x21\x23-\[\]-\x7E]/

command : [ quoted_string ]
"""

parser = lark.Lark(grammar, start='command')

print(parser.parse('"quoted_string"'))

(Note that | is superfluous in character sets, it is interpreted as just another character to match.)
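
A small check with Python's re module illustrates the point about `|`. This is only a sketch; the two-letter character sets are chosen for illustration and are not taken from the grammar:

```python
import re

# Inside [...] the pipe is an ordinary member of the set, not alternation.
with_pipe = re.compile(r'[a|b]')
without_pipe = re.compile(r'[ab]')

# Both sets match the letters...
letters_ok = all(with_pipe.fullmatch(c) and without_pipe.fullmatch(c) for c in 'ab')

# ...but only the set that contains | matches a literal pipe character.
pipe_in_first = with_pipe.fullmatch('|') is not None
pipe_in_second = without_pipe.fullmatch('|') is not None

print(letters_ok, pipe_in_first, pipe_in_second)
```

In the question's regex the effect is harmless, because | (0x7C) already falls inside the \x5D-\x7E range, but the pipes add nothing.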

This is not a general limitation of Python regexps, which are perfectly capable of accepting [ and ] escaped in hex:

>>> import re
>>> re.compile(r'[\x23-\x5b\x5d-\x7e]').match('q')
<re.Match object; span=(0, 1), match='q'>

I've now reported the issue to the maintainers of Lark.
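
One possible explanation, offered as an assumption and not something confirmed in this thread: if the hex escapes were decoded into literal characters before the pattern reaches the regex engine, the resulting ] would close the character set early. The sketch below simulates that with plain Python; `decode_hex_escapes` is a hypothetical helper written for this example and is not part of Lark.

```python
import re

# The qtextsmtp pattern exactly as written in the question's grammar.
as_written = r'[\x20-\x21|\x23-\x5B|\x5D-\x7E]'

def decode_hex_escapes(pattern: str) -> str:
    """Hypothetical helper: replace every \\xNN escape with its literal character."""
    return re.sub(r'\\x([0-9A-Fa-f]{2})',
                  lambda m: chr(int(m.group(1), 16)),
                  pattern)

# Assumption: this mimics what a pre-decoding step would hand to the regex engine.
decoded = decode_hex_escapes(as_written)

matches_as_written = re.match(as_written, 'q') is not None
matches_decoded = re.match(decoded, 'q') is not None

print(repr(decoded))
print(matches_as_written, matches_decoded)
```

In the decoded form the set ends at the first literal ], so it no longer covers the lowercase letters. Writing the brackets as \[ and \] keeps them escaped, which would be consistent with the fix above.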

edited Nov 10 '19 at 17:54

answered Nov 09 '19 at 22:41

user4815162342


Works beautifully! And thank you for reporting the issue. As you stated there, I did expect the hex code to be escaped and I 100% support the notion that this is a usability issue. – Benjamin Basmaci Nov 28 '19 at 10:52
What does the forward slash mean in your grammar? Another example where I've seen it is `/.*?/`. In yours it appears in `/[\x22]/ qcontentsmtp* /[\x22]/`. – Charlie Parker Jul 07 '22 at 16:44
@CharlieParker It's a Lark thing, `/.../` terminals contain regular expressions. See [the docs here](https://lark-parser.readthedocs.io/en/latest/grammar.html#terminals). – user4815162342 Jul 07 '22 at 19:27
Thanks! I did see the docs, but they didn't explicitly say that's what it meant, plus there was a random + sign, e.g. `/regular expression+/`, which personally really threw me off. Confirmation is nice, I appreciate it. – Charlie Parker Jul 07 '22 at 19:39
