I'm trying to write an SMTP parser and took the rules for quoted strings from the RFC. This is the relevant part of my grammar (I've removed everything that works to focus on the part that doesn't):

quoted_string  : /[\x22]/ qcontentsmtp* /[\x22]/
qcontentsmtp   : qtextsmtp | quoted_pairsmtp
quoted_pairsmtp  : /[\x5C\x5C]/ /[\x20-\x7E]/
qtextsmtp      : /[\x20-\x21|\x23-\x5B|\x5D-\x7E]/

command : [ quoted_string ]

The only start rule for the parser is command.

When I input "quoted_string", I would expect it to be parsed as such:

command -> quoted_string -> qcontentsmtp -> qtextsmtp

As you can see, qtextsmtp covers the alphanumeric characters, written as a regex character class as shown in the RFC. However, when I try to parse the input, I get this message:

input = '"quoted_string"'
....
####### Parsing Failed
No terminal defined for 'q' at line 1 col 2

"quoted_string"
 ^

When I input just "", it works as expected.

When I change the qtextsmtp rule to use "a" instead of the regex and make the input '"a"', it also works.

I defined all the rules as functions in my transformer, very basic, like so:

from lark import Transformer

class StringsTransformer(Transformer):
    # externals
    def quoted_string(self, args):
        # join the quotes and the already-joined contents into one string
        return "".join(args)

    # internals
    def qcontentsmtp(self, args):
        return "".join(args)

    def quoted_pairsmtp(self, args):
        return "".join(args)

    def qtextsmtp(self, args):
        return "".join(args)

But I don't even get to those rules because, as I said, the input won't even parse.

I'm not quite sure why the regex doesn't work. I use this type of rule in other parts of the grammar and it works just fine; only this one fails.

Benjamin Basmaci
  • What does the forward slash mean in your grammar? Another example where I've seen it is `/.*?/`. In yours, it appears in `/[\x22]/ qcontentsmtp* /[\x22]/`. – Charlie Parker Jul 07 '22 at 16:44

2 Answers

I'd recommend using string literals in terminals if you can; even though they won't match the RFC exactly, they certainly work in the current Lark implementation. (Your example fails for me too, but the terminal below works. I'm not sure I understand the underlying reason.)

DOUBLE_QUOTED_STRING  : /"[^"]*"/

Reference: taken from the Lark source.

How are you defining your grammar? You may need to escape your backslashes (\) if you are defining it inline in your code (rather than reading it from a file).
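To illustrate the escaping point, here is a minimal sketch in plain Python (no Lark involved). The grammar fragment is a shortened, hypothetical version of the rule from the question; it only shows what Python itself does to the backslashes before any parser sees the text.

```python
# Hypothetical one-rule grammar fragments, used only to show string escaping.
plain = "qtextsmtp : /[\x5B]/"     # normal string: Python turns \x5B into "["
raw = r"qtextsmtp : /[\x5B]/"      # raw string: the backslash is kept as-is
escaped = "qtextsmtp : /[\\x5B]/"  # doubled backslash: same text as the raw string

print(plain)           # the hex escape is already gone at this point
print(raw)             # the hex escape is still present in the grammar text
print(raw == escaped)  # both spellings produce identical grammar text
```

A grammar read from a file behaves like the raw string, because Python does not process escapes in file contents.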

j6m8
  • Thanks for the answer. As I said in my question, I would really like my Lark parser to match the RFC as closely as possible. I already had to change some parts and would really like to keep that to a minimum. As it turns out, however, that is not necessarily possible in this case. Please check out [the above answer](https://stackoverflow.com/a/58784184/4687402) – Benjamin Basmaci Nov 28 '19 at 10:55
  • what does the forward slash mean in your grammar? another example where I've seen it is in the following `/.*?/`. – Charlie Parker Jul 07 '22 at 16:44
  • This represents a [regular expression](https://en.wikipedia.org/wiki/Regular_expression), which is a general grammar for string-matching (specifically, for string-matching of regular languages... But that's maybe in the weeds here!) – j6m8 Jul 08 '22 at 17:11

It seems that Lark's regexp parser is confused by the quoting of [ and ] as \x5b and \x5d respectively, so the letter q simply doesn't match the regexp. After replacing \x5b with \[ and \x5d with \], the grammar parses the provided input, as shown by the following program:

import lark

grammar = r"""
quoted_string  : /[\x22]/ qcontentsmtp* /[\x22]/
qcontentsmtp   : qtextsmtp | quoted_pairsmtp
quoted_pairsmtp  : /[\x5C\x5C]/ /[\x20-\x7E]/
qtextsmtp      : /[\x20-\x21\x23-\[\]-\x7E]/

command : [ quoted_string ]
"""

parser = lark.Lark(grammar, start='command')

print(parser.parse('"quoted_string"'))

(Note that | is superfluous in character sets, it is interpreted as just another character to match.)
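A quick way to check this with Python's own `re` module is sketched below. The character classes are small made-up examples rather than the ones from the grammar, chosen so that `|` is not already covered by one of the ranges.

```python
import re

# Illustrative classes only: inside [...] the "|" is a literal, not alternation.
with_pipe = re.compile(r'[a-c|x-z]')
without_pipe = re.compile(r'[a-cx-z]')

print(bool(with_pipe.fullmatch('|')))     # "|" itself is accepted by the class
print(bool(without_pipe.fullmatch('|')))  # without the "|", it is rejected
print(bool(with_pipe.fullmatch('b')), bool(without_pipe.fullmatch('b')))
```

Both classes accept the same letters; the only difference is that the first one additionally accepts a literal `|`.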

This is not a general limitation of Python regexps, which are perfectly capable of accepting [ and ] escaped in hex:

>>> import re
>>> re.compile(r'[\x23-\x5b\x5d-\x7e]').match('q')
<re.Match object; span=(0, 1), match='q'>
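One possible way such a failure could arise is if the hex escapes were decoded into literal characters before the pattern reached the regex engine. That is only a guess at the mechanism, not something established here, and `decode_hex_escapes` below is a hypothetical helper written for this sketch, not a Lark function. It uses plain `re` to show how much the pattern changes under that assumption.

```python
import re

def decode_hex_escapes(pattern):
    """Hypothetical helper: replace each \\xHH escape with its literal character."""
    return re.sub(r'\\x([0-9A-Fa-f]{2})',
                  lambda m: chr(int(m.group(1), 16)),
                  pattern)

QTEXT = r'[\x20-\x21\x23-\x5B\x5D-\x7E]'  # character class as written in the grammar
DECODED = decode_hex_escapes(QTEXT)       # what it looks like once escapes are decoded

print(DECODED)                              # the first "]" now closes the class early
print(bool(re.fullmatch(QTEXT, 'q')))       # escaped form accepts "q"
print(bool(re.fullmatch(DECODED, 'q')))     # decoded form does not
print(bool(re.fullmatch(DECODED, 'A-~]')))  # it wants a class char followed by "-~]"
```

Under that assumption, the decoded pattern is a shorter class followed by the literal text `-~]`, which would be consistent with `q` being rejected and with the `\[` / `\]` spelling working.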

I've now reported the issue to the maintainers of Lark.

user4815162342
  • Works beautifully! And thank you for reporting the issue. As you stated there, I did expect the hex code to be escaped and I 100% support the notion that this is a usability issue. – Benjamin Basmaci Nov 28 '19 at 10:52
  • What does the forward slash mean in your grammar? Another example where I've seen it is `/.*?/`. In yours, it appears in `/[\x22]/ qcontentsmtp* /[\x22]/`. – Charlie Parker Jul 07 '22 at 16:44
  • @CharlieParker It's a Lark thing, `/.../` terminals contain regular expressions. See [the docs here](https://lark-parser.readthedocs.io/en/latest/grammar.html#terminals). – user4815162342 Jul 07 '22 at 19:27
  • Thanks! I did see the docs, but they didn't explicitly say that's what it meant, plus there was a random + sign, e.g. `/regular expression+/`, which personally really threw me off. Confirmation is nice, I appreciate it. – Charlie Parker Jul 07 '22 at 19:39