
I was drawn to Raku due to its built-in grammars and figured I'd play around with it and write a simple email address parser. Only problem: I couldn't get it to work.

I tried countless iterations before landing on something that actually works, and I'm struggling to understand why.

All it boiled down to was changing token to rule.

Here's my example code:

grammar Email {
  token TOP { <name> '@' [<subdomain> '.']* <domain> '.' <tld> }  
  token name { \w+ ['.' \w+]* }
  token domain { \w+ }
  token subdomain { \w+ }
  token tld { \w+ }
}
say Email.parse('foo.bar@baz.example.com');

doesn't work; it simply prints Nil, but

grammar Email {
  rule TOP { <name> '@' [<subdomain> '.']* <domain> '.' <tld> }  
  token name { \w+ ['.' \w+]* }
  token domain { \w+ }
  token subdomain { \w+ }
  token tld { \w+ }
}
say Email.parse('foo.bar@baz.example.com');

does work and correctly prints

「foo.bar@baz.example.com」
 name => 「foo.bar」
 subdomain => 「baz」
 domain => 「example」
 tld => 「com」

And all I changed was token TOP to rule TOP.

From what I can gather from the documentation, the only difference between those two keywords is that whitespace is significant in rule, but isn't in token. If that's true, the first example should work, as I want to ignore the whitespace between the individual pieces of the pattern.

Removing the spaces between the pieces

rule TOP { <name>'@'[<subdomain>'.']*<domain>'.'<tld> }

reverts the behaviour back to printing Nil.

Anyone able to clue me in on what's going on here?

EDIT: Changing the TOP rule to a regex instead, which allows for backtracking, makes it work too.

The question still remains, how come rule { } (which is the same as regex {:ratchet :sigspace }) matches when token { } (which is the same as regex {:ratchet }) doesn't?

The email address doesn't have any spaces in it, so for all intents and purposes it should fail right away.

  • @raiph I added `[context-free-grammar]` because I originally didn't know any better. A few hours later I had read up more on it all, and you're right, it is closer to an unrestricted grammar – Electric Coffee May 28 '20 at 06:34
  • Friedl in "Mastering Regular Expressions" (O'Reilly, 1997; page 294 et seq) develops a Perl regex to match email addresses, 4724 (!) bytes long. But even that one falls short on some corner cases... – vonbrand Jul 18 '20 at 15:31

3 Answers


This answer explains the problem, provides a simple solution, and then goes deep.

The problem with your grammar

First, your SO question demonstrates what seems to be either an extraordinary bug or a common misunderstanding. See JJ's answer for the issue he's filed to follow up on it, and/or my footnote.[4]

Putting the bug/"bug" aside, your grammar directs Raku to not match your input:

  • The [<subdomain> '.']* atom eagerly consumes the string 'baz.example.' from your input;

  • The remaining input ('com') fails to match the remaining atoms (<domain> '.' <tld>);

  • The :ratchet that's in effect for tokens means the grammar engine does not backtrack into the [<subdomain> '.']* atom.

Thus the overall match fails.
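
You can see the same failure in golfed form, outside the grammar (a sketch; the ^ and $ anchors stand in for the anchoring that .parse applies to TOP):

say 'baz.example.com' ~~ token { ^ [\w+ '.']* \w+ '.' \w+ $ };  # Nil
say 'baz.example.com' ~~ regex { ^ [\w+ '.']* \w+ '.' \w+ $ };  # 「baz.example.com」

The token version fails because [\w+ '.']* eats 'baz.example.' and, being ratcheted, never gives it back; the regex version backtracks, drops the 'example.' repetition, and succeeds.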

The simplest solution

The simplest solution to make your grammar work is to append ! to the [<subdomain> '.']* pattern in your token.

This has the following effect:

  • If any of the remainder of the token fails (after the subdomain atom), the grammar engine will backtrack to the subdomain atom, drop the last of its match repetitions, and then try to move forward again;

  • If matching fails again the engine will again backtrack to the subdomain atom, drop another repetition, and try again;

  • The grammar engine will repeat the above actions until either the rest of the token matches or there are no matches of the [<subdomain> '.'] atom left to backtrack on.

Note that adding the ! to the subdomain atom means that the backtracking behaviour is limited to just the subdomain atom; if the domain atom matched but the tld atom then failed, the token would fail rather than try to backtrack. This is because the whole point of tokens is that, by default, they don't backtrack to earlier atoms after they've succeeded.
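
Applied to your grammar, the whole fix is that single character (a sketch, with the output paraphrased in the comment):

grammar Email {
  token TOP { <name> '@' [<subdomain> '.']*! <domain> '.' <tld> }
  token name { \w+ ['.' \w+]* }
  token domain { \w+ }
  token subdomain { \w+ }
  token tld { \w+ }
}
say Email.parse('foo.bar@baz.example.com');
# 「foo.bar@baz.example.com」 with name, subdomain, domain, and tld captured as before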

Playing with Raku, developing grammars, and debugging

Nil is fine as a response from a grammar that is known (or thought) to work fine, when you don't need a more useful response in the event of a parse failure.

For any other scenario there are much better options, as summarized in my answer to How can error reporting in grammars be improved?.

In particular, for playing around, or developing a grammar, or debugging one, the best option by far is to install the free Comma and use its Grammar Live View feature.

Fixing your grammar; general strategies

Your grammar suggests three options1:

  • Parse forwards with some backtracking. (The simplest solution.)

  • Parse backwards. Write the pattern in reverse, and reverse the input and output.

  • Post parse the parse.

Parse forwards with some backtracking

Backtracking is a reasonable approach for parsing some patterns. But it is best minimized, to maximize performance, and even then still carries DoS risks if written carelessly.2


To switch on backtracking for an entire token, just switch the declarator to regex instead. A regex is just like a token but specifically enables backtracking like a traditional regex.
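
For your example, that's just the one-declarator change:

regex TOP { <name> '@' [<subdomain> '.']* <domain> '.' <tld> }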

Another option is to stick with token and limit the part of the pattern that might backtrack. One way to do that is to append a ! after an atom to let it backtrack, explicitly overriding the token's overall "ratchet" that would otherwise kick in when that atom succeeds and matching moves on to the next atom:

token TOP { <name> '@' [<subdomain> '.']*! <domain> '.' <tld> }
                                         

An alternative to ! is to insert :!ratchet to switch "ratcheting" off for a part of a rule, and then :ratchet to switch ratcheting back on again, e.g.:

token TOP { <name> '@' :!ratchet [<subdomain> '.']* :ratchet <domain> '.' <tld> }  

(You can also use r as an abbreviation for ratchet, i.e. :!r and :r.)
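
With the abbreviation, that's equivalently:

token TOP { <name> '@' :!r [<subdomain> '.']* :r <domain> '.' <tld> }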

Parse backwards

A classic parsing trick that works for some scenarios is to parse backwards as a way to avoid backtracking.

grammar Email {
  token TOP { <tld> '.' <domain> ['.' <subdomain> ]* '@' <name> }  
  token name { \w+ ['.' \w+]* }
  token domain { \w+ }
  token subdomain { \w+ }
  token tld { \w+ }
}
say Email.parse(flip 'foo.bar@baz.example.com').hash>>.flip;
#{domain => example, name => foo.bar, subdomain => [baz], tld => com}

Probably too complicated for most folks' needs but I thought I'd include it in my answer.

Post parse the parse

In the above I presented a solution that introduces some backtracking, and another that avoids it but with significant costs in terms of ugliness, cognitive load, etc. (parsing backwards?!?).

There's another very important technique that I overlooked until reminded by JJ's answer.1 Just parse the results of the parse.


Here's one way. I've completely restructured the grammar, partly to make more sense of this way of doing things, and partly to demonstrate some Raku grammar features:

grammar Email {
  token TOP {
              <dotted-parts(1)> '@'
    $<host> = <dotted-parts(2)>
  }
  token dotted-parts(\min) { <parts> ** {min..*} % '.' }
  token parts { \w+ }
}
say Email.parse('foo.bar@baz.buz.example.com')<host><parts>

displays:

[「baz」 「buz」 「example」 「com」]

While this grammar matches the same strings as yours, and post-parses like JJ's, it's obviously very different:

  • The grammar is reduced to three tokens.

  • The TOP token makes two calls to a generic dotted-parts token, with an argument specifying the minimum number of parts.

  • $<host> = ... captures the following atom under the name <host>.

    (This is typically redundant if the atom is itself a named pattern, as it is in this case -- <dotted-parts>. But "dotted-parts" is rather generic; and to refer to the second match of it (the first comes before the @), we'd need to write <dotted-parts>[1]. So I've tidied up by naming it <host>.)

  • The dotted-parts pattern may look a bit challenging but it's actually pretty simple (a standalone sketch of its two clauses follows this list):

    • It uses a quantifier clause (** {min..*}) to express any number of parts provided there are at least min of them.

    • It uses a modifier clause (% <separator>) which says there must be a dot between each part.

  • <host><parts> extracts, from the parse tree, the captured data associated with the parts token of the second dotted-parts call in the TOP rule. It's an array: [「baz」 「buz」 「example」 「com」].
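
If the ** and % clauses are new to you, here's a minimal standalone sketch of them (ordinary regexes, independent of the grammar above):

say 'a.b.c' ~~ / ^ [\w+] ** 2..* % '.' $ /;  # 「a.b.c」
say 'a'     ~~ / ^ [\w+] ** 2..* % '.' $ /;  # Nil -- fewer than the two required parts
say 'a.b.'  ~~ / ^ [\w+] ** 2..* % '.' $ /;  # Nil -- % only allows the separator *between* parts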


Sometimes one wants some or all of the reparsing to happen during the parsing, so that reparsed results are ready when a call to .parse completes.

JJ has shown one way to code what are called actions; a minimal sketch of the machinery follows this list. This involves:

  • Creating an "actions" class containing methods whose names correspond to named rules in the grammar;

  • Telling the parse method to use that actions class;

  • If a rule succeeds, then the action method with the corresponding name is called (while the rule remains on the call stack);

  • The match object corresponding to the rule is passed to the action method;

  • The action method can do whatever it likes, including reparsing what just got matched.
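
Here's a minimal sketch of that machinery; the grammar and names are hypothetical stand-ins, not JJ's actual code (see his answer for the real thing applied to the email grammar):

grammar Greeting { token TOP { (\w+) } }
class Greeting-Actions {
  # Called automatically when TOP succeeds; $/ is TOP's match object
  method TOP ($/) { make $0.uc }
}
say Greeting.parse('hello', actions => Greeting-Actions.new).made;  # HELLO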

It's simpler and sometimes better to write actions directly inline:

grammar Email {
  token TOP {
              <dotted-parts(1)> '@'
    $<host> = <dotted-parts(2)>

    # The new bit:
    {
      make (subs => .[ 0 .. *-3 ],
            dom  => .[      *-2 ],
            tld  => .[      *-1 ])

      given $<host><parts>
    }

  }
  token dotted-parts(\min) { <parts> ** {min..*} % '.' }
  token parts { \w+ }
}
.say for Email.parse('foo.bar@baz.buz.example.com') .made;

displays:

subs => (「baz」 「buz」)
dom => 「example」
tld => 「com」

Notes:

  • I've directly inlined the code doing the reparsing.

    (One can insert arbitrary code blocks ({...}) anywhere one could otherwise insert an atom. In the days before we had grammar debuggers, a classic use case was { say $/ } which prints $/, the match object, as it is at the point the code block appears.)

  • If a code block is put at the end of a rule, as I have done, it is almost equivalent to an action method.

    (It will be called when the rule has otherwise completed, and $/ is already fully populated. In some scenarios inlining an anonymous action block is the way to go. In others, breaking it out into a named method in an action class like JJ did is better.)

  • make is a major use case for action code.

    (All make does is store its argument in the .made attribute of $/, which in this context is the current parse tree node. Results stored by make are automatically thrown away if backtracking subsequently throws away the enclosing parse node. Often that's precisely what one wants.)

  • foo => bar forms a Pair.

  • The postcircumfix [...] operator indexes its invocant:

    • In this case there's just a prefix . without an explicit LHS so the invocant is "it". The "it" was set up by the given, i.e. it's (excuse the pun) $<host><parts>.
  • The * in the index *-n is the invocant's length; so [ 0 .. *-3 ] is all but the last two elements of $<host><parts>. (A standalone sketch follows these notes.)

  • The .say for ... line ends in .made3, to pick up the make'd value.

  • The make'd value is a list of three pairs breaking out $<host><parts>.
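
Standalone, that indexing looks like this (a sketch, with a plain array standing in for $<host><parts>):

my @parts = <baz buz example com>;
given @parts {
  say .[ 0 .. *-3 ];  # (baz buz) -- all but the last two
  say .[ *-2 ];       # example
  say .[ *-1 ];       # com
}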


Footnotes

1 I had truly thought my first two options were the two main ones available. It's been around 30 years since I encountered Tim Toady online. You'd think by now I'd have learned by heart his eponymous aphorism -- There Is More Than One Way To Do It!

2 Beware "pathological backtracking". In a production context, if you have suitable control of your input, or the system your program runs on, you may not have to worry about deliberate or accidental DoS attacks because they either can't happen, or will uselessly take down a system that's rebootable in the event of being rendered unavailable. But if you do need to worry, i.e. the parsing is running on a box that needs to be protected from a DoS attack, then an assessment of the threat is prudent. (Read Details of the Cloudflare outage on July 2, 2019 to get a real sense of what can go wrong.) If you are running Raku parsing code in such a demanding production environment then you would want to start an audit of code by searching for patterns that use regex, /.../ (the ... are metasyntax), :!r (to include :!ratchet), or *!.

3 There's an alias for .made; it's .ast. I think it stands for A Sparse Tree or Annotated Subset Tree and there's a cs.stackexchange.com question that agrees with me.

4 Golfing your problem, this seems wrong:

say 'a' ~~ rule  { .* a } # 「a」

More generally, I thought the only difference between a token and a rule was that the latter injects a <.ws> at each significant space. But that would mean this should work:

token TOP { <name> <.ws> '@' <.ws> [<subdomain> <.ws> '.']* <.ws>
            <domain> <.ws> '.' <.ws> <tld> <.ws>
} 

But it doesn't!

At first this freaked me out. Writing this footnote two months later, I'm feeling somewhat less freaked out.

Part of this is my speculation about the reason I've not been able to find anyone reporting this in the 15 years since the first Raku grammar prototype became available via Pugs. That speculation includes the possibility that @Larry deliberately designed them to work as they do, and it being a "bug" is primarily misunderstanding among the current crop of mere mortals like us trying to provide an explanation for why Raku does what it does based on our analysis of our sources -- roast, the original design documents, the compiler source code, etc.

In addition, given that the current "buggy" behaviour seems ideal and intuitive (except for contradicting the doc), I'm focusing on interpreting my feeling of great discomfort -- during this interim period of unknown length in which I don't understand why it gets it right -- as a positive experience. I hope others can too -- or, much better, figure out what is actually going on and let us know!

raiph
  • I'm not sure why `say 'a' ~~ rule { .* a } # 「a」` seems problematic to you. Certainly managing wild-characters at the beginning of a string is a hassle, and `rule` seems to help that. Compare to use of the `token` keyword: `say 'a' ~~ token { .* a } # Nil` returns `Nil`, and (I imagine) could prove a frustrating experience to get working using the `token` keyword alone. – jubilatious1 Mar 09 '21 at 21:02
  • @jubilatious1 `say 'a' ~~ token { .* a } # Nil`. That makes sense, because `token` does not backtrack. The `.*` swallows the `a`, does not backtrack, so there's no `a` left to match, so the match fails. But `rule` doesn't backtrack either. So why does it match? Does `rule` matching default to frugal contraire ["By default, quantifiers request a greedy match"](https://docs.raku.org/language/regexes#Greedy_versus_frugal_quantifiers:_?)? – raiph Mar 09 '21 at 22:40
  • @jubilatious1 Thanks for commenting. I've revisited the whole scene and have suddenly solved the "crime" (it's not a bug, just a misunderstanding). I've decided I will write another answer explaining it all (hopefully in the next few weeks, maybe a month or two); I will thank you in it for triggering my insight. :) – raiph Mar 09 '21 at 23:45
  • TY for the replies. I've been working on trying to sharpen/simplify whitespace-containing captures in the SO post that follows. Trying to implement Rules-vs-Tokens but am hitting an intellectual impasse. Thus I await your updated reply! https://stackoverflow.com/a/66540269/7270649 – jubilatious1 Mar 10 '21 at 18:16
  • _"But rule doesn't backtrack either. So why does it match?"_ My guess is for `rules` the Regex Engine "atomizes" desired-elements to include (and search for) significant whitespace. I think it may be instructive to look at the following four statements, because (significant) whitespace is what the `rule` keyword seems to enforce: `say 'a' ~~ rule { .? a }; # 「a」` ; `say 'a' ~~ rule { .?a }; # Nil`; `say 'aa' ~~ rule { .? a }; # Nil`; `say 'aa' ~~ rule { .?a }; # 「aa」` – jubilatious1 Mar 10 '21 at 19:18
  • Right. Building on that, perhaps even more instructive is code like: [`say 'a' ~~ rule { ^ .? {say $/}a $ }` etc](https://glot.io/snippets/fwm65q3j8p) (click to run what I mean in glot.io). – raiph Mar 10 '21 at 21:58
  • Well, I think I understand on an intuitive level what's going on, but I'm still perplexed as to the return values of some of your internal `{$/}` blocks. The quote from Synopsis_5 is thus _"Initial whitespace is ignored at the front of any regex, to make it easy to write rules that can participate in longest-token-matching alternations."_ So maybe that's the answer in a nutshell. https://design.raku.org/S05.html – jubilatious1 Mar 11 '21 at 04:35
  • `say 'a' ~~ rule { .? a }; # 「a」` -- In a "scanning initialization" model (ratcheting, BTW), it's conceivable that an initial `.?<.ws>` atom gets filled with the `a` character, but the `a` immediately gets dropped on encountering the second `.` atom. The second `.` atom then picks up (gets filled with) the `a` and the entirety of the first `.?<.ws>` atom is deemed (ignorable) _"initial whitespace"_ and thrown away. – jubilatious1 Mar 11 '21 at 04:47

Edit: this is probably a bug. The straight answer to the question is whitespace interpretation (in some restricted ways), though what actually decides matters here seems to be ratcheting. It shouldn't be that way, however, and it only happens sometimes, which is why a bug report has been created. Thanks a lot for the question. Anyway, find below a different (and non-buggy) way to solve the grammar problem.


It's probably good to use Grammar::Tracer to check what's going on: just download it and put use Grammar::Tracer at the top. In the first case, with token: [screenshot: Grammar::Tracer output for the token grammar]

Tokens don't backtrack, so the [<subdomain> '.']* group gobbles up everything it can and never gives it back, and the rest of the pattern then fails. Let's see what's going on with a rule.

[screenshot: Grammar::Tracer output for the rule grammar]

It does backtrack in this case, which is surprising, since, according to the definition, it should not (and whitespace should be significant).

What can you do? It's probably better to take backtracking into account when dividing up the host.

use Grammar::Tracer;

grammar Email {
  token TOP { <name> '@' <host> }  
  token name { \w+ ['.' \w+]* }
  token host { [\w+] ** 2..* % '.' }
}
say Email.parse('foo.bar@baz.example.com');

Here we make sure that we have at least two fragments, divided by a period.

And then you use actions to divide the host into its different parts:

grammar Email {
  token TOP { <name> '@' <host> }  
  token name { \w+ ['.' \w+]* }
  token host { [\w+] ** 2..* % '.' }
}

class Email-Action {
  method TOP ($/) {
    my %email;
    %email<name> = $/<name>.made;
    my @fragments = $/<host>.made.split("\.");
    %email<tld> = @fragments.pop;
    %email<domain> = @fragments.pop;
    %email<subdomain> = @fragments.join(".") if @fragments;
    make %email;
  }
  method name ($/) { make $/ }
  method host ($/) { make $/ }
}
say Email.parse('foo.bar@baz.example.com', actions => Email-Action.new).made;

We pop twice since we know that, at least, we have a TLD and a domain; if there's anything left, it goes to the subdomain. For these inputs:

say Email.parse('foo.bar@baz.example.com', actions => Email-Action.new).made;
say Email.parse('foo@example.com', actions => Email-Action.new).made;
say Email.parse('foo.bar.baz@quux.zuuz.example.com', actions => Email-Action.new).made;

it prints the correct answers:

{domain => example, name => 「foo.bar」, subdomain => baz, tld => com}
{domain => example, name => 「foo」, tld => com}
{domain => example, name => 「foo.bar.baz」, subdomain => quux.zuuz, tld => com}

Grammars are incredibly powerful, but also, with their depth-first search, somewhat difficult to debug and to wrap your head around. But if there's a part that can be deferred to actions, which, besides, gives you a ready-made data structure, why not use it?

I'm aware that this does not really answer your question of why a token behaves differently from a rule, and why a rule behaves as if it were a regex, neither enforcing significant whitespace nor ratcheting. I just don't know. The problem is that, the way you have formulated your grammar, once it has gobbled up the period it's not going to give it back. So either you somehow include the subdomain and the domain in a single token so that it matches, or you need a non-ratcheting environment like regexes (and, well, apparently rules too) to make it work. Take into account that tokens and regexes are very different things. They use the same notation and everything, but their behavior is totally different. I encourage you to use Grammar::Tracer or the grammar testing environment in CommaIDE to check the differences.

jjmerelo
  • _Tokens don't backtrack..._ which means that they ratchet instead. In your colorized graphic you purportedly show a `Rule` that **does** backtrack. Now what I'm trying to hunt down is a reference in the Perl6 foundational documents that says `Rule`s can't. – jubilatious1 Mar 10 '21 at 01:42
  • Found it! _"In particular, there are two special variants for use in grammars: `token` and `rule`. ...the `rule` declarator[-,] [+is used] for declaring non-terminal productions in a grammar. Like a `token`, it also does not backtrack by default."_ https://design.raku.org/S05.html – jubilatious1 Mar 11 '21 at 04:57

As per the Raku docs:

  • Token methods are faster than regex methods and ignore whitespace. Token methods don't backtrack; they give up after the first possible match.
  • Rule methods are the same as token methods except whitespace is not ignored.

Not ignored means they are treated as syntax, rather than matched literally. They actually insert a <.ws>. See sigspace for more information about that.
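
To make that concrete, here's a small demonstration, a sketch based on the default ws definition (token ws { <!ww> \s* }, i.e. "not within a word, then zero or more whitespace characters"):

say 'foo bar' ~~ rule { foo bar };      # 「foo bar」
say 'foobar'  ~~ rule { foo bar };      # Nil -- <!ww> fails between two word characters
say 'foo@bar' ~~ rule { foo '@' bar };  # 「foo@bar」 -- the '@' already provides a word break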

3472948
  • Yeah I know that; my original question also reflects the fact that I understand it doesn't ignore whitespace. But emails don't have whitespace, so this shouldn't work! – Electric Coffee May 28 '20 at 06:25
  • "emails don't have whitespace, this shouldn't work!" You're right it shouldn't work, but it's not because emails don't have whitespace. Raku is carefully designed with the intent of making things that are complicated in other languages become simple. One way to do that is [pedagogical facilitation](https://www.youtube.com/watch?v=ORjyXcLDd9M&t=1m31s). `ws` is an examplar. The phrase "significant whitespace" is technically only about whitespace in the *pattern*, not input. Yes, the default `ws` can match whitespace. But it turns out that *`` can also match where there is **no** whitespace!* – raiph May 28 '20 at 15:04
  • @raiph but if it matches where there's no whitespace, how can you then rely on it? If I want a `rule constraint { NOT NULL | PRIMARY KEY }` I don't want `NOTNULL` to be a valid match – Electric Coffee May 29 '20 at 08:29
  • @ElectricCoffee `ws`'s definition fits well with an average person's natural language intuition. So it's easy to rely on it if your intuition is an average person's intuition, and you're willing to go with your intuitive sense. Or if you're willing to look at the doc definition -- `token { <!ww> \s* }` -- and rely on that. `rule constraint { NOT NULL | PRIMARY KEY }` shouldn't match `NOTNULL` per intuition, and won't per the default `ws` rule. But `rule { foo '@' bar }` *will* match both `foo @ bar` and `foo@bar`. Should `ws` have been named otherwise? Pedagogical facilitators thought not. :) – raiph May 29 '20 at 11:05
  • @raiph "rule constraint { NOT NULL | PRIMARY KEY } shouldn't match NOTNULL per intuition, and won't per the default ws rule. But rule { foo '@' bar } will matchboth foo @ bar and foo@bar" It's not at all obvious how the language would even distinguish between those two scenarios... – Electric Coffee May 30 '20 at 07:46
  • @ElectricCoffee It's not at all obvious *how*. But that's Larry's forte. *It accords with an average person's natural language intuition*. In `NOT NULL` there's a "token" (not `token`!) *break* between the `NOT` and the `NULL`. In `foo@bar` there are likewise "token" breaks between `foo` and `@` and `bar`. Or, if you're willing to look at the doc definition, it's `<!ww> \s*`, where `ww` means "between `\w` characters", and `!ww` means *not* between them, and `\w` goes per Unicode's "word character" property, and `\s` means whitespace characters, again per Unicode, and `\s*` means zero or more. – raiph May 30 '20 at 12:14
  • @raiph the average person's intuition is full of inconsistencies and is generally unreliable though – Electric Coffee May 31 '20 at 15:02
  • @ElectricCoffee I think *part* of the issue here is a sly trick Larry has pulled with `ws`. What does it stand for? I used to think "whitespace". And I think he partly intends to light up those neurons. But I now realize another fit is "word separator", where "word" is a string of characters that match Unicode's definition of a "word character", and `ws` is consistent with the parsing concept of a *token* separator. So a `token` pattern is about matching a single token whereas a `rule` pattern is about matching a string of *separated* "tokens" aka "words". `foo@bar` is 2 "words" / 3 "tokens". – raiph Mar 09 '21 at 22:50
  • @ElectricCoffee Bingo! Now I've figured out the mystery. I'll be completely rewriting my answer, adding text to the bug report to explain it's not a bug, recommending doc changes, etc. It might take me a while (weeks) to do it, but it sure feels wonderful to have solved the riddle, with such a simple outcome, which I'll be delighted to share when I get to it. Your question mortified me, because I thought it might be a behavioural bug rather than just misunderstanding on our part, and it was sooooo basic! I'm so glad you asked it, even if it almost caused me nightmares at the time. :) – raiph Mar 09 '21 at 23:22
  • @raiph I'm just doing my part trying to make things better :) – Electric Coffee Mar 10 '21 at 12:18
  • @ElectricCoffee I think you played your role well! I plan to play mine again, and hope you'll continue to play yours as my new answer unfolds. (I've decided to write a separate new answer, rather than edit my current one. I expect it to have at least two parts, with the first directly addressing your original example; my hope is you will apply the same careful thinking you've shown heretofore, and dialog with me about the first part, in comments under the new answer, until you fully and unreservedly agree with it, and signal that by at least considering switching your acceptance marker to it.) – raiph Mar 10 '21 at 13:47