Grammar not parsing as expected with negative lookaround assertion

Question

OK, either this is a bug, or I'm using a lookaround assertion completely wrong and am about to look like a total idiot. I don't care about the latter, so here we go.

Got this grammar I'm testing:

our grammar HC2 {
        token TOP { <line>+ }
        token line { [ <header> \n | <not-header> \n ] }
        token header { <header-start> <header-content> }
        token not-header { \N* }
        token header-start { <header-one> }
        token header-one { <[#]> <![#]> } # note this negative lookahead here
        token header-content { \N* }
}

I want to capture a markdown header with just one # sign, no more.

Here is the output from Grammar::Tracer/Debugger:

So it's skipping right over the <header-start> capture. If I remove the <![#]> negative lookahead assertion, I get this:

So is this a bug or am I out to lunch?

As text, first with the <![#]> lookahead in place and then with it removed:

TOP
|  line
|  |  not-header
|  |  * MATCH "# Grandmother's for a Brighter Future"
|  * MATCH "# Grandmother's for a Brighter Future\n"
|  line
|  |  not-header
|  |  * MATCH ""
|  * MATCH "\n"
|  line
|  |  not-header
|  |  * MATCH "# Development site"
|  * MATCH "# Development site\n"
|  line
|  |  not-header
|  |  * MATCH "* The new site is up and running at example.com"
|  * MATCH "* The new site is up and running at example.com\n"
|  line
|  |  not-header


TOP
|  line
|  |  header
|  |  |  header-start
|  |  |  |  header-one
|  |  |  |  * MATCH "#"
|  |  |  * MATCH "#"
|  |  |  header-content
|  |  |  * MATCH " Grandmother's for a Brighter Future"
|  |  * MATCH "# Grandmother's for a Brighter Future"
|  * MATCH "# Grandmother's for a Brighter Future\n"
|  line
|  |  not-header
|  |  * MATCH ""
|  * MATCH "\n"
|  line
|  |  header
|  |  |  header-start
|  |  |  |  header-one
|  |  |  |  * MATCH "#"
|  |  |  * MATCH "#"

UPDATE: If I modify header-one to:

token header-one { <[#]> <-[#]> }

it matches as expected. However, that does not answer the question as to why the original code does not match.
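One thing worth noting about the update: <![#]> and <-[#]> are not interchangeable. The first is a zero-width assertion that only peeks at the next character, while the second is a negated character class that has to consume one. A minimal sketch of the difference, using a made-up variable $line holding one of the sample lines from the trace (the variable names here are mine, not from the grammar):

```raku
# Sample line taken from the trace output; the variable names are illustrative only.
my $line = '# Development site';

# Zero-width negative lookahead: checks that the next character is not '#',
# but leaves it unconsumed.
my $lookahead = $line ~~ / ^ <[#]> <![#]> /;

# Negated character class: consumes one character that is not '#'.
my $negated = $line ~~ / ^ <[#]> <-[#]> /;

say "[$lookahead]";   # [#]
say "[$negated]";     # [# ]

# On a string that is just '#', the lookahead succeeds at end of input,
# while the negated class fails because there is nothing left to consume.
my $bare-lookahead = so '#' ~~ / ^ <[#]> <![#]> /;
my $bare-negated   = so '#' ~~ / ^ <[#]> <-[#]> /;

say $bare-lookahead;  # True
say $bare-negated;    # False
```

So with the <-[#]> version, header-one swallows the character after the # (here a space), which then no longer ends up in header-content.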

Please post your output as text, rather than images of text. — Chris, Mar 29 '22 at 22:07
OK, seems that my question is a dupe of this question: https://stackoverflow.com/questions/62686065/using-after-as-lookbehind-in-a-grammar-in-raku — StevieD, Mar 29 '22 at 23:36
Checkout results of `say m/ 'a' <![a]> \N* | (\N*) / for 'a', 'ab', 'abc'; say m/ 'a' <![a]> \N* || (\N*) / for 'a', 'ab', 'abc';`. [Google for "declarative prefix"](https://www.google.com/search?q=%22declarative+prefix%22). I have to go to sleep now. — raiph, Mar 30 '22 at 00:40
There ARE issues with lookarounds, see: "Lookaround regex and character consumption" https://stackoverflow.com/q/69004383/7270649 — jubilatious1, Mar 31 '22 at 14:20
I've rewritten my regex to get rid of them and do things in a cleaner way. — StevieD, Mar 31 '22 at 14:44

score 5 · Answer 1 · answered Mar 30 '22 at 05:58

OK, so the non-technical answer is that I made a bad assumption that the | character behaves the same way as in Perl. It does not. In Perl, the regex engine attempts to match the pattern on the left-hand side of the | character. If that fails, it moves on to the pattern on the right-hand side.

To get the "old school" Perl behavior, use the || operator, called the "Alternation" operator: https://docs.raku.org/language/regexes#Alternation:_||

The | operator is called the "Longest Alternation" operator. See https://docs.raku.org/language/regexes#Longest_alternation:_|

A more detailed, much more technical discussion of how the "Longest Alternation" operator works is here: https://design.raku.org/S05.html#Longest-token_matching
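As a sketch of the fix, here is the grammar from the question with the single change of | to || in the line token. The name HC2Fixed, the sample input in $text, and the @kinds array are made up for illustration. My reading of the traces, which I have not confirmed against the spec, is that under | the lookahead cuts the declarative prefix of header short after the #, so the \N* in not-header counts as the longer candidate and wins. With || the branches are simply tried left to right:

```raku
# HC2Fixed is a hypothetical name; the only change from the question's
# grammar is || instead of | in the line token.
grammar HC2Fixed {
    token TOP            { <line>+ }
    token line           { [ <header> \n || <not-header> \n ] }
    token header         { <header-start> <header-content> }
    token not-header     { \N* }
    token header-start   { <header-one> }
    token header-one     { <[#]> <![#]> }   # one '#', not followed by another
    token header-content { \N* }
}

# Made-up sample input: a level-one header, a blank line,
# a level-two header, and a plain line.
my $text = "# Title\n\n## Subtitle\nplain text\n";

my $match = HC2Fixed.parse($text);

# Record which branch of the alternation matched each line.
my @kinds = $match<line>.map(-> $line {
    $line<header> ?? 'header' !! 'not-header'
});

say @kinds;   # [header not-header not-header not-header]
```

Only the first line is captured as a header. The "## Subtitle" line fails the lookahead in header-one and falls through to not-header, which is the behavior I was originally after.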

Though I was already aware that || existed from my reading of the docs, I didn't read about it carefully. I mistakenly assumed the Raku core developers would make | behave as it does in Perl and that || was some cool new operator I could learn about later.

Big takeaway: try hard to uncover the basic assumptions you are making and don't assume anything until you've read the docs closely.

Seems like this difference between Perl and Raku should be noted here: https://docs.raku.org/language/5to6-perlop — StevieD, Mar 30 '22 at 06:08
The behavior is buried on this long page of "Traps": https://docs.raku.org/language/traps#|_vs_||:_which_branch_will_win — StevieD, Mar 30 '22 at 06:15
