
How can I use capturing groups inside a lookahead assertion?

This code:

say "ab" ~~ m/(a) <?before (b) > /;

returns:

「a」
 0 => 「a」

But I was expecting to also capture 'b'.

Is there a way to do so?

I don't want to leave 'b' outside of the lookahead because I don't want 'b' to be part of the match.

Is there a way to capture 'b' but still leave it outside of the match?

NOTE:

I tried to use Raku's capture markers, as in:

say "ab" ~~ m/<((a))> (b) /;

「a」
 0 => 「a」
 1 => 「b」

But this does not work as I expect: even though 'b' is left outside the match, the regex has still consumed 'b', and I don't want it to be consumed.

For example:

say 'abab' ~~ m:g/(a)<?before b>|b/;

(「a」
    0 => 「a」
 「b」 
 「a」
    0 => 「a」
 「b」)

# Four matches (what I want)
 

say 'abab' ~~ m:g/<((a))>b|b/;

(「a」
    0 => 「a」 
 「a」
    0 => 「a」)

# Two matches
Julio
  • "Is there a way to capture 'b' but still leave it outside of the match?" -- The basic question you seem to be presenting is whether you can capture without matching. AFAIK Raku's (and Perl's) regex systems are designed to match with an optional capture, not the other way around. But see Jonathan's answer for advanced coding. – jubilatious1 Nov 19 '20 at 17:28
  • For readers at home, it's more common to use capture markers `<(` and `)>` without nesting, for example `<(a)>` rather than `<((a))>`; see: https://docs.raku.org/language/regexes#Capture_markers:_%3C(_)%3E – jubilatious1 Nov 19 '20 at 18:38
  • Using the most recent Rakudo_2020.10 (built from source), I'm seeing a different result for Julio's third codeblock example above, see: https://gist.github.com/jubilatious1/e4da45c3020f3c8c745c2c4325e33c6f – jubilatious1 Nov 19 '20 at 19:24
  • @jubilatious1 I believe the results are the same. I got the same results as yours; I just added a newline after the content of each group. I believe it should be displayed like that, but for some reason the next line is appended to the previous one. – Julio Nov 19 '20 at 23:56
  • Thank you for the note! Yes, Raku seems to output a 'compact' form of matches, I wonder if there's a routine to automatically expand it? – jubilatious1 Nov 24 '20 at 20:40

1 Answer


Is there a way to do so?

Not really, but sort of. Three things conspire against us in trying to make this happen.

  1. Raku regex captures form trees of matches. Thus `(a(b))` results in one positional capture that contains another positional capture. Why do I mention this? Because the same thing is going on with things like `before`, which take a regex as an argument: the regex passed to `before` gets its own `Match` object.
  2. The `?` implies "do not capture". We may think of dropping it to get `<before (b)>`, and indeed there is a `before` key in the `Match` object now, which sounds promising except...
  3. `before` doesn't actually return what it matched on the inside, but instead a zero-width `Match` object. If it did return the inner match, then forgetting the `?` would stop it from being a lookahead.
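A minimal sketch of points 1 and 3, assuming a reasonably recent Rakudo. The variable names `$tree` and `$m` are only illustrative and do not come from the question:

```raku
# Point 1: nested parentheses produce nested Match objects.
my $tree = "ab" ~~ / (a(b)) /;
say $tree[0].Str;       # ab  (outer positional capture)
say $tree[0][0].Str;    # b   (inner capture, nested inside the outer one)

# Points 2 and 3: without the ?, a named `before` capture is recorded,
# but the overall match still covers only the 'a'.
my $m = "ab" ~~ / (a) <before (b)> /;
say $m.Str;             # a
say $m<before>:exists;  # whether a `before` key was recorded
say $m<before>.Str.chars;  # width of the text captured under `before`
```

Per the points above, the last two lines should report that the key exists and that its text is zero characters wide, which is why the inner `(b)` cannot be reached this way.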

If only we could rescue the `Match` object from inside of the lookahead. Well, we can! We can declare a variable and then store the `$/` from inside of the `before` argument regex in it:

say "ab" ~~ m/(a) :my $lookahead; <?before b {$lookahead = $/}> /;
say $lookahead;

Which gives:

「a」
 0 => 「a」
「b」

This works, although the result is unfortunately not attached like a normal capture. There isn't a way to do that, although we can attach it via `make`:

say "ab" ~~ m/(a) :my $lookahead; <?before (b) {$lookahead = $0}> { make $lookahead } /;
say $/.made;

With the same output, except now it will be reliably attached to each match object coming back from m:g, and so will be robust, even if not beautiful.
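As a hedged sketch, this is how the `make` approach might be applied to the `m:g` example from the question. The `@matches` name, and moving `:my $lookahead;` in front of a `[ ... ]` group so that it sits outside the alternation, are choices made here for illustration and are not part of the code above:

```raku
my @matches = 'abab' ~~ m:g/
    :my $lookahead;                            # declared anew for each match attempt
    [
    | (a) <?before (b) { $lookahead = $0 }>    # inside before, $0 is the inner (b)
      { make $lookahead }                      # attach the looked-ahead text to this match
    | b                                        # plain 'b': nothing is attached
    ]
/;

for @matches -> $match {
    # .made is undefined for matches where make was never called
    say $match.Str, ' => ', ($match.made // '(nothing attached)');
}
```

Because the 'b' is only inspected by the lookahead and never consumed, each 'b' is still available to be matched on its own by the second alternative, so all four matches from the question are kept.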

Jonathan Worthington
  • `"ab" ~~ m/(a) /` works in a recent Rakudo. – raiph Nov 18 '20 at 22:19
  • @raiph Hmm, I'm a little surprised at that...what is `$¢` specified as referring to? – Jonathan Worthington Nov 19 '20 at 00:31
  • Here's [the only roast test I found](https://github.com/Raku/roast/blob/a85a8cfcb1ffe34243578556307a4568dbd2203a/S05-capture/match-object.t#L62). NB I haven't *grep'd* roast. (No desktop atm.) S05 has 12 matches starting [here](https://design.raku.org/S05.html#line_983). – raiph Nov 19 '20 at 14:46
  • @jnthn Am I right that `$¢` was mostly about maintaining a distinction between `Cursor` and `Match`, and that that was mostly about Raku's advanced capabilities and performance, eg read-only vs read-write capabilities related to parse state, and updates/publication of `$0` etc, and that @Larry ended up deciding to empty `Match`'s supertype `Cursor`, rendering `$¢` vestigial and (almost, but not quite) redundant? – raiph Nov 19 '20 at 14:52
  • More notes. From an SO answer: ["\[`$¢` is\] the most recent outermost match"](https://stackoverflow.com/a/61451104/1077672). Doc discussion is on [`Match` page](https://docs.raku.org/type/Match#index-entry-$%C2%A2). – raiph Nov 19 '20 at 14:53
  • @raiph I think one of the big questions should be when the `$¢` gets a new scope. In a grammar, each token basically gets its own `$¢`, so the question is whether a regex passed as an argument (as, in effect, happens with ``) should refer to the main `$¢` (in which case `$¢.make` should work) or should get its own scope (and thus `$¢.make` is as useful as a `$/.make` in a non-captured match). TBH, I'm inclined to think the latter is the correct behavior, but it definitely falls in the realm of potential ambiguity/gotcha territory. – user0721090601 Nov 20 '20 at 20:00