
I am using Raku 2020.10.

According to this page, https://docs.raku.org/language/regexes#Longest_alternation:_| , alternations written with "|" or as quoted lists pick the longest match.

> say "youtube" ~~ / < you tube > /
「you」                                   # expected "tube" to win the match
> say "youtube" ~~ /  you | tube  /
「you」                                   # expected "tube" to win the match
> say "youtube" ~~ / tube | you /
「you」                                   # expected "tube" to win the match

Now trying "||" instead of "|":

> say "tubeyou" ~~ / you || tube /
「tube」                                # longest match or first match?
> say "youtube" ~~ / you || tube /
「you」                                 # first match?

Now trying the example from the web page:

> say 'food' ~~ / f | fo | foo | food /
「food」                                                 # works as expected
> say 'foodtubes' ~~ / f | fo | foo | food | tubes /
「food」                                                 # expected "tubes" (5 chars) to win
> say 'foodtubes' ~~ / tubes | f | fo | foo | food /
「food」
> say 'foodtubes' ~~ / dt /
「dt」
> say 'foodtubes' ~~ / dt | food /
「food」
> say 'foodtubes' ~~ / dt | food | tubes /
「food」

It seems like the matching engine with "|" quits after the first somewhat longish successful match. Or what did I do wrong?

Thanks !!!

Elizabeth Mattijsen
lisprogtor
  • I think it is longest alternation that can match from the current point. Given that 'you' matches from the start of the string and 'tube' does not, then 'you' ends up being the longest alternation. – donaldh Feb 15 '21 at 10:14
  • Perhaps you can use an adverb like :overlap or :exhaustive and then choose the longest match you get. https://docs.raku.org/language/regexes#Overlap – donaldh Feb 15 '21 at 10:28
  • I think we've all been confused about this at times. See [Moritz's answer](https://stackoverflow.com/a/50830577/1077672) about it on another of your Qs involving the same issue. Previous SOs (yours and others') have mixed in other aspects, but this time your Q title and content precisely focus on just this specific aspect, and your exposition of the problem is beautifully clear. So now we'll all be likely to find this SO if any newbie or one of us oldies is confused about this particular issue. :) As I wrote this I see @codesections has written an answer; I'm confident they'll clear things up. – raiph Feb 15 '21 at 16:32
  • Thank you very much donaldh and raiph !!! And raiph, you even look up my old questions :-) Even with Moritz's answer, my understanding was incomplete, and I thank you all for leading me to deeper understanding! It is always euphoric to understand more :-) – lisprogtor Feb 15 '21 at 20:38

2 Answers


(This answer builds on what @donaldh already said in a comment).

This is a really good question, because it gets at something that often trips people up about how a regex searches a string: a regex fundamentally searches one character at a time and returns the first match it finds. You can modify this behavior (e.g., look-arounds consider other characters, and several flags make the regex return more than one result). But if you start from a basic understanding of how the regex behaves by default, a lot of these issues become clearer.

So, let's apply that to a slight variant of your example:

> 'youtube' ~~ / you | .. | tube /
「you」

Here's how the regex engine looks at it (in high-level/simplified terms), character by character:

pos:0    youtube
         ^
branch 1 wants 'y'.                Match!
branch 2 wants . (aka, anything).  Match!
branch 3 wants 't'.                No match :(

pos:1    youtube
          ^
branch 1 wants 'o'.                Match!
branch 2 wants .                   Match!
branch 2 completed with a length of 2

pos:2    youtube
           ^
branch 1 wants 'u'.                Match!
branch 1 completed with a length of 3

...all branches completed, and 2 matches found.  Return the longest match found.

「you」

The consequence of this logic is that, as always, the regex returns the first match in the string (or, even more specifically, the match that starts at the earliest position in the string). The behavior of | kicks in when there are multiple matches that start at the same place. When that happens, | means that we get the longest match.

Conversely, with 'youtube' ~~ / you | tube /, we never have multiple matches that start at the same place, so we never need to rely on the behavior of |. (We do have multiple matches in the string, as you can see with a global search: 'youtube' ~~ m:g/ you | tube /.)
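To make the two cases concrete, here is a small sketch (the variable names are only for illustration). In the first match both alternatives can start at position 0, so | has something to choose between. In the second, only one alternative can start at position 0, so the earliest start decides and the longer 'tubes' is never considered.

```raku
# Both 'foo' and 'food' can start at position 0, so | picks the longer one.
my $same-start = 'foodtubes' ~~ / foo | food /;
say $same-start;            # 「food」

# 'tubes' can only start at position 4; 'food' starts at 0 and wins on position.
my $diff-start = 'foodtubes' ~~ / tubes | food /;
say $diff-start;            # 「food」
say $diff-start.from;       # 0
```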

If you want the longest of all matches in the string (rather than the longest option for the first match) then you can do so with something like the following:

('youtube' ~~ m:g/ you | tube /).sort(*.chars).tail
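If you need this more than once, the one-liner above could be wrapped in a small helper. This is only a sketch: `longest-match` is a hypothetical name, not a built-in, and it considers just the non-overlapping matches that `:g` finds (donaldh's comment suggests :overlap or :exhaustive as another route if overlapping candidates matter).

```raku
# Hypothetical helper: the longest of all non-overlapping matches, or Nil if none.
sub longest-match(Str $text, Regex $rx) {
    my @matches = $text.match($rx, :g);   # every match, scanning left to right
    return Nil unless @matches;
    @matches.max(*.chars);                # on a tie, max keeps the earliest match
}

say longest-match('youtube', / you | tube /);                        # 「tube」
say longest-match('foodtubes', / f | fo | foo | food | tubes /);     # 「tubes」
```

Using `max` instead of `sort ... tail` avoids sorting the whole list. The one difference is in ties: `max` returns the first of equally long matches, whereas the sorted version returns the last.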
codesections

This is not a question of longest match.

This is a question of earliest match.

'abcd' ~~ / bcd | . /; # 「a」

Imagine that the above regex is actually surrounded by this:

/^ .*? <([      …      ])> .* $/

So then we have:

/^ .*? <([   bcd | .   ])> .* $/

Note that the first .*? is non-greedy. It prefers not to match anything.

'abcd' ~~ /^ .*? <([  bcd | .  ])> .* $/; # 「a」

It will if it has to, though:

'abcd' ~~ /^ .*? <([  bcd | b  ])> .* $/; # 「bcd」
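Another way to see the same thing is to anchor the alternation at each position yourself. The sketch below uses the `:p` (position) adverb on `Str.match`, which requires the match to start at exactly the given position; the variable names are only for illustration. It shows that 'bcd' is available, but only from position 1, while the ordinary unanchored match already succeeds at position 0.

```raku
my $text = 'abcd';

# Try the alternation anchored at each position in turn.
# :p($pos) requires the match to start exactly at $pos (no scanning ahead).
for 0 ..^ $text.chars -> $pos {
    my $m = $text.match(/ bcd | . /, :p($pos));
    say "anchored at position $pos, result: ", $m ?? $m.Str !! '(no match)';
}

# The ordinary, unanchored match reports the first position that succeeds.
my $first = $text ~~ / bcd | . /;
say $first.from;    # 0
```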
Brad Gilbert