How to negate/subtract regexes (not only character classes) in Perl 6?

Question

It's possible to make a conjunction, so that the string matches 2 or more regex patterns.

> "banana" ~~ m:g/ . a && b . /
(｢ba｣)

Also, it's possible to negate a character class: if I want to match only consonants, I can take all the letters and subtract character class of vowels:

> "camelia" ~~ m:g/ <.alpha> && <-[aeiou]> /
(｢c｣ ｢m｣ ｢l｣)

But what if I need to negate/subtract not a character class, but a regex of any length? Something like this:

> "banana" ~~ m:g/ . **3 && NOT ban / # doesn't work
(｢ana｣)

A nit that's completely irrelevant to your question but feels worth mentioning... "if I want to match only consonants, I can take all the letters and subtract character class of vowels" `<.alpha>` matches the alphabetic characters of **any** human language, not just those of English. But `<-[aeiou]>` only matches English vowels. See also [How to match Unicode vowels?](https://stackoverflow.com/questions/38792789/how-to-match-unicode-vowels) — raiph, Aug 31 '19 at 15:54
Another comment that's completely irrelevant to your question but also feels worth mentioning... To add/subtract character classes it's more idiomatic to use `+` and `-` eg `"camelia" ~~ m:g/ <+alpha-[aeiou]> / # (｢c｣｢m｣｢l｣)`. — raiph, Aug 31 '19 at 16:17

raiph · Answer 1 · 2019-08-31T16:08:21.407

TL;DR Moritz's answer covers some important issues. This answer focuses on matching sub-strings per Eugene's comment ("I want to find substring(s) that match regex R, but don't match regex A.").

Write an assertion that says you are NOT sitting immediately before the regex you don't want to match and then follow that with the regex you do want to match:

say "banana" ~~ m:g/ <!before ban> . ** 3 / # (｢ana｣)

The `before` assertion is called a "zero width" assertion. This means that if it succeeds (which in this case means it does not "match" because we've written `!before` rather than just `before`), the matching position is not moved.

(Of course, if such an assertion fails and there's no alternative pattern that matches at the current match position, the match engine then steps forward one character position.)
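
As a minimal sketch of that stepping behaviour (the variable name `$m` is only for illustration), a single non-global match shows where the engine ends up once the assertion has failed at position 0:

```raku
# $m is an illustrative name, not part of the original answer.
my $m = "banana" ~~ / <!before ban> . ** 3 /;

say $m.from;  # 1   (position 0 was rejected because "ban" starts there)
say $m.to;    # 4   (the assertion itself consumed nothing)
say ~$m;      # ana
```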

It's possible that you want the patterns in the opposite order, with the positive match first and the negative second, as you showed in your question. (Perhaps the positive match is faster than the negative, so reversing their order will speed up the match.)

One way that will work for fairly simple patterns is using a negative after assertion:

say "banana" ~~ m:g/ . ** 3 <!after ban> / # (｢ana｣)

However, if the negative pattern is sufficiently complex you may need to use this formulation:

say "banana" ~~ m:g/ . ** 3 && <!before ban> .*? / # (｢ana｣)

This inserts a `&&` regex conjunction operator. Presuming the LHS pattern succeeds, it resets the matching position and tries the RHS as well, which is why the RHS now starts with `<!before ban>` rather than `<!after ban>`. It also requires that the RHS matches the same length of input, which is why the `<!before ban>` is followed by the `.*?` "padding".
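
To see why the padding matters, here is a small sketch, assuming `&&` requires both sides to cover the same substring as described above. Without `.*?` the RHS is zero width, so it can never line up with the three characters matched by the LHS:

```raku
# RHS is zero width only: the lengths differ, so the conjunction cannot succeed.
say so "banana" ~~ / . ** 3 && <!before ban> /;       # False

# RHS padded with .*? so it can stretch to the same three characters.
say so "banana" ~~ / . ** 3 && <!before ban> .*? /;   # True
```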

Thanks for the explanation and for the link! Probably I'm wrong, but it seems that `<!before >` isn't the full equivalent, first of all in terms of length (which should be the same for `&&` but doesn't have to be for `<!before >`). Otherwise why do we have `&&` and not simply use `<!before >` instead of it? — Eugene Barsky, Nov 20 '17 at 17:16
@EugeneBarsky `&&` sets up an end-of-match anchor for the "me too" regex on the right and performs best when the regex on the left matches faster than the regex on the right. Sometimes you want that restriction and/or performance (and/or even just the way `&&` reads), in which case `&&` is likely the way to go. It may even be the right way to go if the right hand regex starts with a zero width assertion such as `<!before ...>` followed by what I called "padding". Conversely, in other scenarios, a plain `<!before ...>` followed by another regex may be simpler, faster and/or more readable. — raiph, Nov 20 '17 at 19:41
@EugeneBarsky Thinking of clarifying [the doc](https://docs.perl6.org/language/regexes#Conjunction:_&&)... "I didn't understand that!" Do you mean the substring end-of-match anchor? (Or the left-to-right evaluation order? Or the readability? Or something else?) — raiph, Nov 21 '17 at 16:56
I mean the use of `&&`: "when the regex on the left matches faster than the regex on the right". So your comment helped me, since it not only explains *how* the instrument works, but also *when* it's best to use it. — Eugene Barsky, Nov 21 '17 at 17:49

score 4 · Accepted Answer · answered Nov 20 '17 at 18:52

What does it even mean to "negate" a regex?

When you talk about the computer science definition of a regex, then it always needs to match a whole string. In this scenario, negation is pretty easy to define. But by default, regexes in Perl 6 search, so they don't have to match the whole string. This means you have to be careful to define what you mean by "negate".

If by negation of a regex A you mean a regex that matches whenever A does not match a whole string, and vice versa, you can indeed work with `<!before ...>`, but you need to be careful with anchoring: `/ ^ <!before A $ > .* /` is this exact negation.

If by negation of a regex A you mean "only match if A matches nowhere in the string", you have to use something like `/ ^ [<!before A> .]* $ /`.
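
For a concrete picture of those two definitions, here is a sketch that substitutes the literal `ban` for A. The variable names `$not-whole` and `$nowhere` are made up for illustration:

```raku
# Definition 1: matches exactly when `ban` does NOT match the whole string.
my $not-whole = / ^ <!before ban $ > .* /;
say so "ban"    ~~ $not-whole;   # False  ("ban" is the whole string)
say so "banana" ~~ $not-whole;   # True   ("ban" is only a prefix)

# Definition 2: matches only when `ban` occurs nowhere in the string.
my $nowhere = / ^ [ <!before ban > . ]* $ /;
say so "banana"  ~~ $nowhere;    # False  ("ban" at the start)
say so "cabana"  ~~ $nowhere;    # False  ("ban" in the middle)
say so "camelia" ~~ $nowhere;    # True
```

Note that "banana" gets opposite answers under the two definitions, which is why it helps to pin down which kind of negation is meant.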

If you have another definition of negation in mind, please share it.

Thanks for your explanations! What I mean is: I want to find substring(s) that match regex `R`, but don't match regex `A`. — Eugene Barsky, Nov 20 '17 at 19:12
