Regular expression puzzler

Question

I have been working with regular expressions for 25+ years, but I don't understand why this regex does not match (using Perl syntax):

"unify" =~ /[iny]{3}/
# as in
perl -e 'print "Match\n" if "unify" =~ /[iny]{3}/'

Can someone help solve that riddle?

Subsequent atoms must match at the position where the previous atom's match ended. That's why `/niy/` doesn't match `nify`, and neither does `/[niy]{3}/`. — ikegami, Jan 14 '22 at 15:18

zdim · Answer 1 · 2022-09-13T16:18:25.633

The quantifier {3} in the pattern [iny]{3} means: match one character from the class (i, n, or y), then another character from the same class, and then a third. Three of them, one right after another. Your string unify doesn't have that; the longest such run it can muster is two, ni.
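
To see the "two at most" concretely, here is a minimal sketch that lists every run of consecutive characters from the class. The variable names are only for illustration.

```perl
use strict;
use warnings;
use List::Util qw(max);

my $str = 'unify';

# Collect every maximal run of consecutive characters from the class [iny].
my @runs = $str =~ /([iny]+)/g;

# Report each run with its length; {3} needs a run of at least 3.
printf "%-3s length %d\n", $_, length($_) for @runs;

# Length of the longest run (0 if there are no runs at all).
my $longest = max(0, map { length($_) } @runs);
print "longest run: $longest\n";

print "[iny]{3} ", ($str =~ /[iny]{3}/ ? "matches" : "does not match"), "\n";
```

For unify the runs are ni and y, so the longest run has length 2 and {3} cannot be satisfied.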

That's been explained in other answers already. What I'd like to add is an answer to a clarification in comments: how to check for these characters appearing 3 times in the string, scattered around at will. Apart from matching that whole substring, as shown already, we can use a lookahead:

(?=[iny].*[iny].*[iny])

This does not "consume" any characters but rather "looks" ahead for the pattern, not advancing the engine from its current position. As such it can be very useful as a subpattern, in combination with other patterns in a larger regex.

A Perl example, to copy-paste on the command line:

perl -wE'say "Match" if "unify" =~ /(?=[iny].*[iny].*[iny])/'
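
A small sketch of both points, assuming we want to accept a whole word only if it contains three of those characters anywhere. Note the leading .* added inside the anchored lookahead: without it the lookahead would require the first [iny] right at the start of the string. The variable names are made up for the example.

```perl
use strict;
use warnings;

my $str = 'unify';

# A bare lookahead succeeds without consuming anything:
# the text of the overall match ($&) is the empty string.
my $bare_matched = $str =~ /(?=[iny].*[iny].*[iny])/;
my $bare_text    = $bare_matched ? $& : undef;

# Combined with a consuming subpattern: the lookahead checks the
# condition from the start of the string, then \w+ consumes the word.
my ($word) = $str =~ /^(?=.*[iny].*[iny].*[iny])(\w+)$/;

printf "bare lookahead matched: %s, consumed '%s'\n",
    ($bare_matched ? 'yes' : 'no'), $bare_text // '';
print "word: ", $word // '(none)', "\n";
```

The lookahead acts only as a condition here; what gets captured is decided entirely by the \w+ that follows it.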

The drawback to this, as well as to consuming the whole such substring, is that all three subpatterns must be spelled out literally. What if the number needs to be decided dynamically? Or if it's twelve? The pattern can be built at runtime, of course. In Perl, one way is

my $pattern = '(?=' . join('.*', ('[iny]')x3) . ')';

and then use that in the regex.
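
For instance, a minimal sketch of building and using such a pattern. The helper name scattered_lookahead and the test words are hypothetical, not from the original code.

```perl
use strict;
use warnings;

# Hypothetical helper: build a lookahead that requires $count characters
# from $class, with anything (or nothing) between them.
sub scattered_lookahead {
    my ($class, $count) = @_;
    return '(?=' . join('.*', ($class) x $count) . ')';
}

my $pattern = scattered_lookahead('[iny]', 3);
my $re      = qr/$pattern/;    # compile once, reuse in every match

for my $word (qw(unify tiny fun)) {
    printf "%-6s %s\n", $word, ($word =~ $re ? 'match' : 'no match');
}
```

Compiling with qr// keeps the interpolation in one place, so the count can come from a command-line option or a config value.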

For the sake of performance, for long strings and many repetitions, make that .* non-greedy

(?=[iny].*?[iny].*?[iny])

(when forming the pattern dynamically join with .*?)

A simple benchmark for illustration (in Perl)

use warnings;
use strict;
use feature 'say';

use Getopt::Long;
use List::Util qw(shuffle);
use Benchmark qw( cmpthese );

# -r N: for how many seconds to run each variant (default 3)
# -n N: repetition factor used to build the test string and pattern (default 2)
my ($runfor, $n) = (3, 2);
GetOptions('r=i' => \$runfor, 'n=i' => \$n);

my $str = 'aa'
    . join('', map { (shuffle 'b'..'t')x$n, 'a' } 1..$n)
    . 'a'x($n+1) 
    . 'zzz'; 
    
my $pat_greedy     = '(?=' . join('.*',  ('a')x$n) . ')';
my $pat_non_greedy = '(?=' . join('.*?', ('a')x$n) . ')';
#my $pat_greedy     = join('.*',  ('a')x$n);  # test straight match,
#my $pat_non_greedy = join('.*?', ('a')x$n);  # not lookahead

sub match_repeated {
    my ($s, $pla) = @_;
    return ( $s =~ /$pla(.*z)/ ) ? "match" : "no match";
}   

cmpthese(-$runfor, {
    greedy     => sub { match_repeated($str, $pat_greedy) },
    non_greedy => sub { match_repeated($str, $pat_non_greedy) },
});

(Shuffling of that string is probably unneeded but I feared optimizations intruding.)

When a string is made with the factor of 20 (program.pl -n 20) the output is

              Rate     greedy non_greedy
greedy      56.3/s         --      -100%
non_greedy 90169/s    159926%         --

So ... the non-greedy version is some 1600 times faster. That test string is 7646 characters long and the pattern to match has 20 subpatterns (a) with .* between them (in the greedy case), so there's a lot going on there. With the default of 2, so for a short string and a simpler pattern, the difference is 10%.

Btw, to test straight-up matches (not using a lookahead) just move those comment signs around the pattern variables. The gap is then nearly twice as large:

               Rate     greedy non_greedy
greedy       56.5/s         --      -100%
non_greedy 171949/s    304117%         --

John Kugelman · Accepted Answer · 2022-01-14T04:31:31.203

The letters n, i, and y aren't all adjacent. There's an f in between them.

/[iny]{3}/ matches any string that contains a substring of three letters taken from the set {i, n, y}. The letters can be in any order; they can even be repeated.

Choosing one of three characters three times, with replacement, gives 3³ = 27 matching substrings:

iii, iin, iiy, ini, inn, iny, iyi, iyn, iyy
nii, nin, niy, nni, nnn, nny, nyi, nyn, nyy
yii, yin, yiy, yni, ynn, yny, yyi, yyn, yyy
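
As a quick check of that count, a short sketch that generates every three-letter string over the set and tests each one against the pattern (the variable names are illustrative only):

```perl
use strict;
use warnings;

my @letters = qw(i n y);

# Build every length-3 string over the letters, in the order listed above.
my @combos;
for my $first (@letters) {
    for my $second (@letters) {
        for my $third (@letters) {
            push @combos, "$first$second$third";
        }
    }
}

# Every generated string should be matched in full by [iny]{3}.
my @matching = grep { /^[iny]{3}$/ } @combos;

printf "%d strings generated, %d match\n", scalar @combos, scalar @matching;
```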

To match non-adjacent letters you can use one of these:

```
[iny].*[iny].*[iny]
```
```
[iny](.*[iny]){2}
```
```
([iny].*){3}
```

(The last option will work fine on its own since your search is unanchored, but might not be suitable as part of a larger regex. The final .* could match more than you intend.)
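
To illustrate that caveat, here is a sketch that prints what each of the three patterns consumes. The test string 'unify regex' is an assumption, chosen so that there is text after the last matching letter.

```perl
use strict;
use warnings;

my $str = 'unify regex';

my @patterns = (
    '[iny].*[iny].*[iny]',
    '[iny](.*[iny]){2}',
    '([iny].*){3}',
);

my %consumed;
for my $pattern (@patterns) {
    # $& holds the text consumed by the last successful match.
    $consumed{$pattern} = ($str =~ /$pattern/) ? $& : undef;
    printf "%-22s consumed '%s'\n", $pattern, $consumed{$pattern} // '';
}
```

The first two stop at the y and consume nify, while the trailing .* in the third runs on to the end of the string and consumes nify regex.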

score 2 · Answer 3 · answered Jan 14 '22 at 02:56

That pattern looks for three consecutive occurrences of the letters i, n, or y. You do not have three consecutive occurrences.

Perhaps you meant to use [inf] or [ify]?

D Stanley


Todd A. Jacobs · Answer 4 · 2022-01-14T03:02:42.683

The {3} quantifier means "exactly three consecutive matches of the preceding element." While all of the letters in your character class are present in the string, they are not consecutive, as they are separated by other characters in your string.

It isn't the order of items in the character class that's at issue. It's the fact that you can't match any combination of the three letters in your character class where exactly three of them are directly adjacent to one another in your example string.

score 2 · Answer 5 · answered Jan 14 '22 at 03:01

The pattern looks for 3 consecutive letters from the character class, so yours should not match:

[iny]{3} //no match
[unf]{3} //no match
[nif]{3} //matches nif
[nify]{3} //matches nif
[ify]{3} //matches ify
[uni]{3} //matches uni
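
A short Perl sketch to reproduce that table, assuming the test string is unify as in the question:

```perl
use strict;
use warnings;

my $str = 'unify';

my %found;
for my $class (qw(iny unf nif nify ify uni)) {
    # Capture the first run of three consecutive characters from the class.
    my ($hit) = $str =~ /([${class}]{3})/;
    $found{$class} = $hit;
    printf "[%s]{3}  %s\n", $class, defined($hit) ? "matches $hit" : 'no match';
}
```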

Hope that helps somewhat :)
