6

I've been using perl for a decade. But lately I've got confused with using .*? regex.

It does not seem to match the minimum number of characters. Sometimes it gives different results.

For example for this string:aaaaaaaaaaaaaaaaaaaaaaammmmmmmmmmmbaaaaaaaaaaaaaaaaaaaaaab and pattern: a.*?b it matches complete input string in two groups. As per the definition it should have matched the last "ab".

Alan Moore
  • 73,866
  • 12
  • 100
  • 156
AgA
  • 2,078
  • 7
  • 33
  • 62
  • @Kobi - I don't think this is the same question. OP is asking why `.*?` doesn't always match the minimum number of characters, not what the purpose is of the `?`. – Ted Hopp Mar 23 '11 at 07:15
  • @Ted - fair enough, but I think if you understand [how `.*?` works](http://stackoverflow.com/questions/3075130/difference-between-and-for-regex/3075532#3075532) you can easily answer that question. In that case, it may be a generalization of this, by the same user: http://stackoverflow.com/questions/5401571/ – Kobi Mar 23 '11 at 07:17
  • That would not have elicited the answer I'm looking for this question. That's why I've started a new but related thread. – AgA Mar 23 '11 at 07:43

6 Answers6

11

It doesn't cause a.*?b to match the fewest characters possible; it causes .* to match the fewest characters possible. Since it only affects .*, it has no effect on what's already been matched (i.e. by a).

Example shortened to:

#01234
'aaab' =~ /a.*?b/

What happens:

  1. At pos 0, a matches 1 character (a).
  2. At pos 1, .*? matches 0 characters (empty string).
  3. At pos 1, b fails to match. ⇒ backtrack
  4. At pos 1, .*? matches 1 character (a).
  5. At pos 2, b fails to match. ⇒ backtrack
  6. At pos 1, .*? matches 2 characters (aa).
  7. At pos 3, b matches 1 character (b)
  8. Pattern match successful.

As you can see, it tried to match zero characters, which is obviously the smallest possible match. But the overall pattern failed to match when it did so, so larger and larger matches were tried until the overall pattern matched.


I try to avoid the non-greedy modifier.

'aaab' =~ /a[^a]*b/

If a is really something more complex, then one can use a negative lookahead.

'aaab' =~ /a(?:(?!a).)*b/
ikegami
  • 367,544
  • 15
  • 269
  • 518
  • 2
    Agreed on avoiding the non-greedy modifier. Not only is it often more efficient to use a negated character class or negative look-ahead (it avoids backtracking), it also documents your intention more clearly for future maintenance programmers. – Dave Sherohman Mar 23 '11 at 09:54
  • what does backtrack mean? – Golden Lion May 09 '22 at 16:17
  • @Golden Lion, It's pretty well explained in the [docs](https://perldoc.perl.org/perlre#Backtracking). Basically, undoing things tried, in order to attempt something different. – ikegami May 09 '22 at 16:23
  • I am trying to find a list in a string and a list of a list in a string. How can I do that with backtrack. [[1.5, 1.5], [2.1, 2.1]] , [1, 2] – Golden Lion May 09 '22 at 17:21
9

It means

.   # match any character except newlines
*   # zero or more times
?   # matching as few characters as possible

So in

<tag> text </tag> more text <tag> even more text </tag>

the regex <tag>(.*)</tag> will match the entire string at once, capturing

 text </tag> more text <tag> even more text 

in backreference number 1.

If you match that with <tag>(.*?)</tag> instead, you'll get two matches:

  1. <tag> text </tag>
  2. <tag> even more text </tag>

with only text and even more text being captured in backreference number 1, respectively.

And if (thanks Kobi!) your source text is

<tag> text <tag> nested text </tag> back to first level </tag>

then you'll find out that <tag>(.*)</tag> matches the whole string again, but <tag>(.*?)</tag> will match

<tag> text <tag> nested text </tag>

because the regex engine works from left to right. This is one of the reasons regular expressions are "not the best tool" for matching context-free grammars.

Tim Pietzcker
  • 328,213
  • 58
  • 503
  • 561
  • If I understand correctly, the OP is more interested in `text` - there's a confusion between *lazy* and *optimally minimal*. – Kobi Mar 23 '11 at 07:51
  • Tim thanks, I should have accepted your answer too for being crystal clear but ... we can accept only one answers here. – AgA Mar 23 '11 at 09:59
4

It matches the smallest number of characters, starting from the first position that can match, which allows the rest of the regex to match. The middle part of that (starting from...) is inherent to the way the regex state machine operates. (edited for further clarification)

geekosaur
  • 59,309
  • 11
  • 123
  • 114
  • 1
    This is the correct "theoritical" definition. But does not work extactly this way. For example for this string:aaaaaaaaaaaaaaaaaaaaaaammmmmmmmmmmbaaaaaaaaaaaaaaaaaaaaaab and pattern: a.*?b it matches complete input string in two groups. As per the definition it should have matched the last "ab". – AgA Mar 23 '11 at 06:58
  • 1
    Ok, there's still a part you're missing: in the absence of anything else, it will start at the earliest possible place. (Put another way, it shortens from the end, not from the beginning.) – geekosaur Mar 23 '11 at 07:02
  • 1
    user656848, don't forget that regex engines start matching at the first character if they can, and move forwards. `.*?` isn't a guarantee of _shortest possible match_, but consuming the fewest number of elements _at that position_ that still allows the rest of the regex to match. – sarnold Mar 23 '11 at 07:03
  • 2
    No, regular expressions are applied from left to right. Making it lazy doesn't mean the regex will match the shortest possible string, but "the shortest possible match from the current position in the string". Therefore you get `aaaaaaaaaaaaaaaaaaaaaaaaaab` as your second match in your example. – Tim Pietzcker Mar 23 '11 at 07:03
  • @user656848: Please put this example in your question. It is pertinent information. Simply saying "it doesn't work" casts doubt in newbie's minds on a great tool that can be horribly misunderstood or misused. – Merlyn Morgan-Graham Mar 23 '11 at 07:05
1

It's supposed to match the minimum number of characters necessary for there to be a successful match of the pattern as a whole (if there is a match at all). Can you provide a specific example where it doesn't do this?

Ted Hopp
  • 232,168
  • 48
  • 399
  • 521
  • @user656848 - I hope that the comments by @geekosaur, @sarnold, and @Tim Pietzcker to geekosaur's answer clarify what's going on for you. I agree with @Merlyn that you should edit your original post and add the example. It clarifies what you are asking. – Ted Hopp Mar 23 '11 at 07:19
1

I don't think you can directly match ab in your case. Often when .*? doesn't work, it calls for the [^c]* pattern, where c is a character or character class. This prevents false-positive matches

In this case though, it doesn't work: a[^a]*b matches ammmmmmmmmmmb first. So the only way to find the shortest match is to find all of the matches, then pick the shortest.

Below is a detailed (you said you haven't worked with Perl in a while ;--) way of getting to the result you want:

#!/usr/bin/perl 

use strict;
use warnings;

use List::Util qw(reduce); # see List::Util docs for what reduce does

my $s= "aaaaaaaaaaaaaaaaaaaaaaammmmmmmmmmmbaaaaaaaaaaaaaaaaaaaaaab";

my $RE= qr/a[^a]*b/;

print "regexp: $RE\n";                 # ammmmmmmmmmmb
print "single match:\n";
if( $s=~ m{($RE)}) { print "  $1\n"; } 

print "all matches (loop):\n";         # ammmmmmmmmmmb \n ab
while( $s=~ m{($RE)}g)
  { print "  - $1\n"; }

print "all matches (in an array):\n";  # ammmmmmmmmmmb - ab
my @matches= $s=~ m{(a[^a]*b)}g;
if( @matches) { print "  ", join( " - ", @matches), "\n"; }

print "\nshortest match: ";            # ab
print reduce { length $a < length $b ? $a : $b } @matches;
print "\n";

In short, lazy matching is not the same as getting the shortest match in a string. And getting that shortest match is not a simple problem with the kind of rexegp engine Perl (and I believe most other languages) use.

mirod
  • 15,923
  • 3
  • 45
  • 65
0

Please give a concrete example that we can use to repro the problematic behavior you are experiencing.

You're using the right construct, so maybe there's a problem in the rest of your query. I have had problems with it too, but have always figured out that it was due to my wishful parsing - i.e. I wish that regular expressions parsed the way I meant, rather than the way I typed :)

Merlyn Morgan-Graham
  • 58,163
  • 16
  • 128
  • 183