Shortest match issues

Question

I know the ? operator enables non-greedy mode, but I am running into a problem I can't seem to get around. Consider a string like this:

my $str = '<a>sdkhfdfojABCasjklhd</a><a>klashsdjDEFasl;jjf</a><a>askldhsfGHIasfklhss</a>';

The string has opening and closing tags <a> and </a>, and each element contains a key (ABC, DEF or GHI) surrounded by other random text. I want to replace <a>klashsdjDEFasl;jjf</a> with <b>TEST</b>, for example. However, if I have something like this:

$str =~ s/<a>.*?DEF.*?<\/a>/<b>TEST<\/b>/;

Even with the non-greedy .*?, this does not do what I want. I know why: the <a> in the pattern matches the first <a> in the string, the first .*? then stretches all the way to DEF, and the match ends at the nearest closing </a>. What I want is a way to match the opening <a> and closing </a> that are closest to "DEF". Currently, I get this as the result:

<b>TEST</b><a>askldhsfGHIasfklhss</a>

Whereas I am looking for something that gives this result:

<a>sdkhfdfojABCasjklhd</a><b>TEST</b><a>askldhsfGHIasfklhss</a>

By the way, I am not trying to parse HTML here; I know there are modules for that. I am simply asking how this could be done.

Thanks, Eric Seifert

cjm · Accepted Answer · 2011-04-22T17:55:29.193

$str =~ s/(.*)<a>.*?DEF.*?<\/a>/$1<b>TEST<\/b>/;

The problem is that even with non-greedy matching, Perl is still trying to find the match that starts at the leftmost possible point in the string. Since .*? can match <a> or </a>, that means it will always find the first <a> on the line.

Adding a greedy (.*) at the beginning causes it to find the last possible matching <a> on the line (because .* first grabs the whole line, and then backtracks until a match is found).
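As a minimal sketch of the difference, the script below runs both patterns against the string from the question. The variable names $lazy and $anchored are only illustrative, and the replacement is written as <b>TEST</b>, the output the question asks for.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $str = '<a>sdkhfdfojABCasjklhd</a><a>klashsdjDEFasl;jjf</a><a>askldhsfGHIasfklhss</a>';

# Lazy quantifiers alone: the match still starts at the leftmost <a>,
# so the ABC element is swallowed along with the DEF element.
(my $lazy = $str) =~ s/<a>.*?DEF.*?<\/a>/<b>TEST<\/b>/;

# Greedy (.*) prefix: backtracking from the end of the string makes the
# engine settle on the last <a> from which DEF can still be reached.
(my $anchored = $str) =~ s/(.*)<a>.*?DEF.*?<\/a>/$1<b>TEST<\/b>/;

print "$lazy\n";      # <b>TEST</b><a>askldhsfGHIasfklhss</a>
print "$anchored\n";  # <a>sdkhfdfojABCasjklhd</a><b>TEST</b><a>askldhsfGHIasfklhss</a>
```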

One caveat: Because it finds the rightmost match first, you can't use this technique with the /g modifier. Any additional matches would be inside $1, and /g resumes the search where the previous match ended, so it won't find them. Instead, you'd have to use a loop like:

1 while $str =~ s/(.*)<a>.*?DEF.*?<\/a>/$1<b>TEST<\/b>/;
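To see the caveat in action, here is a small sketch that compares /g with the loop. The input string is made up for illustration (two DEF elements with an ABC element between them), and the counter $passes is an addition that only shows how many times the loop body ran.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input: two DEF elements with an ABC element between them.
my $input = '<a>xDEFx</a><a>yABCy</a><a>zDEFz</a>';

# With /g, the single match runs from the start of the string to the end
# of the last DEF element, so only that last element is replaced.
(my $with_g = $input) =~ s/(.*)<a>.*?DEF.*?<\/a>/$1<b>TEST<\/b>/g;

# The loop retries from the start on every pass, rewriting the rightmost
# remaining DEF element until the substitution no longer matches.
my $looped = $input;
my $passes = 0;
$passes++ while $looped =~ s/(.*)<a>.*?DEF.*?<\/a>/$1<b>TEST<\/b>/;

print "with /g: $with_g\n";                    # <a>xDEFx</a><a>yABCy</a><b>TEST</b>
print "loop ($passes passes): $looped\n";      # <b>TEST</b><a>yABCy</a><b>TEST</b>
```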

score 2 · Answer 2 · edited Apr 22 '11 at 17:13

Instead of a dot, which says "match any character", use what you really need, which says "match any character that is not the start of </a>". This translates into something like this:

$str =~ s/<a>(?:(?!<\/a>).)*DEF(?:(?!<\/a>).)*<\/a>/<b>TEST<\/b>/;
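A runnable sketch of this approach on the question's string follows. Factoring the repeated piece into a qr// object named $not_close is my own choice for readability, not part of the answer. The pattern is otherwise the same.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $str = '<a>sdkhfdfojABCasjklhd</a><a>klashsdjDEFasl;jjf</a><a>askldhsfGHIasfklhss</a>';

# One character, provided a closing </a> tag does not start at it.
my $not_close = qr{(?:(?!</a>).)};

# Neither run of characters can cross a </a>, so a match that begins at
# the first <a> can never reach DEF and the engine moves on to the next.
(my $result = $str) =~ s{<a>$not_close*DEF$not_close*</a>}{<b>TEST</b>};

print "$result\n";  # <a>sdkhfdfojABCasjklhd</a><b>TEST</b><a>askldhsfGHIasfklhss</a>
```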

edited Apr 22 '11 at 17:13 by ysth · answered Apr 22 '11 at 17:10 by ridgerunner

score 0 · Answer 3 · answered Jul 24 '18 at 01:41

Based on my understanding, this is what you are looking for.

Using the lazy quantifier ? without the global flag is the answer.

E.g.,

With the global flag /g, it would instead have matched every shortest match, as below.

answered Jul 24 '18 at 01:41 by Uddhav P. Gautam

score 0 · Answer 4 · answered Apr 22 '11 at 17:15

#!/usr/bin/perl
use warnings;
use strict;

my $str = '<a>sdkhfdfojABCasjklhd</a><a>klashsdjDEFasl;jjf</a><a>askldhsfGHIasfklhss</a>';

my @collections = $str =~ /<a>.*?(ABC|DEF|GHI).*?<\/a>/g;

print join ", ", @collections;

answered Apr 22 '11 at 17:15 by SymKat

All you did was change the regex so it matches every occurrence of `...` in the string. That doesn't solve the original problem, which is to match only one of those groups. – cjm Apr 22 '11 at 17:46

score 0 · Answer 5 · answered Apr 22 '11 at 19:19

s{
   <a>
   (?: (?! </a> ) . )*
   DEF   
   (?: (?! </a> ) . )*
   </a>
}{<b>TEST</b>}x;

Basically,

(?: (?! PAT ) . )

is the equivalent of

[^CHARS]

for regex patterns instead of characters.
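Because this pattern finds matches from left to right, it can be combined with /g. The sketch below assumes a made-up input with two DEF elements. The names $result and $count are illustrative only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input: two DEF elements with an ABC element between them.
my $str = '<a>xDEFx</a><a>yABCy</a><a>zDEFz</a>';

my $result = $str;
# In scalar context s///g returns the number of substitutions made.
my $count = $result =~ s{
    <a>
    (?: (?! </a> ) . )*   # characters up to, but never across, a closing tag
    DEF
    (?: (?! </a> ) . )*
    </a>
}{<b>TEST</b>}xg;

print "$count replaced: $result\n";  # 2 replaced: <b>TEST</b><a>yABCy</a><b>TEST</b>
```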

answered Apr 22 '11 at 19:19 by ikegami
