1

I know the ? modifier makes a quantifier non-greedy, but I am running into a problem I can't seem to get around. Consider a string like this:

my $str = '<a>sdkhfdfojABCasjklhd</a><a>klashsdjDEFasl;jjf</a><a>askldhsfGHIasfklhss</a>';

where there are opening and closing tags <a> and </a>, and the keys ABC, DEF and GHI are each surrounded by other random text. I want to replace <a>klashsdjDEFasl;jjf</a> with <b>TEST</b>, for example. However, if I have something like this:

$str =~ s/<a>.*?DEF.*?<\/a>/<b>TEST<\/b>/;

Even with the non-greedy .*?, this does not do what I want. I know why: the match starts at the first <a> in the string, runs all the way up to DEF, and then continues to the nearest closing </a>. What I want instead is a way to match the opening <a> and closing </a> closest to "DEF". Currently, I get this as the result:

<b>TEST</b><a>askldhsfGHIasfklhss</a>

Whereas I am looking for something that gives this result:

<a>sdkhfdfojABCasjklhd</a><b>TEST</b><a>askldhsfGHIasfklhss</a>

By the way, I am not trying to parse HTML here, I know there are modules to do this, I am simply asking how this could be done.

Thanks, Eric Seifert

Eric Seifert

5 Answers

6
$str =~ s/(.*)<a>.*?DEF.*?<\/a>/$1<b>TEST<\/b>/;

The problem is that even with non-greedy matching, Perl is still trying to find the match that starts at the leftmost possible point in the string. Since .*? can match <a> or </a>, that means it will always find the first <a> on the line.

Adding a greedy (.*) at the beginning causes it to find the last possible matching <a> on the line (because .* first grabs the whole line, and then backtracks until a match is found).
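To make the difference concrete, here is a minimal sketch that runs both substitutions on copies of the string from the question. The variable names $lazy and $anchored are only for illustration.

```perl
use strict;
use warnings;

my $str = '<a>sdkhfdfojABCasjklhd</a><a>klashsdjDEFasl;jjf</a><a>askldhsfGHIasfklhss</a>';

# Lazy quantifiers only: the match still starts at the leftmost <a>.
(my $lazy = $str) =~ s/<a>.*?DEF.*?<\/a>/<b>TEST<\/b>/;

# Greedy prefix: (.*) grabs the whole string, then backtracks to the
# last <a> from which DEF can still be reached.
(my $anchored = $str) =~ s/(.*)<a>.*?DEF.*?<\/a>/$1<b>TEST<\/b>/;

print "$lazy\n";      # <b>TEST</b><a>askldhsfGHIasfklhss</a>
print "$anchored\n";  # <a>sdkhfdfojABCasjklhd</a><b>TEST</b><a>askldhsfGHIasfklhss</a>
```

Note that . does not match a newline by default, so on multi-line input the greedy prefix would probably need the /s modifier.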

One caveat: Because it finds the rightmost match first, you can't use this technique with the /g modifier. Any additional matches would be inside $1, and /g resumes the search where the previous match ended, so it won't find them. Instead, you'd have to use a loop like:

1 while $str =~ s/(.*)<a>.*?DEF.*?<\/a>/$1<b>TEST<\/b>/;
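As a rough illustration of the loop, the sketch below uses a made-up input in which two elements contain DEF, and counts the passes. The input string and the $passes counter are assumptions for the example, not part of the question.

```perl
use strict;
use warnings;

# Hypothetical input: two of the four elements contain DEF.
my $str = '<a>xDEFx</a><a>ABC</a><a>yDEFy</a><a>GHI</a>';

# Each pass replaces the rightmost remaining DEF element;
# the loop ends when the substitution no longer matches.
my $passes = 0;
$passes++ while $str =~ s/(.*)<a>.*?DEF.*?<\/a>/$1<b>TEST<\/b>/;

print "$passes passes: $str\n";
# 2 passes: <b>TEST</b><a>ABC</a><b>TEST</b><a>GHI</a>
```

The loop only terminates because the replacement text itself cannot match the pattern again; a replacement that contained <a>...DEF...</a> would loop forever.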
cjm
2

Instead of a dot which says: "match any character", use what you really need which says: "match any char that is not the start of </a>". This translates into something like this:

$str =~ s/<a>(?:(?!<\/a>).)*DEF(?:(?!<\/a>).)*<\/a>/<b>TEST<\/b>/;
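Here is a small sketch of that pattern in use. Because the match can never cross a </a>, it should also work with /g; the second input string is made up to show that. The names $no_close, $single, $multi and $count are only for illustration.

```perl
use strict;
use warnings;

my $str = '<a>sdkhfdfojABCasjklhd</a><a>klashsdjDEFasl;jjf</a><a>askldhsfGHIasfklhss</a>';

# One step of "any character that does not begin a closing </a>".
my $no_close = qr/(?:(?!<\/a>).)/;

# Single replacement on the string from the question.
(my $single = $str) =~ s/<a>$no_close*DEF$no_close*<\/a>/<b>TEST<\/b>/;

# Hypothetical input with two DEF elements, replaced in one pass with /g.
my $multi = '<a>xDEFx</a><a>ABC</a><a>yDEFy</a>';
my $count = ($multi =~ s/<a>$no_close*DEF$no_close*<\/a>/<b>TEST<\/b>/g);

print "$single\n";         # <a>sdkhfdfojABCasjklhd</a><b>TEST</b><a>askldhsfGHIasfklhss</a>
print "$count: $multi\n";  # 2: <b>TEST</b><a>ABC</a><b>TEST</b>
```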
ysth
ridgerunner
0

Based on my understanding, this is what you are looking for.

Using a lazy quantifier (?) without the global flag is the answer.

[Screenshot of the example match not preserved.]

With the global flag /g, it would instead match every shortest-possible match in the string.

[Screenshot of the /g example not preserved.]

Uddhav P. Gautam
0
#!/usr/bin/perl
use warnings;
use strict;

my $str = '<a>sdkhfdfojABCasjklhd</a><a>klashsdjDEFasl;jjf</a><a>askldhsfGHIasfklhss</a>';

my @collections = $str =~ /<a>.*?(ABC|DEF|GHI).*?<\/a>/g;

print join ", ", @collections;
SymKat
  • All you did was change the regex so it matches every occurrence of `...` in the string. That doesn't solve the original problem, which is to match only one of those groups. – cjm Apr 22 '11 at 17:46
0
s{
   <a>
   (?: (?! </a> ) . )*
   DEF   
   (?: (?! </a> ) . )*
   </a>
}{<b>TEST</b>}x;

Basically,

(?: (?! PAT ) . )

is the equivalent of

[^CHARS]

for regex patterns instead of characters.
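A quick way to see the analogy is to compare the two forms on a single-character stop condition, then extend the lookahead to a multi-character one. The sample string and variable names below are made up for the sketch.

```perl
use strict;
use warnings;

my $str = 'aaa<b>bbb</a>ccc';

# Character class: stop at a single forbidden character.
my ($by_class) = $str =~ /^([^<]*)/;

# The same thing written as a negative lookahead plus a dot.
my ($by_lookahead) = $str =~ /^((?:(?!<).)*)/;

# The lookahead generalizes to a multi-character stop pattern:
# stop only at </a>, not at every <.
my ($up_to_close) = $str =~ /^((?:(?!<\/a>).)*)/;

print "$by_class\n";      # aaa
print "$by_lookahead\n";  # aaa
print "$up_to_close\n";   # aaa<b>bbb
```

One difference to keep in mind: [^<] matches a newline, while . does not unless the /s modifier is used.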

ikegami