Here's a small but functional snippet of Perl code:

my $content = qq{<img src='h};
if ($content =~ m{src=(?!('*)http://)}) {
   print "Match '$1'\n";
}
else {
   print "No match\n";
}

It prints

Match '''

That is, the group ('*) inside the negative lookahead has indeed captured something, and $1 contains '.

However, if I replace the first line with

my $content = qq{<img src='i};

the script prints

Match ''

meaning that the ' has not been captured, even though the entire regex matched.

Can anybody explain what the difference is, and how I can make sure that ' is always captured? (This is of course a simplification of a real case.)

Thanks in advance

Addendum

Now this is the whole story for raina77ow. The idea is to replace the contents of the src attribute in the img tag. The following rules apply:

  1. If the contents start with ', they must end with '.
  2. If the contents start with ", they must end with ".
  3. The contents can be unquoted.
  4. If the contents (after a possible quote) start with http://, they should be left intact; otherwise the last component of the URL (the image file name) must be kept and the preceding part must be replaced with something else.

Originally I wanted to use the following regex (which is practically the same as the one you suggested)

$content =~ s{<\s*img\s+(.*?)src\s*=\s*(["']*)(?!http://).*?([^/"']+)\2(\s+[^>]+)*>}
             {'<img ' . $1 . 'src="' . 'SMTH' . $3 . '"' . $4 . '>'}sgie;

but for some reason it matches the string

[img src='http://qq.com/img.gif' /]

(angle brackets are replaced with square ones here), although it should not, because the ' is followed by http://. Using

$content =~ s{<\s*img\s+(.*?)src\s*=\s*(["'])*(?!http://).*?([^/"']+)\2(\s+[^>]+)*>}
             {'<img ' . $1 . 'src="' . 'SMTH' . $3 . '"' . $4 . '>'}sgie;

is also inappropriate, because in this case \2 will not match the empty string when the contents are unquoted.

Not being able to fix that I decided to look for some workaround. Alas...
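One likely reason for the unexpected match is backtracking: when `(?!http://)` fails right after the quote, `(["']*)` can give the quote back, and the lookahead is then tested at the quote itself, where it succeeds. The sketch below isolates just that part of the pattern; the helper names are made up for illustration, and the possessive quantifiers (`*+`) assume Perl 5.10 or later.

```perl
use strict;
use warnings;

my $tag = q{<img src='http://qq.com/img.gif' />};

# Hypothetical helper: ordinary (backtracking) quantifiers.
# Returns the text captured by the quote group, or undef if there is no match.
sub backtracking_quote {
    my ($content) = @_;
    return $content =~ m{src\s*=\s*(["']*)(?!http://)} ? $1 : undef;
}

# Hypothetical helper: possessive quantifiers never give characters back,
# so the lookahead is only ever tested right after the whitespace and quotes.
sub possessive_quote {
    my ($content) = @_;
    return $content =~ m{src\s*=\s*+(["']*+)(?!http://)} ? $1 : undef;
}

my $loose  = backtracking_quote($tag);
my $strict = possessive_quote($tag);

# The backtracking pattern matches with an empty quote group.
print "backtracking: ", (defined $loose ? "matched, quote group is '$loose'" : "no match"), "\n";

# The possessive pattern refuses to match the absolute URL.
print "possessive:   ", (defined $strict ? "matched, quote group is '$strict'" : "no match"), "\n";
```

A relative URL such as `src='foo/bar.png'` still matches the possessive form with the quote captured, so a later `\2` can keep pairing the closing quote.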

BenMorel
  • First, I would suggest parsing HTML with an HTML parser and not with regexes. The `http://` portion is missing in `$content`, so it will not match. – matthias krull Jun 18 '12 at 10:50
  • I reiterate: use a real HTML parser. Regular expressions are not well suited for dealing with HTML. I recommend [HTML::TreeBuilder::XPath](https://metacpan.org/module/HTML::TreeBuilder::XPath) – Quentin Jun 18 '12 at 10:52
  • Thank you for the advice, but that does not answer the original question. And then, as I pointed out, the regex does match. The problem is that the capturing parentheses do not capture. – user1463382 Jun 18 '12 at 10:53
  • If they answered the original question, they'd be answers rather than comments pointing out that the approach implied by the original question is fundamentally flawed. – Quentin Jun 18 '12 at 11:54
  • I see. As this is my first question at Stack Overflow, I did not know about the difference between comments and answers. Please excuse. – user1463382 Jun 18 '12 at 11:59
  • You are trying to reinvent the wheel. [Regexes are notoriously unsuitable for parsing HTML](http://stackoverflow.com/a/1732454/725418). Use an HTML-parser instead. – TLP Jun 18 '12 at 12:49

2 Answers

Applying the four rules from the question with a robust HTML parser/library:

use strictures;
use URI qw();
use Web::Query qw();
my $w = Web::Query->new_from_html(<<'HTML');
<html><head></head><body>
<img src='http://example.com'>
<img src="http://example.com">
<img src=http://example.com>
<img src='foo/bar/baz.png'>
<img src="foo/bar/baz.png">
<img src=foo/bar/baz.png>
</body></html>
HTML

$w->find('img')->each(sub {
    my (undef, $img) = @_;
    my $u = URI->new($img->attr('src'));
    unless ($u->scheme) {   # skip absolute URIs
        $u->path_segments('SMTH', ($u->path_segments)[-1]);
        $img->attr('src', $u);
    }
});
print $w->html;
daxim

Well, it's quite easy to fix it:

my $content = qq{<img src='h};
if ($content =~ m{src=('*)(?!http://)}) {
   print "Match '$1'\n";
}
else {
   print "No match\n";
}

But explaining the behaviour you described is another story. I think it really is a bug in the Perl regex engine: there is no obvious reason why ('*) should capture differently in the 'h and 'i cases. )
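As a rough check of the rewritten pattern, here is a minimal sketch that runs it against both inputs from the question. The idea is that a group outside `(?!...)` lies on the successful match path, so `$1` is set the same way each time, whereas a group inside a negative lookahead only ever takes part in an attempt that fails. The helper name `quote_before` is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the run of single quotes captured right
# after "src=", or undef when the pattern does not match at all.
sub quote_before {
    my ($content) = @_;
    return $content =~ m{src=('*)(?!http://)} ? $1 : undef;
}

# Both inputs from the question go through the same pattern.
for my $content (qq{<img src='h}, qq{<img src='i}) {
    my $quote = quote_before($content);
    print defined $quote ? "Match '$quote'\n" : "No match\n";
}
```

Both inputs should report the quote as captured. Note that this pattern still matches in front of `'http://` with an empty capture, because `'*` can back off to zero quotes.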

UPDATE: forgive me for submitting to the Cthulhu ways, but this code might do what you asked for:

sub correct { # just an example
  my $orig = shift;
  $orig =~ s/\.gif$/\.jpg/;
  return $orig;
}

my $img = "<img src='http://localhost.com/pic.gif' />";
$img =~ s{
  (< \s* img \s+ src \s* = \s*)
  (["']?)
  ([^ '">]+)
  \2
}{ 
  $1 . $2 . ( substr($3, 0, 7) eq 'http://' ?  $3  : correct $3 ) . $2
}xe;

print $img;

Still, I think those who said it's better to use an HTML parser, any of them, gave the best advice. )

raina77ow
  • For reasons that would lead us too far astray, this solution is not suitable. If you could tell me this other story, I'd be delighted. – user1463382 Jun 18 '12 at 11:02
  • Then please describe why this solution is not suitable, won't you? Anyway, the point was not in the code, but rather the decision to use _two_ 'lookups' instead of a single one. – raina77ow Jun 18 '12 at 11:12
  • OK, here's the whole story. The idea is to replace the contents of the src attribute of an img tag. The rules should be: – user1463382 Jun 18 '12 at 11:20
  • Pardon, the whole story is too long; it does not fit in the comment area. – user1463382 Jun 18 '12 at 11:33
  • Thanks. That would indeed work. And yes, I agree that HTML should be parsed rather than regexed. However I had to make a quick and dirty patch. This one would do. But having the answer why lookahead behaves strangely would also be great. Am I asking for too much? :) – user1463382 Jun 18 '12 at 13:46
  • Well, I'd recommend creating another question specifically for that issue - it's really interesting. I have some suspicions about what causes that bug, but I don't know Perl's internals well enough to prove them. ) – raina77ow Jun 18 '12 at 13:49
  • Actually the question WAS about this issue. Frankly, I don't know how to rephrase it to be more to the point :) – user1463382 Jun 19 '12 at 05:09