2

I am facing a problem with a Perl regex. On an img element, I want to match a src attribute whose value starts with /file?id, together with a class and an alt attribute of any value. I want to ignore the rel attribute, which is sometimes present and sometimes absent, as below:

<img rel="lightbox[45451]" src="/file?id=13166" class="bbc_img" alt="myimagess.jpg">    

<img  src="/file?id=13166" class="bbc_img" alt="myimagess.jpg">

My question is how to handle the optional rel attribute.

I am trying this for the rel attribute match:

(?!\s+(rel)="([^"]+)")

It works when there is no rel attribute but fails when the img has a rel attribute.
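A negative lookahead such as (?!...) only asserts that its pattern does not follow at that position and consumes nothing, so it cannot skip over a rel attribute that is present. One way to tolerate an attribute that may or may not be there is an optional non-capturing group. The following is a minimal sketch, not a general HTML solution: it assumes the attributes always appear in the order rel, src, class, alt and are always double-quoted, and the names @tags, $img_re and @matched_src are made up for illustration.

```perl
use strict;
use warnings;

# Sample tags taken from the question, plus one that should not match.
my @tags = (
  '<img rel="lightbox[45451]" src="/file?id=13166" class="bbc_img" alt="myimagess.jpg">',
  '<img  src="/file?id=13166" class="bbc_img" alt="myimagess.jpg">',
  '<img  src="/other" class="bbc_img" alt="myimagess.jpg">',
);

# The (?: ... )? group consumes the rel attribute when it is present
# and matches nothing when it is absent.
my $img_re = qr{
  <img
  (?: \s+ rel="[^"]*" )?          # optional rel attribute
  \s+ src="(/file\?id=[^"]*)"     # capture 1: src value starting with /file?id
  \s+ class="([^"]*)"             # capture 2: any class value
  \s+ alt="([^"]*)"               # capture 3: any alt value
  \s* >
}x;

my @matched_src;
for my $tag (@tags) {
  if ( $tag =~ $img_re ) {
    push @matched_src, $1;
    print "match: src=$1 class=$2 alt=$3\n";
  }
  else {
    print "no match: $tag\n";
  }
}
```

The first two tags match and the third does not. Any change in attribute order or quoting breaks this pattern, which is the fragility an HTML parser avoids.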

Borodin
Laeeq
  • 7
    [Don't do that](http://stackoverflow.com/a/1732454/19068), use a real [HTML parser](https://metacpan.org/module/HTML::TreeBuilder::XPath) instead. – Quentin Jul 19 '13 at 07:49
  • January: You meant (ii) learning how to *not* regex. – innaM Jul 19 '13 at 08:17
  • 1
    @Quentin using a regex to match a known, limited subset of HTML/XML can be fine, depending on the desired level of robustness vs. complexity & performance. Aka it's OK to break the rules when you know why and what the consequences are. – instanceof me Jul 19 '13 at 08:43
  • 1
    @January: Not really. Using a ready-built parser is almost always quicker and less dirty than using regular expressions. And HTML is a poor target for regular expressions, as evinced by this very question, and will mislead the learner. – Borodin Jul 19 '13 at 08:54
  • Exactly. There are specific cases where you will want to use a regex, but XML/HTML isn't one. – Prix Jul 19 '13 at 09:00
  • 1
    @January: I wholeheartedly disagree. A regex cannot possibly be *"more durable"* than a parser written for the domain. In particular, the task is so complex that it needs very comprehensive testing, and can never be confirmed as correct. Learning a language by writing code for data it wasn't intended to process cannot be a good way to go. HTML is not a regular language, and you can never write a complete general HTML-processing solution using regular expressions. You may as well advocate writing client-side scripts in Logo. – Borodin Jul 19 '13 at 09:23
  • 1
    @January: No. Using a regular expression to process an irregular language is a *bad idea*. Please don't evangelise your wrong-headedness. – Borodin Jul 19 '13 at 09:59
  • @January you said it right: 15~20 years ago we were at the point of inventing the wheel. Today we don't need to invent it. It exists, so we can simply deploy the existing wheels instead of going back to the 15~20-year-old process of inventing, testing and finding all the possible issues for it. – Prix Jul 19 '13 at 10:14
  • @Borodin. Yeah, whatever, I'm outta here. – January Jul 19 '13 at 10:44

3 Answers

2

This is trivial to do using a proper HTML parser. This program demonstrates using HTML::TreeBuilder and the look_down method.

It is searching for all elements with:

  • A tag name of 'img'
  • A src attribute that matches the regex qr|^/file\?id=|
  • A class attribute that matches the null regex (i.e. a class attribute with any value)
  • An alt attribute that matches the null regex

You don't say what you want to do with the elements once you've found them. This code just uses as_HTML to display them.

use strict;
use warnings;

use HTML::TreeBuilder;

my $html = HTML::TreeBuilder->new_from_file(\*DATA);
my @images = $html->look_down(
  _tag => 'img',
  src => qr|^/file\?id=|,
  class => qr//,
  alt => qr//
);
print $_->as_HTML, "\n" for @images;

__DATA__
<html>
  <head>
    <title>Page title</title>
  </head>
  <body>
    <img rel="lightbox[45451]" src="/file?id=13166" class="bbc_img" alt="myimagess.jpg">    
    <img  src="/file?id=13166" class="bbc_img" alt="myimagess.jpg">
    <img  src="/file" class="bbc_img" alt="myimagess.jpg"> /* mismatch id="" */
    <img  src="/file?id=13166" alt="myimagess.jpg">        /* no class="" */
    <img  src="/file?id=13166" class="bbc_img">            /* no alt="" */
  </body>
</html>

output

<img alt="myimagess.jpg" class="bbc_img" rel="lightbox[45451]" src="/file?id=13166" />
<img alt="myimagess.jpg" class="bbc_img" src="/file?id=13166" />
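Once the elements are found, their attributes can be read with the attr method of HTML::Element. The following is a minimal sketch of that step, assuming the goal is to pull the numeric id out of each src. The sample markup, the variable names and the /id=(\d+)/ pattern are illustrative assumptions, not part of the question.

```perl
use strict;
use warnings;

use HTML::TreeBuilder;

# Hypothetical sample markup; only the first two images carry src, class and alt.
my $tree = HTML::TreeBuilder->new_from_content(<<'HTML');
<html><body>
<img rel="lightbox[45451]" src="/file?id=13166" class="bbc_img" alt="myimagess.jpg">
<img src="/file?id=13167" class="bbc_img" alt="other.jpg">
<img src="/file?id=13168" class="bbc_img">
</body></html>
HTML

# Same criteria as above: rel is never mentioned, so it is simply ignored.
my @images = $tree->look_down(
  _tag  => 'img',
  src   => qr|^/file\?id=|,
  class => qr//,
  alt   => qr//,
);

for my $img (@images) {
  # attr returns the attribute value, or undef if the attribute is absent.
  my ($id) = $img->attr('src') =~ /id=(\d+)/;
  printf "id=%s alt=%s\n", $id, $img->attr('alt');
}

# Free the tree explicitly once it is no longer needed.
$tree->delete;
```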
Borodin
2

Web::Query wins!

use Web::Query 'wq';
my $html = <<'';
<html>
<img rel="lightbox[45451]" src="/file?id=13166" class="bbc_img" alt="myimagess1.jpg">
<img class="bbc_img" src="/file?id=13167" alt="myimagess2.jpg">
<img src="/file?id=13168" class="bbc_img" >
<img src="/file?id=13169" alt="myimagess3.jpg">
<img  src="/foo" class="bbc_img" alt="myimagess.jpg4">

print for wq($html)->find('img[src^="/file?id="][class][alt]')->attr('src');
__END__
/file?id=13166
/file?id=13167

Learn from this: XPath is more powerful than CSS selectors, but CSS selectors are shorter.

daxim
  • agreed, although I am waiting for the obligatory XSH2 solution to crown the winner ;--) – mirod Jul 19 '13 at 09:57
1

A proper way to do this, using HTML::TreeBuilder::XPath. This will ignore rel and any other attribute, as well as not depend on the order of attributes in the tag.

#!/usr/bin/perl

use strict;
use warnings;

use HTML::TreeBuilder::XPath;
use Test::More tests => 1;

# slurp the whole DATA section by locally undefining the input record separator
my $root= HTML::TreeBuilder::XPath->new_from_content( do { local $/; <DATA> });

# this is the important part 
my @imgs= $root->findnodes( '//img[starts-with( @src,"/file?id=") and @class and @alt]');

# checks the results
my $hits= join ' ', map { "H:" . src_id( $_->{src}) } @imgs;
is( $hits, 'H:13166 H:13167', "one test");

# shows how to access the attributes
foreach my $img (@imgs)
  { warn "hit: src= $img->{src} - class=$img->{class} - alt: $img->{alt} - id= ", src_id( $img->{src}), "\n"; }

exit; 

sub src_id
  { my( $src)= @_;
    return $src=~  m{/file\?id=(.+)$} ? $1 : 'no id'; 
  }

__DATA__
<html>
  <head><title>Test HTML</title></head>
  <body>
    <img rel="lightbox[45451]" src="/file?id=13166" class="bbc_img" alt="myimagess1.jpg">
    <img class="bbc_img" src="/file?id=13167" alt="myimagess2.jpg">
    <img src="/file?id=13168" class="bbc_img" >
    <img src="/file?id=13169" alt="myimagess3.jpg">
    <img  src="/foo" class="bbc_img" alt="myimagess.jpg4">
  </body>
</html>
mirod