
I'm improving an old script of mine that downloads wallpapers. I need to know how many pages of wallpapers a category has. Each link has the page number as its text, i.e.:

<a href="/planes-desktop-wallpapers/page/8">8</a>
<a href="/planes-desktop-wallpapers/page/9">9</a>
<a href="/planes-desktop-wallpapers/page/10">10</a>

So I need to capture the number ten, but I'm not well versed in regex. How can I retrieve the number of pages in this case?

Thanks in advance!

XVirtusX

2 Answers

5

You do not want to parse HTML with regular expressions. Sooner or later a regex will hand you wrong data in a case like this. You'll be far better off using a module to do this for you.

In this example we are using HTML::TreeBuilder and List::Util. If you want the highest page in each category, another way to do this is to use HTML::TreeBuilder::XPath to query the links in specific sections.

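use strict;
use warnings;
use HTML::TreeBuilder;
use List::Util qw( max );
use Data::Dumper;

# qq{} so that \n is a real newline and the inner double quotes need no escaping
my $data
   = qq{<a href="/planes-desktop-wallpapers/page/8">8</a>\n}
   . qq{<a href="/planes-desktop-wallpapers/page/9">9</a>\n}
   . qq{<a href="/planes-desktop-wallpapers/page/10">10</a>}
   ;

my $tr = HTML::TreeBuilder->new_from_content($data);

# Keep only links whose text is a page number, then find the highest number.
# Calling max() on the elements themselves would compare object addresses,
# not page numbers.
my @links = grep { $_->as_text =~ /^\d+$/ } $tr->look_down( _tag => 'a' );
my $last  = max map { $_->as_text } @links;

my @vals =
     map  { [ $_->attr('href'), $_->as_text ] }
     grep { $_->as_text == $last } @links;

print Dumper \@vals;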

__OUTPUT__
$VAR1 = [
          [
            '/planes-desktop-wallpapers/page/10',
            '10'
          ]
        ];

If you want just the text (the number) instead, take the maximum of the link texts rather than of the elements:

my $last_page = max grep { /^\d+$/ } map { $_->as_text } $tr->look_down( _tag => 'a' );
hwnd
  • One caveat about HTML::TreeBuilder, from its perldoc page: "When you pass a filename to "parse_file", HTML::Parser opens it in binary mode, which means it's interpreted as Latin-1 (ISO-8859-1). If the file is in another encoding, like UTF-8 or UTF-16, this will not do the right thing." See https://metacpan.org/module/HTML::TreeBuilder – shawnhcorey Jul 06 '13 at 12:23
  • I'm connecting to it through WWW::Mechanize. One other thing: could you help me with the regex to find all the links related to the page numbers? I was trying this: $mech->find_all_links(text_regex=> qr/\d+/); Thanks! – XVirtusX Jul 06 '13 at 18:31
  • Perhaps, `my @links = map { $_->text } grep { $_->url =~ qr/planes.*?/ } $mech->find_all_links; print max @links;` – hwnd Jul 06 '13 at 20:17
3

DISCLAIMER: In general, parsing HTML with regex is frowned upon. See:

RegEx match open tags except XHTML self-contained tags

But this looks like a limited, simple case, so if you want to do it with a regex, you can use this:

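my $string = '<a href="/planes-desktop-wallpapers/page/8">8</a>';

# m{...} avoids escaping every slash. Only read $1 if the match succeeded;
# otherwise it still holds the value from an earlier match.
if ( $string =~ m{a href="/planes-desktop-wallpapers/page/(\d+)">\d+</a>} ) {
    my $pageNumber = $1;
    print $pageNumber . "\n";
}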
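
The snippet above handles a single link, while the question asks for the highest page number among several. A minimal sketch of one way to extend it, assuming the whole page is already in one string and every pager link has exactly the shape shown in the question (the variable names `$html`, `@pages` and `$last_page` are only illustrative):

```perl
use strict;
use warnings;
use List::Util qw( max );

# Sample markup copied from the question; real input would be the fetched page.
my $html = join "\n",
    '<a href="/planes-desktop-wallpapers/page/8">8</a>',
    '<a href="/planes-desktop-wallpapers/page/9">9</a>',
    '<a href="/planes-desktop-wallpapers/page/10">10</a>';

# In list context, a match with /g returns every captured page number.
my @pages = $html =~ m{<a href="/planes-desktop-wallpapers/page/(\d+)">\d+</a>}g;

# max() returns undef for an empty list, so handle "no links found" explicitly.
my $last_page = @pages ? max(@pages) : undef;

print defined $last_page ? "$last_page\n" : "no page links found\n";
```

This stays within the regex approach and so inherits its fragility: any change in attribute order, quoting or whitespace in the markup will make the pattern miss links.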
go-oleg
  • 1
    The number `$1, $2, etc..` variables are successful matches of `last match, capture groups, substitution operator` that were applied. – hwnd Jul 06 '13 at 17:56