Parse HTML Page For Links With Regex Using Perl

Question

Possible Duplicate:
How can I remove external links from HTML using Perl?

Alright, I'm working on a job for a client right now who just switched up his language choice to Perl. I'm not the best at Perl, but I've done stuff like this before with it, albeit a while ago.

There are lots of links like this:

<a href="/en/subtitles/3586224/death-becomes-her-en" title="subtitlesDeath Becomes Her" onclick="reLink('/en/subtitles/3586224/death-becomes-her-en');" class="bnone">Death Becomes Her
        (1992)</a>

I want to match paths like "/en/subtitles/3586224/death-becomes-her-en" and put them into an array or list (not sure which one's better in Perl). I've been searching the Perl docs, as well as looking at regex tutorials, and most if not all seemed geared towards using ~= to match stuff rather than capture matches.

Thanks,

Cody

Your question is confusing: 1. There is a distinction between lists and arrays in Perl, but it's not the sort of distinction you seem to have in mind. 2. To capture matches, you use =~. Here's another distinction that doesn't exist in Perl. — innaM, Nov 05 '09 at 21:02
dupe of http://stackoverflow.com/questions/1598053/how-can-i-remove-external-links-from-html-using-perl and http://stackoverflow.com/questions/1651276/how-can-i-extract-data-from-html-tables-in-perl — Ether, Nov 05 '09 at 21:04
Thanks, Ether, I couldn't make up my mind about which of the many, many questions to pick. — innaM, Nov 05 '09 at 21:06
Bart, it was over from PHP. Also, guys, I've read the other questions as well as Ether's comments and Sinan's. I have been one of those guys that says "Regex is right for everything!" ever since I got over that learning curve. I'm looking into HTML::Parser right now though, and I should be able to finish this project pretty quickly with it. I'll be able to finish this project today now! :) — codygman, Nov 05 '09 at 21:14

Sinan Ünür · Accepted Answer · 2009-11-05T21:12:00.177

10

Use a proper HTML parser to parse HTML. See this example included with HTML::Parser.

Or, consider the following simple example:

#!/usr/bin/perl

use strict; use warnings;

use HTML::TokeParser::Simple;

my $parser = HTML::TokeParser::Simple->new(\*DATA);

my @hrefs;

while ( my $anchor = $parser->get_tag('a') ) {
    if ( my $href = $anchor->get_attr('href') ) {
        push @hrefs, $href if $href =~ m!/en/subtitles/!;
    }
}

print "$_\n" for @hrefs;

__DATA__
<a href="/en/subtitles/3586224/death-becomes-her-en" title="subtitlesDeath 
Becomes Her" onclick="reLink('/en/subtitles/3586224/death-becomes-her-en');" 
class="bnone">Death Becomes Her
                (1992)</a>

Output:

/en/subtitles/3586224/death-becomes-her-en
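
The HTML::Parser example referred to above is not reproduced here. As a rough sketch only, the same extraction with HTML::Parser's version 3 event API might look something like the following. The sample markup in $html, including the second, non-matching link, is made up for illustration.

```perl
#!/usr/bin/perl

use strict; use warnings;

use HTML::Parser;

# Illustrative input only; the second link is invented to show filtering
my $html = <<'HTML';
<a href="/en/subtitles/3586224/death-becomes-her-en" class="bnone">Death Becomes Her (1992)</a>
<a href="/en/search">Search</a>
<a name="top">No href here</a>
HTML

my @hrefs;

my $parser = HTML::Parser->new(
    api_version => 3,
    # Called for every start tag with the tag name and a hashref of attributes
    start_h => [
        sub {
            my ($tagname, $attr) = @_;
            return unless $tagname eq 'a' and defined $attr->{href};
            push @hrefs, $attr->{href} if $attr->{href} =~ m!/en/subtitles/!;
        },
        'tagname, attr',
    ],
);

$parser->parse($html);
$parser->eof;    # signal end of document so buffered text is flushed

print "$_\n" for @hrefs;
```

Like the tokenizer version, this handles the document as a stream of events and does not build a tree in memory.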

edited Nov 05 '09 at 21:12

answered Nov 05 '09 at 21:03

Sinan Ünür


Metaphysical +1 (I'm out of upvotes). – Chris Lutz Nov 05 '09 at 21:08
Thank you, Chris. Been in that situation many times ;-) – Sinan Ünür Nov 05 '09 at 21:25

score 4 · Answer 2 · answered Nov 05 '09 at 21:08

Don't use regexes. Use an HTML parser like HTML::TreeBuilder.

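use strict; use warnings;

use HTML::TreeBuilder;

my $file_name = shift @ARGV; # assumed: path to the HTML file, given on the command line

my $tree = HTML::TreeBuilder->new; # empty tree
$tree->parse_file($file_name);
$tree->elementify;

# Collect the href of every <a> element, skipping anchors that have none
my @links = grep { defined } map { $_->attr('href') } $tree->look_down( _tag => 'a' );

$tree = $tree->delete; # free the tree explicitly

# Do stuff with links array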

answered Nov 05 '09 at 21:08

daotoad


+1 It works but for files of unknown size, I tend to avoid building the whole document tree. – Sinan Ünür Nov 05 '09 at 21:14
HTML::TreeBuilder has handled all my needs with ease. I've never needed to parse huge HTML files that needed one of the line-by-line type parsers, so I can't just dash such a script off. However, if you've got huge files, you definitely don't want to hold the whole tree in RAM. – daotoad Nov 06 '09 at 07:18

score 0 · Answer 3 · answered Nov 05 '09 at 21:08

URLs like the one in your example can be matched with a regular expression like

($url) = /href=\"([^\"]+)\"/i

If the HTML ever uses single quotes (or no quotes) around a URL, or if there are ever quote characters in the URL, then this will not work quite right. For this reason, you will get some answers telling you not to use regular expressions to parse HTML. Heed them but carry on if you are confident that the input will be well behaved.
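
To collect every match into an array instead of only the first one, one option is to run the match with /g in list context. This is a minimal sketch that assumes double-quoted href attributes, as in the question. The $html sample and its second link are invented for illustration.

```perl
#!/usr/bin/perl

use strict; use warnings;

# Illustrative input only; the second link is invented to show filtering
my $html = <<'HTML';
<a href="/en/subtitles/3586224/death-becomes-her-en" title="subtitlesDeath Becomes Her" class="bnone">Death Becomes Her (1992)</a>
<a href="/en/search">Search</a>
HTML

# In list context, a /g match returns the captured group of every match
my @hrefs = $html =~ /href="([^"]+)"/gi;

# Keep only the subtitle paths
my @subtitles = grep { m!^/en/subtitles/! } @hrefs;

print "$_\n" for @subtitles;
```

The same caveats apply: single-quoted or unquoted attributes, or quote characters inside a URL, will not be handled by this pattern.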
