How do I extract links from HTML with a Perl regex?

Question

I have a huge HTML file with a lot of content I don't need, but it contains URLs in the following format:

<a href="http://www.retailmenot.com/" class=l

I'm trying to extract the URLs... I tried, to no avail:

open(FILE,"<","HTML.htm") or die "$!";
my @str = <FILE>;

my @matches = grep { m/a href="(.+?") class=l/ } @str

Any idea on how to match this?

@soulSurfer2010 - did you get a specific error? Or just unexpected behaviour (and if so, what)? – martin clayton, Sep 25 '10 at 00:33
Maybe you just have a typo there; it says `"(.+?")` when it should be `"(.+?)"` – NullUserException, Sep 25 '10 at 00:36
I tested it with grep; sometimes it catches the correct ones, and sometimes it's still too greedy – snoofkin, Sep 25 '10 at 00:47
For example, this is a greedy one: a href="http://webcache.googleusercontent.com/search?q=cache:SY0IFA33Tg0J:www.coolsavings.com/+coupons&cd=5&hl=en&ct=clnk" onmousedown="return clk(this.href,'','','','5','','0CCwQIDAE')">Cached - Similar
I tried it with: grep -iP --color=auto 'a href="(.+?)"\sclass=l' FILE.TXT – snoofkin, Sep 25 '10 at 00:47
@soulSurfer2010, please edit your revisions and what you tried (the two comments previous to this one) *into the question* (hit the 'edit' link below the tags). It looks better when formatted properly, and is far easier to read and work with. – David Thomas, Sep 25 '10 at 00:58
Why hasn't anybody linked to this classic: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags ? – Yuji, Sep 25 '10 at 04:21
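
To illustrate the "too greedy" behaviour described in the comments above: a lazy `.+?` still runs past a closing quote if that is the only way to reach `" class=l` later on the same line, whereas `[^"]+` cannot cross a quote. The sketch below is for illustration only and assumes a made-up input line modelled on the samples in the question; it is not a robust way to parse HTML.

```perl
use strict;
use warnings;

# Assumption: one long line holding two anchors, modelled on the question's samples.
my $line = '<a href="http://example.com/cached" onmousedown="x">Cached</a> '
         . '<a href="http://www.retailmenot.com/" class=l>';

# .+? is lazy, but it may still run past a closing quote to reach " class=l.
my ($lazy) = $line =~ /a href="(.+?)" class=l/;

# [^"]+ cannot cross a quote, so each match stays inside one attribute value.
my @urls = $line =~ /a href="([^"]+)" class=l/g;

print "lazy:   $lazy\n";
print "strict: $_\n" for @urls;
```

Note also that `grep { m/.../ } @str` returns the matching lines, not the captured groups; a match in list context (as above) returns the captures.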

brian d foy · Accepted Answer · 2010-09-27T05:08:52.033


Use HTML::SimpleLinkExtor, HTML::LinkExtor, or one of the other link extracting Perl modules. You don't need a regex at all.

Here's a short example. You don't have to subclass. You just have to tell %HTML::Tagset::linkElements which attributes to collect:

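#!perl
use strict;
use warnings;
use HTML::LinkExtor;

# Collect class as well as href for <a> tags (this table is package-global)
$HTML::Tagset::linkElements{'a'} = [ qw( href class ) ];

my $p = HTML::LinkExtor->new;
$p->parse( do { local $/; <> } );   # slurp the whole input in one read
$p->eof;                            # tell the parser the document is complete

# Each link is [ $tag, attr => value, ... ]; keep those with the wanted class
# ('foo' is a placeholder; the class in the question is 'l')
my @links = grep { 
    my( $tag, %hash ) = @$_;
    no warnings 'uninitialized';
    $hash{class} eq 'foo';
    } $p->links;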

If you need to collect URLs for any other tags, you make similar adjustments.
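
As a sketch of what a "similar adjustment" might look like for another tag, the example below applies the same idea to `<img>`, collecting `class` alongside `src`. The sample markup and the class name `thumb` are invented for illustration and do not come from the question.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Assumption: invented sample markup and the class name "thumb".
my $html = <<'HTML';
<img src="/img/a.png" class="thumb">
<img src="/img/b.png">
<img src="/img/c.png" class="thumb">
HTML

# By analogy with the <a> example: report class alongside src for <img> tags.
# This table is package-global, so the change affects every parser in the process.
$HTML::Tagset::linkElements{'img'} = [ qw( src class ) ];

my $p = HTML::LinkExtor->new;
$p->parse( $html );
$p->eof;

# Each entry is [ $tag, attr => value, ... ]; keep <img> tags with the wanted class.
my @srcs = map  { my( $tag, %attr ) = @$_; $attr{src} }
           grep { my( $tag, %attr ) = @$_;
                  $tag eq 'img' && defined $attr{src} && ( $attr{class} // '' ) eq 'thumb' }
           $p->links;

print "$_\n" for @srcs;
```

Because no base URL is passed to `new`, the `src` values come back exactly as written in the markup.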

If you'd rather have a callback routine, that's not so hard either. You can watch the links as the parser runs into them:

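use strict;
use warnings;
use HTML::LinkExtor;

$HTML::Tagset::linkElements{'a'} = [ qw( href class ) ];

my @links;
# Called once per link-bearing tag with ( $tag, attr => value, ... )
my $callback = sub {
    my( $tag, %hash ) = @_;
    no warnings 'uninitialized';
    push @links, $hash{href} if $hash{class} eq 'foo';
    };

my $p = HTML::LinkExtor->new( $callback );
# Reads the HTML from a __DATA__ section (not shown); use <> for files or STDIN
$p->parse( do { local $/; <DATA> } );
$p->eof;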
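
For a version that runs without any input file, here is a minimal sketch of the callback approach applied to the `class=l` case from the question. The sample markup is an assumption modelled on the question's snippets (the link text and the second URL are invented), and the parser is fed a string rather than a filehandle.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Assumption: sample markup modelled on the question; link text is invented.
my $html = <<'HTML';
<a href="http://www.retailmenot.com/" class=l>RetailMeNot</a>
<a href="http://example.com/cached" onmousedown="return clk(this.href)">Cached</a>
<a href="http://www.coolsavings.com/" class=l>CoolSavings</a>
HTML

# Ask the parser to report class alongside href for <a> tags.
$HTML::Tagset::linkElements{'a'} = [ qw( href class ) ];

my @links;
my $p = HTML::LinkExtor->new( sub {
    my( $tag, %attr ) = @_;
    return unless $tag eq 'a';
    # Keep only anchors that have an href and whose class is exactly 'l'.
    push @links, $attr{href}
        if defined $attr{href} && ( $attr{class} // '' ) eq 'l';
} );
$p->parse( $html );
$p->eof;

print "$_\n" for @links;
```

The unquoted `class=l` attribute is handled by the parser, so no special casing is needed for it.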

edited Sep 27 '10 at 05:08

answered Sep 25 '10 at 00:41

brian d foy


Great module, but it seems I don't need all the hrefs, only the hrefs that have 'class=l' after the link... – snoofkin Sep 25 '10 at 00:46

HTML::LinkExtor can help you figure out what other attributes are set. – brian d foy Sep 25 '10 at 00:48
@brian d foy, HTML::LinkExtor only collects attributes that are URLs. It doesn't collect the `class` attribute. You'd have to subclass it to ignore links with the wrong `class`. – cjm Sep 25 '10 at 03:28
Sorry that I didn't have time earlier to produce an example. No need for a subclass. – brian d foy Sep 25 '10 at 04:12

"You don't need a regex at all." And you should not use a regex at all. It has been said that if any phrase ought to be emblazoned on the top of SO, "you cannot use regular expressions to parse XML" is certainly one of them. – Jon Purdy Sep 25 '10 at 06:07
amazing!! amazing!! and again! amazing! Thanks a lot! – snoofkin Sep 25 '10 at 13:21
