
I'm new to all of this, so please bear with me. I'm trying to find every

<div class="name"><a href="/v/name/idlike123123ksajdfk">name</a></div>

on a web page. I want to capture the /v/name/idlike123123ksajdfk part, given that the

<div class="name"><a href="/v/

part is fixed. So I wrote this regular expression (it may make you laugh):

~m#<div class="name"><a href="(/v/.*?)">#

It would be very helpful if you could correct my code.

Ivan Wang
    Regexing HTML is covered here: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – beresfordt May 18 '12 at 11:36

4 Answers


Using a robust HTML parser (see http://htmlparsing.com/ for why):

use strictures;
use Web::Query qw();
my $w = Web::Query->new_from_html(<<'HTML');
<div class="name"><a href="/v/name/idlike123123ksajdfk">name</a></div>
<div class="name"><a href="/v/name/idlike123123ksajdfk">name</a></div>
<div class="name"><a href="/v/name/idlike123123ksajdfk">name</a></div>
<div class="name"><a href="/v/name/idlike123123ksajdfk">name</a></div>
<div class="name"><a href="/v/name/idlike123123ksajdfk">name</a></div>
HTML

my @v_links = $w->find('div.name > a[href^="/v/"]')->attr('href');
daxim
  • You should improve your answer a little to ensure that `@links` contains only links starting with `/v/`, as stated in the OP's post. – Ωmega May 18 '12 at 11:51

There are plenty of Perl modules that extract links from HTML. WWW::Mechanize, Mojo::DOM, HTML::LinkExtor, and HTML::SimpleLinkExtor can do it.
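
As a rough sketch of what this can look like with one of those modules, here is HTML::LinkExtor used without a callback, so that it collects links internally. The sample HTML and the variable names are made up for illustration. Note that HTML::LinkExtor only sees tags and attributes, so this filters on the `/v/` prefix and does not check the enclosing `div class="name"`:

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical sample input; in practice this would be the fetched page.
my $html = <<'HTML';
<div class="name"><a href="/v/alice/id111">alice</a></div>
<div class="other"><a href="/x/skip/me">skip</a></div>
<div class="name"><a href="/v/bob/id222">bob</a></div>
HTML

# With no callback, the parser stores every link it finds.
my $parser = HTML::LinkExtor->new;
$parser->parse($html);
$parser->eof;

# Each stored link is an array ref: [ $tag, $attr_name => $url, ... ].
my @v_links;
for my $link ($parser->links) {
    my ($tag, %attr) = @$link;
    next unless $tag eq 'a' && defined $attr{href};
    push @v_links, $attr{href} if $attr{href} =~ m{^/v/};
}

print "$_\n" for @v_links;
```

If the surrounding `div` matters, a module with CSS selector support is probably a better fit.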

brian d foy

Web scraping with Mojolicious is probably the simplest way to do it in Perl nowadays:

http://mojolicio.us/perldoc/Mojolicious/Guides/Cookbook#Web_scraping
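
For the markup in the question, a minimal sketch using Mojo::DOM (the HTML parser that ships with Mojolicious) might look like the following. The sample HTML and variable names are assumptions for illustration; a real script would fetch the page first, for example with Mojo::UserAgent:

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical sample input standing in for the downloaded page.
my $html = <<'HTML';
<div class="name"><a href="/v/alice/id111">alice</a></div>
<div class="other"><a href="/x/skip/me">skip</a></div>
<div class="name"><a href="/v/bob/id222">bob</a></div>
HTML

my $dom = Mojo::DOM->new($html);

# CSS selector: <a> elements that are direct children of div.name
# and whose href attribute starts with "/v/".
my @v_links = $dom->find('div.name > a[href^="/v/"]')
                  ->map(attr => 'href')
                  ->each;

print "$_\n" for @v_links;
```

The selector carries both conditions from the question (the `div` class and the `/v/` prefix), so no regex is needed.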

alexsergeyev

You should not use a regex to parse HTML, as there are many libraries built for that job.

daxim's answer is a good example.


However, if you want to use a regex anyway and your text is assigned to $_, then

my @list = m{<div class="name"><a href="(/v/.*?)">}g;

will give you a list of all matches.
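
A self-contained sketch of this approach, using only core Perl, is below. The sample HTML and the `$html` variable are made up for illustration, and the match is bound to `$html` with `=~` instead of relying on `$_`:

```perl
use strict;
use warnings;

# Hypothetical sample input; in practice this would be the page source.
my $html = <<'HTML';
<div class="name"><a href="/v/alice/id111">alice</a></div>
<div class="other"><a href="/x/skip/me">skip</a></div>
<div class="name"><a href="/v/bob/id222">bob</a></div>
HTML

# In list context, a match with the /g flag returns the captured
# group from every match, not just the first one.
my @list = $html =~ m{<div class="name"><a href="(/v/.*?)">}g;

print "$_\n" for @list;
```

This depends on the markup matching character for character (same attribute order, quoting, and spacing), which is the main reason the parser-based answers are more robust.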

Ωmega