0

I have a series of web pages that I would like to navigate with a script, grabbing the contents of each one. I know the link I need is the 18th link on every page. As a test, I have the following code, which follows the link once and screen-scrapes the result:

use strict;
use WWW::Mechanize;

my $start = "http://*some-webpage*";

my $mech = WWW::Mechanize->new( autocheck => 1 );
$mech->get( $start );
open(Output, ">mech_test.txt") or die $!;
$mech->follow_link(url_regex => qr//,  n => 18 );
print Output $mech->response()->content();
close(Output);

Unfortunately, the link I am trying to follow has nothing in its href attribute. In the page source, the link looks like this:

<a href="" onclick="return _doClick('CA256D6E001A7020.80376e858b0791b1ca256d7300098304/$Body/0.155A', this, null)">Next &gt;&gt;</a>

I believe this is JavaScript, and that WWW::Mechanize has no way to follow this link. Any ideas for getting around this?

user1249684
    This is a very frequently repeated question. http://stackoverflow.com/search?q=%5Bperl%5D+mechanize+%5Bjavascript%5D http://stackoverflow.com/questions/4767562/is-there-a-way-to-execute-javascript-in-perl http://stackoverflow.com/questions/3769015/how-can-i-handle-javascript-in-a-perl-web-crawler http://stackoverflow.com/questions/6683611/tricking-browser-into-calling-javascript-events – daxim Mar 22 '12 at 11:43

2 Answers

1

You should use the WWW::Scripter module, a subclass of WWW::Mechanize that uses the W3C DOM and provides support for scripting.

Ωmega
-2

This can be done in pure Perl if the JavaScript is simple enough.

You have to find the JavaScript function (here, _doClick), and if it is reasonably simple, you can reproduce its logic as a Perl sub.

Then you can build the links yourself.

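# Capture the argument list of every _doClick(...) call in the page source.
my $html = $mech->content;
my @javascript_links = $html =~ m#return\s+_doClick\((.*?)\)#gis;
# Each element looks like:
#   'CA256D6E001A7020.80376e858b0791b1ca256d7300098304/$Body/0.155A', this, null
my @links = map { extract_link($_) } @javascript_links;
foreach my $link (@links) {
    $mech->get($link);
}

# Build a URL from the argument list of one _doClick call.
sub extract_link {
    my $line = shift;
    my @params = split /\s*,\s*/, $line;
    s/^\s*'?|'?\s*$//g for @params;    # strip whitespace and surrounding quotes
    # Mimic the JS logic here, whatever it is; the path below is only a placeholder.
    my $link = "/some/path/here/to/add/some.php?someparam1=val1&param=$params[0]";
    return $link;
}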
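
To check the extraction step on its own, without touching the network, here is a minimal sketch that runs against the link markup quoted in the question. The helper name doclick_ids is hypothetical, and the sketch assumes that the first argument of _doClick is always a single-quoted string. What the site's _doClick actually does with that value is not shown in the question, so turning it into a URL is still left to you.

```perl
use strict;
use warnings;

# Sample markup copied from the question; a real script would use $mech->content.
my $html = <<'HTML';
<a href="" onclick="return _doClick('CA256D6E001A7020.80376e858b0791b1ca256d7300098304/$Body/0.155A', this, null)">Next &gt;&gt;</a>
HTML

# Hypothetical helper: return the first argument of every _doClick(...) call,
# with the surrounding single quotes removed.
sub doclick_ids {
    my ($source) = @_;
    return $source =~ m{_doClick\(\s*'([^']*)'}g;
}

my @ids = doclick_ids($html);
print "$_\n" for @ids;
```

Capturing only the quoted first argument avoids having to split and trim the whole argument list when the remaining arguments (this, null) are not needed.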
user1126070