
I would like to scrape web pages that load content dynamically with JavaScript or similar.

Something like a headless browser that I could use on a Linux shared host without X.

I can use PHP, Perl, Ruby or Python.

Does anyone know of a framework or headless browser that could help me?

Thank you very much.

  • possible duplicate of [headless internet browser?](http://stackoverflow.com/questions/814757/headless-internet-browser) – daxim Jul 20 '12 at 15:56
  • Is there any reason you can't get an inexpensive VPS and install whatever you want on it? Shared hosting is usually a terrible place to run intensive operations like this. – tadman Jul 20 '12 at 16:07

3 Answers


Try Selenium to control the browser if you need to simulate key presses or clicks in order to get the content to load.

For a headless browser, there are some listed here: headless internet browser?

Matt Gibson

See the WWW::Scripter library.

Synopsis:

use WWW::Scripter;

my $w = WWW::Scripter->new;
$w->use_plugin('JavaScript');
$w->get('http://some.site.com/that/uses/javascript');
$w->content; # returns the HTML content, possibly modified by scripts
$w->eval('alert("Hello from JavaScript")');
$w->document->getElementsByTagName('div')->[0]->...
Ωmega

Use WWW::Mechanize in Perl. This module has numerous methods that perform web-browser-like functions. Below is some sample code:

use strict;
use warnings;
use WWW::Mechanize;

my $username = "admin";
my $password = "welcome1";
my $outpath  = "/home/data/output";
my $fromday  = 7;
my $url      = "https://www.myreports.com/tax_report.php";
my $name     = "tax_report";
my $outfile  = "$outpath/$name.html";

my $mech = WWW::Mechanize->new( noproxy => '0' );

$mech->get($url);
$mech->field( login  => $username );
$mech->field( passwd => $password );

# Debugging aids: dump every request and response as it goes through.
$mech->add_handler( "request_send",  sub { shift->dump; return } );
$mech->add_handler( "response_done", sub { shift->dump; return } );

$mech->click_button( value => "Login now" );

my $response = $mech->content();

print "Generating report: $name...\n";

open my $out, '>>', $outfile or die "Cannot create report file $outfile: $!";
print $out $response;
close $out;

In case you need to handle JavaScript in the page you want to scrape, have a look at WWW::Mechanize::Firefox, though this requires installing the MozRepl add-on in Firefox.

Anjan Biswas