
I am interested in learning Perl. I am using the Learning Perl books and CPAN's websites for reference.

I am looking forward to writing a web/text scraping application in Perl, to apply what I have learnt.

Please suggest some good options to begin with.

(This is not homework. I want to do something in Perl that would help me exercise basic Perl features.)

Sinan Ünür
Rajan
  • and a lot of people think the opposite! – plusplus Feb 01 '11 at 10:23
  • @Joe: What a silly comment. Flagging it. – codaddict Feb 01 '11 at 10:23
  • in Joe's defence, it was because of the 'scrapping' spelling mistake (now corrected) in the title. However the endless perl-bashing does get pretty tiresome (thankfully there seems to have been less on SO recently). – plusplus Feb 01 '11 at 10:30
  • @plusplus I regularly see interesting misspellings in questions related to other languages, but for some reason, I am not moved to make snide remarks about those languages. I find it interesting that there are quite a few people for whom "Perl must suck" for them to feel good about their choice of programming language. See also: http://blog.nu42.com/2010/12/why-does-perl-have-to-suck-for-you-to.html – Sinan Ünür Feb 01 '11 at 10:34
  • Come on guys, can't you take a joke? And yes, if you looked at the history you'd see that the original question said "scrapping". – Joe Feb 01 '11 at 10:35

5 Answers

If the web pages you want to scrape require JavaScript to function properly, you are going to need more than WWW::Mechanize can provide. You might even have to resort to controlling a specific browser from Perl (e.g. using Win32::IE::Mechanize or WWW::Mechanize::Firefox).
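For instance, a minimal sketch with WWW::Mechanize::Firefox (untested here; the URL is a placeholder, and the module assumes a running Firefox instance with the MozRepl extension that it can talk to):

    use strict;
    use warnings;
    use WWW::Mechanize::Firefox;

    my $mech = WWW::Mechanize::Firefox->new();

    # Firefox fetches the page and runs its JavaScript for us
    $mech->get('http://example.com/js-heavy-page');

    # content() returns the DOM as currently rendered, after scripts have run
    print $mech->content;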

I haven't tried it, but there is also WWW::Scripter with the WWW::Scripter::Plugin::JavaScript plugin.

Sinan Ünür
  • I always try to skip the Javascript by directly observing the HTTP requests and responses that it builds instead of executing or analysing the JS. The "Web Scraping Proxy" from AT&T is the bomb for reverse engineering websites, and it logs the traffic in the form of WWW::Mechanize Perl code to boot! – tadmc Feb 01 '11 at 14:31
  • @tadmc Sure, but controlling Internet Explorer (via `Win32::OLE`) directly saved me a boatload of time on many occasions. – Sinan Ünür Feb 01 '11 at 17:13

As others have said, WWW::Mechanize is an excellent module for web scraping tasks; you'll do well to learn how to use it, as it can make common tasks very easy. I've used it for several web scraping jobs, and it just takes care of all the boring stuff - "go here, find a link with this text and follow it, now find a form with fields named 'username' and 'password', enter these values and submit the form...".
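That recipe translates almost line for line into code. A minimal sketch (the URL, link text, and field values are all hypothetical):

    use strict;
    use warnings;
    use WWW::Mechanize;

    my $mech = WWW::Mechanize->new();

    # "go here..."
    $mech->get('http://example.com/');

    # "...find a link with this text and follow it..."
    $mech->follow_link( text => 'Log in' );

    # "...now find a form with fields named 'username' and 'password',
    # enter these values and submit the form"
    $mech->submit_form(
        with_fields => {
            username => 'myuser',
            password => 'mypass',
        },
    );

    print $mech->content;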

Scrappy is also well worth a look - it lets you do a lot with very little code - an example from its documentation:


    use Scrappy;

    my $spidy = Scrappy->new;

    $spidy->crawl('http://search.cpan.org/recent', {
        '#cpansearch li a' => sub {
            print shift->text, "\n";
        }
    });

Scrappy makes use of Web::Scraper under the hood, which you might want to look at too as another option.

Also, if you need to extract data from HTML tables, HTML::TableExtract makes this dead easy - you can locate the table you're interested in by naming the headings it contains, and extract data very easily indeed, for example:


    use strict;
    use warnings;
    use HTML::TableExtract;

    my $te = HTML::TableExtract->new( headers => [qw(Date Price Cost)] );
    $te->parse($html_string);

    # parse() always returns the parser object, so check for matches explicitly
    $te->tables or die "Didn't find table";

    foreach my $row ($te->rows) {
        print join(',', @$row), "\n";
    }
David Precious

The most popular web scraping module for Perl is WWW::Mechanize, which is excellent if you can't just retrieve your destination page but need to navigate to it using links or forms, for instance, to log in. Have a look at its documentation for inspiration. If your needs are simple, you can extract the information you need from the HTML using regular expressions (but beware your sanity), otherwise it might be better to use a module such as HTML::TreeBuilder to do the job.
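For the HTML::TreeBuilder route, a rough sketch (the tag and class names are hypothetical, and $html is assumed to hold the page source, e.g. from WWW::Mechanize's content()):

    use strict;
    use warnings;
    use HTML::TreeBuilder;

    my $tree = HTML::TreeBuilder->new_from_content($html);

    # look_down() walks the parse tree and returns every element
    # matching the given tag/attribute criteria
    for my $link ( $tree->look_down( _tag => 'a', class => 'headline' ) ) {
        print $link->attr('href'), "\t", $link->as_text, "\n";
    }

    # HTML::Element trees are self-referential, so free them explicitly
    $tree->delete;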

A module that seems interesting, but that I haven't really tried yet, is WWW::Scripter. It's a subclass of WWW::Mechanize, but it has support for JavaScript and AJAX, and it also integrates HTML::DOM, another way to extract information from the page.
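Since I haven't tried it, take this as an untested sketch based on its documentation (the URL is a placeholder; it assumes the WWW::Scripter::Plugin::JavaScript distribution is installed):

    use strict;
    use warnings;
    use WWW::Scripter;

    my $w = WWW::Scripter->new;

    # load the JavaScript plugin so scripts on fetched pages are run
    $w->use_plugin('JavaScript');

    # the familiar WWW::Mechanize interface is inherited
    $w->get('http://example.com/');
    print $w->title, "\n";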

mscha
  • +1 It should be mentioned that `WWW::Mechanize` provides the kind of parsing needed to navigate using links and forms. – Sinan Ünür Feb 01 '11 at 10:38

Try the Web::Scraper Perl module. A beginner's tutorial can be found here.

It's safe, easy to use and fast.
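As a minimal sketch of its style (the example just collects every link on a page; the URL is a placeholder):

    use strict;
    use warnings;
    use URI;
    use Web::Scraper;

    # declare what to extract: each <a> tag's href attribute, as a list
    my $links = scraper {
        process 'a', 'urls[]' => '@href';
    };

    my $res = $links->scrape( URI->new('http://example.com/') );
    print "$_\n" for @{ $res->{urls} };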

juFo

You may also want to have a look at my new Perl wrapper around Java's HtmlUnit. It is very easy to use; for example, have a look at the quick tutorial here:

http://code.google.com/p/spidey/wiki/QuickTutorial

By tomorrow I will publish some detailed installation instructions and a first release. Unlike Mechanize and the like, you get some JavaScript support, and it is much faster and less memory-demanding than screen scraping with a real browser.