
I am building a basic search engine using the vector-space model, and this is the crawler that collects 500 URLs and strips the SGML tags from their content. However, it is very slow (it takes more than 30 minutes just to retrieve the URLs). How can I optimize the code? I have used wikipedia.org as an example starting URL.

use warnings;

use LWP::Simple;
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;
use HTML::LinkExtor;

my $starting_url = 'http://en.wikipedia.org/wiki/Main_Page';
my @urls = $starting_url;
my %alreadyvisited;
my $browser = LWP::UserAgent->new();
$browser->timeout(5);
my $url_count = 0;

while (@urls) 
{ 
     my $url = shift @urls;
     next if $alreadyvisited{$url}; ## check if already visited

     my $request = HTTP::Request->new(GET => $url);
     my $response = $browser->request($request);

     if ($response->is_error())
     {
         print $response->status_line, "\n"; ## check for bad URL
     }
     my $contents = $response->content(); ## get contents from URL
     push @c, $contents;
     my @text = &RemoveSGMLtags(\@c);
     #print "@text\n";

     $alreadyvisited{$url} = 1; ## store URL in hash for future reference
     $url_count++;
     print "$url\n";

     if ($url_count == 500) ## exit if number of crawled pages exceed limit
     {
         exit 0; 
     } 


     my ($page_parser) = HTML::LinkExtor->new(undef, $url); 
     $page_parser->parse($contents)->eof; ## parse page contents
     my @links = $page_parser->links; 

     foreach my $link (@links) 
     {
             $test = $$link[2];
             $test =~ s!^https?://(?:www\.)?!!i;
             $test =~ s!/.*!!;
             $test =~ s/[\?\#\:].*//;
             if ($test eq "en.wikipedia.org")  ## check if URL belongs to unt domain
             {
                 next if ($$link[2] =~ m/^mailto/); 
                 next if ($$link[2] =~ m/s?html?|xml|asp|pl|css|jpg|gif|pdf|png|jpeg/);
                 push @urls, $$link[2];
             }
     }
     sleep 1;
}


sub RemoveSGMLtags 
{
    my ($input) = @_;
    my @INPUTFILEcontent = @$input;
    my $j;my @raw_text;
    for ($j=0; $j<$#INPUTFILEcontent; $j++)
    {
        my $INPUTFILEvalue = $INPUTFILEcontent[$j];
        use HTML::Parse;
        use HTML::FormatText;
        my $plain_text = HTML::FormatText->new->format(parse_html($INPUTFILEvalue));
        push @raw_text, ($plain_text);
    }
    return @raw_text;
}
user2154731
  • I am not sure. I am new to Perl and still learning how to write efficient code. – user2154731 Apr 07 '13 at 15:56
  • You are trying to download the *entirety* of `en.wikipedia.org`. Apart from the likelihood that Wikipedia wouldn't like this at all, it would be something of an achievement to do that in under 30 minutes. Please think twice about doing things like this, and carefully examine the terms of service of any site you do it to. Most will not want their data abused like this. – Borodin Apr 07 '13 at 16:01
  • Actually, I have to crawl my university website. Sorry, I forgot to mention that. I put this URL just as an example. – user2154731 Apr 07 '13 at 16:06

2 Answers

  • Always use strict

  • Never use the ampersand & on subroutine calls

  • Use URI to manipulate URLs
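
To expand on the URI point: a minimal sketch of the host check in your loop done with URI instead of the three substitutions (the en.wikipedia.org string is the one from your own code):

use URI;

foreach my $link (@links)
{
    my $uri  = URI->new( $$link[2] );
    my $host = eval { $uri->host } || '';   # mailto: and similar schemes have no host
    $host =~ s/^www\.//i;                   # treat www.en.wikipedia.org the same way

    if ( $host eq 'en.wikipedia.org' )      # same domain check as in the question
    {
        push @urls, $uri->canonical->as_string;
    }
}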

You have a sleep 1 in there, which I assume is to avoid hammering the site too much. That is good, but the bottleneck in almost any web-based application is the internet itself, and you won't make your program much faster without requesting more aggressively from the site. That means removing your sleep and perhaps making parallel requests to the server using, for instance, LWP::Parallel::RobotUA. Is that a route you want to take?
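
If you do go that way, here is a minimal sketch of a parallel batch, assuming LWP::Parallel::RobotUA's register()/wait() interface; the agent name, contact address, and @batch_of_urls are placeholders, not anything from your code:

use LWP::Parallel::RobotUA;
use HTTP::Request;

# Placeholder agent name and contact address -- use your own.
my $ua = LWP::Parallel::RobotUA->new('my-crawler/0.1', 'you@example.com');
$ua->delay(0.1);     # minimum delay per host, in minutes
$ua->timeout(5);

# Queue a batch of requests, then let them run in parallel.
$ua->register( HTTP::Request->new(GET => $_) ) for @batch_of_urls;

my $entries = $ua->wait();
for my $entry ( values %$entries ) {
    my $response = $entry->response;
    next unless $response->is_success;
    # ... extract links and plain text from $response->content as before ...
}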

Borodin
  • Thanks, the code is much faster when I remove the sleep. Will try using parallel requests. – user2154731 Apr 07 '13 at 16:59
  • Another significant issue (probably; you'd know for sure if you had profiled the code) is that you're calling HTML::FormatText->new->format(parse_html(...)) on every line of input, when you could instead call it once per entire webpage. – Greg Lindahl Apr 09 '13 at 01:55
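
To illustrate that comment: a sketch of the conversion done once per fetched page, inside the main loop, instead of pushing every page onto @c and re-converting the accumulated list on each iteration (it reuses only the modules already in the question):

use HTML::Parse;
use HTML::FormatText;

# Inside the while loop, right after a successful fetch:
my $contents   = $response->content();
my $plain_text = HTML::FormatText->new->format( parse_html($contents) );
push @raw_text, $plain_text;    # each page is converted exactly once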

Use WWW::Mechanize, which handles all the URL parsing and extraction for you. It is much easier than all the link parsing you're dealing with, and it was created specifically for the sort of thing you're doing. It's a subclass of LWP::UserAgent, so you should be able to change every LWP::UserAgent to WWW::Mechanize without changing any other code, except the link extraction, which becomes:

my $mech = WWW::Mechanize->new();
$mech->get( 'http://someurl.com' );
my @links = $mech->links;

and then @links is an array of WWW::Mechanize::Link objects.
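
For the crawler in the question, a sketch of how those link objects could replace the HTML::LinkExtor block; url_abs returns an absolute URI object, and the en.wikipedia.org string is the one from the question:

for my $link ( $mech->links ) {
    my $uri  = $link->url_abs;              # absolute URI object for the link
    my $host = eval { $uri->host } || '';   # mailto: links have no host
    $host =~ s/^www\.//i;

    push @urls, $uri->as_string if $host eq 'en.wikipedia.org';
}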

Andy Lester
  • thanks, but I cannot use Mechanize since I have to run the final code on a server where that package is absent and I do not have permission to install it – user2154731 Apr 07 '13 at 17:41
  • 2
    @user2154731: Learn how to [install Perl modules under your own home directory](http://stackoverflow.com/questions/251705/how-can-i-use-a-new-perl-module-without-install-permissions). You'll find it useful for far more than just WWW::Mechanize. – Ilmari Karonen Apr 07 '13 at 18:42
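
A minimal sketch of what that looks like once a module has been installed under your home directory; the ~/perl5 path is the local::lib default and may differ on your server:

use lib "$ENV{HOME}/perl5/lib/perl5";   # point perl at modules installed in your home directory
use WWW::Mechanize;                     # now resolvable without a system-wide install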