
Please tell me which Perl module can be used to scrape a website that is developed entirely in ASP and whose contents are not in proper HTML syntax.

2 Answers


It does not matter which language was used to develop the website. All you (the client) get from the website is the HTML it produces (broken HTML, in this case).

You can use the LWP::Simple module and its get function to read the website's content into a variable, and then analyze it with regular expressions.

Like this:

use strict;
use warnings;
use LWP::Simple;

my $url     = 'http://...';
my $content = get $url;    # returns undef if the fetch fails
die "Could not fetch $url" unless defined $content;

if ($content =~ m/.../) {
    ...
}
Viliam Búr

Or you could use WWW::Mechanize. It builds upon LWP (which LWP::Simple is a very simple subset of) and provides lots of handy 'browser-like' behavior. For example, the typical session management of ASP-generated websites with login cookies and stuff gets handled by Mechanize automatically.

use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new;

# fetch the login page; session cookies are kept automatically
$mech->get( 'http://www.example.org/login.asp' );

# fill in and submit the third form on the page
$mech->submit_form(
    form_number => 3,
    fields      => {
        username => 'test',
        password => 'secret',
    },
);

While Mechanize is primarily aimed at testing, it inherits all of LWP::UserAgent's methods, so you can still get at the plain request and response while having the power of the built-in parser to access forms and links.
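
For example, continuing the login session above (a minimal sketch; success, response, content, and links are all documented WWW::Mechanize methods):

if ( $mech->success ) {
    # the raw HTTP::Response object, inherited via LWP::UserAgent
    my $response = $mech->response;
    print $response->status_line, "\n";    # e.g. "200 OK"

    # the plain page body as a string
    my $html = $mech->content;

    # the built-in parser gives structured access to the links
    print $_->url, "\n" for $mech->links;
}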

Also consider using a proper HTML parser, even if the website's output is not valid HTML. Several parsers on CPAN can handle broken markup, and it will be a lot easier than building up a bunch of regexes. Regexes get hard to maintain once you have to go back because something on the page has changed. A sketch with HTML::TreeBuilder follows below.
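
For illustration only: this assumes the page marks prices with class="price" table cells, which is a made-up placeholder, not something from the question. HTML::TreeBuilder (from the HTML-Tree distribution) is tolerant of malformed markup:

use strict;
use warnings;
use LWP::Simple;
use HTML::TreeBuilder;

my $content = get 'http://...';
die "Could not fetch page" unless defined $content;

# new_from_content parses even sloppy, non-well-formed HTML
my $tree = HTML::TreeBuilder->new_from_content($content);

# look_down finds elements by tag and attribute; the 'price'
# class is a hypothetical example
for my $cell ( $tree->look_down( _tag => 'td', class => 'price' ) ) {
    print $cell->as_text, "\n";
}

$tree->delete;    # free the parse tree's memory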

Here's a list of related questions that have info on this subject:

simbabque