1

I've created a Perl script that uses HTML::TableExtract to scrape data from tables on a site.

It works great for dumping table data from unsecured (HTTP) sites, but when I try HTTPS sites it doesn't work: the tables_report line prints nothing, when it should print a bunch of table data.

However, if I save the content of that HTTPS page to an HTML file, post it on an unsecured HTTP site, and point my script at that HTTP page, the script works as expected.

Does anyone know how I can get this to work over HTTPS?

#!/usr/bin/perl
use lib qw( ..); 
use HTML::TableExtract; 
use LWP::Simple; 
use Data::Dumper; 
# DOESN'T work:
my $content = get("https://datatables.net/"); 
# DOES work:
#   my $content = get("http://www.w3schools.com/html/html_tables.asp"); 
my $te = HTML::TableExtract->new();
$te->parse($content);
print $te->tables_report(show_content=>1);
print "\n";
print "End\n";

The sites mentioned above for $content are just examples. They aren't the sites I'm really extracting from, but they behave just like the site I'm trying to scrape.

One option, I guess, is to use Perl to download the page locally first and extract from there, but I'd rather not if there's an easier way (if you help, please don't spend a crazy amount of time coming up with a complicated solution!).

Chankey Pathak
ChrisS

4 Answers

1

The problem is the default user agent string that LWP::Simple sends, which that site blocks. Use LWP::UserAgent and set a user agent string the site accepts, like this:

use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
my $url = 'https://datatables.net/';

$ua->agent("Mozilla/5.0");  # set user agent
my $res = $ua->get($url);   # send request

# check the outcome
if ($res->is_success) {
   # ok -> I simply print the content in this example, you should parse it
   print $res->decoded_content;
}
else {
   # request failed: report the HTTP status line
   print "Error: ", $res->status_line, "\n";
}
Miguel Prz
  • Thanks for the response! This – ChrisS Oct 15 '16 at 16:46
  • Sorry, I'm new to StackOverflow and hit enter too soon. This almost worked. I had to change one part to `my $ua = LWP::UserAgent->new( ssl_opts => { verify_hostname => 0 }, );` or I would always get an error saying certificate verify failed. I combined this solution with Chankey's parsing. I'll try and post my final code below. – ChrisS Oct 15 '16 at 16:50
  • 1
    Actually after reading around, sounds like it was more recommended to use `ssl_opts => { SSL_verify_mode => 'SSL_VERIFY_PEER' },`, so that's what I did instead. Not sure if there are security issues with this at all, but I'm not interested in security here.. just trying to pull in some statistics off of a public site. – ChrisS Oct 15 '16 at 16:59
0

This is because datatables.net is blocking LWP::Simple requests. You can confirm this with the code below:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple; 
print is_success(getprint("https://datatables.net/"));

Output:

$ perl test.pl 
403 Forbidden <URL:https://datatables.net/>

You could try using LWP::RobotUA. The code below works fine for me.

#!/usr/bin/perl
use strict;
use warnings;

use LWP::RobotUA;
use HTML::TableExtract;

my $ua = LWP::RobotUA->new( 'bot_chankey/1.1', 'chankeypathak@stackoverflow.com' );
$ua->delay(5/60); # 5 second delay between requests
my $response = $ua->get('https://datatables.net/');
if ( $response->is_success ) {
    my $te = HTML::TableExtract->new();
    $te->parse($response->content);
    print $te->tables_report(show_content=>1);
}
else {
    die $response->status_line;
}
Chankey Pathak
  • Thank you Chankey! Your answer worked similarly to Miguel's... it gave me an error of "certificate verify failed" when I tried it. Presumably I had to set a flag like I had googled for Miguel's answer for this to work. In the end, I marked his as the answer because that's what the majority of my code looked like now. But if I could pick two answers, I'd pick yours as well. Upvoted, anyways, but I'm too new for it to show here. I appreciate your great help!!! – ChrisS Oct 15 '16 at 17:01
0

In the end, a combination of Miguel's and Chankey's responses provided my solution. Miguel's made up most of my code, so I selected it as the answer. Here is my "final" code (there's a lot more to do, but this was the only part I couldn't figure out; the rest should be no problem).

Neither answer worked for me exactly as posted, but they got me 99% of the way. I then just had to get around the error "certificate verify failed". I found that fix with Miguel's method right away, so I mostly used his code, but both responses were great!

#!/usr/bin/perl

use lib qw( ..); 
use strict;
use warnings;
use LWP::UserAgent;

use HTML::TableExtract; 
use LWP::RobotUA;
use Data::Dumper; 

my $ua = LWP::UserAgent->new(
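   # Caution: the quoted string is not the numeric IO::Socket::SSL constant;
   # it appears to evaluate to 0, which turns peer verification off.
   ssl_opts => { SSL_verify_mode => 'SSL_VERIFY_PEER' },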
);
my $url = 'https://WebsiteIUsedWasSomethingElse.com';

$ua->agent("Mozilla/5.0");  # set user agent
my $res = $ua->get($url);   # send request

# check the outcome
if ($res->is_success) 
{   
   my $te = HTML::TableExtract->new();
   $te->parse($res->content);
   print $te->tables_report(show_content=>1);
}
else {
   # request failed: report the HTTP status line
   print "Error: ", $res->status_line, "\n";
}
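
A note on the ssl_opts line above: 'SSL_VERIFY_PEER' in quotes is a plain string, not the constant that IO::Socket::SSL exports. Used as a number, a string like that becomes 0, which is the value of SSL_VERIFY_NONE, so this setting most likely makes the error go away by switching certificate checks off rather than by enabling them. Below is a minimal sketch of passing the real constant instead. The helper name make_verifying_ua is made up for illustration, and no request is sent. If "certificate verify failed" comes back with verification on, one possible cause is a missing or outdated CA bundle on the local machine.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use IO::Socket::SSL qw(SSL_VERIFY_PEER);

# Hypothetical helper: build a user agent that keeps certificate
# verification on by passing the numeric constant, not a quoted string.
sub make_verifying_ua {
    my $ua = LWP::UserAgent->new(
        ssl_opts => {
            verify_hostname => 1,
            SSL_verify_mode => SSL_VERIFY_PEER,
        },
    );
    $ua->agent("Mozilla/5.0");    # same user agent string as above
    return $ua;
}

# Show what the quoted string turns into when treated as a number.
my $string_mode;
{
    no warnings 'numeric';        # silence "isn't numeric" for the demo
    $string_mode = 'SSL_VERIFY_PEER' + 0;
}

my $ua = make_verifying_ua();
printf "quoted string as number: %d\n", $string_mode;
printf "SSL_VERIFY_PEER constant: %d\n", $ua->ssl_opts('SSL_verify_mode');
```

For a quick scrape of public statistics, turning verification off may be acceptable, but it is better to do that knowingly (verify_hostname => 0) than by accident.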
ChrisS
0
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

my $url  = "https://ohsesfire01.summit.network/reports/slices";
my $user = 'xxxxxx';
my $pass = 'xxxxxx';

my $ua      = LWP::UserAgent->new;
my $request = HTTP::Request->new(GET => $url);

# authenticate: send HTTP Basic credentials with the request
$request->authorization_basic($user, $pass);

my $page = $ua->request($request);
Stephen