I'm trying to figure out why this code won't run on some sites. Here is a working version:
my $url = "http://www.bbc.co.uk/news/uk-36263685";
`curl -L '$url' > ./foo.txt`;
my $html;
open (READPAGE,"<:encoding(UTF-8)","./foo.txt");
$html = join "\n", <READPAGE>;
close(READPAGE);
# works ok with the BBC page, and almost all others
my $head;
while( $html =~ m/<head.*?>(.*?)<\/head>/gis ) {
print qq|FOO: got header...\n|;
}
...and then this broken version just seems to lock up (exactly the same code, just a different URL):
my $url = "http://www.sport.pl/euro2016/1,136510,20049098,euro-2016-polsat-odkryl-karty-24-mecze-w-kanalach-otwartych.html";
`curl -L '$url' > ./foo.txt`;
my $html;
open (READPAGE,"<:encoding(UTF-8)","./foo.txt");
$html = join "\n", <READPAGE>;
close(READPAGE);
# Locks up with this regex. Just seems to be some pages it does it on
my $head;
while( $html =~ m/<head.*?>(.*?)<\/head>/gis ) {
print qq|FOO: got header...\n|;
}
I can't work out what's going on with it. Any ideas?
Thanks!
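For reference, a variant where the opening-tag match can't run past its own closing `>` (just a sketch, not necessarily a fix for whatever that page triggers):
my $head;
while ( $html =~ m/<head[^>]*>(.*?)<\/head>/gis ) {
    $head = $1;    # sketch: [^>]* keeps the <head ...> part inside the tag itself
    print qq|FOO: got header...\n|;
}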
UPDATE: For anyone interested, I ended up moving away from the Perl module I was using to extract the info, and went for a more robust HTML::Parser method. Here is the module, if anyone wants to use it as a base:
package MetaExtractor;

use strict;
use warnings;
use base "HTML::Parser";
use Data::Dumper;

# Called for every opening tag: collect image URLs and the various meta fields
sub start {
    my ($self, $tag, $attr, $attrseq, $origtext) = @_;

    if ($tag eq "img") {
        #print Dumper($tag, $attr);
        if (($attr->{src} // '') =~ /\.(jpe?g|png)/i) {
            $attr->{src} =~ s|^//|http://|;    # fix protocol-relative URLs like //foo.com
            push @{ $Links::COMMON->{images} }, $attr->{src};
        }
    }

    if ($tag =~ /^meta$/i && ($attr->{name} // '') =~ /^description$/i) {
        # set if we find <META NAME="DESCRIPTION" ...>
        $Links::COMMON->{META}->{description} = $attr->{content};
    } elsif ($tag =~ /^title$/i && !$Links::COMMON->{META}->{title}) {
        $Links::COMMON->{META}->{title_flag} = 1;
    } elsif ($tag =~ /^meta$/i && ($attr->{property} // '') =~ /^og:description$/i) {
        $Links::COMMON->{META}->{og_desc} = $attr->{content};
    } elsif ($tag =~ /^meta$/i && ($attr->{property} // '') =~ /^og:image$/i) {
        $Links::COMMON->{META}->{og_image} = $attr->{content};
    } elsif ($tag =~ /^meta$/i && ($attr->{name} // '') =~ /^twitter:description$/i) {
        $Links::COMMON->{META}->{tw_desc} = $attr->{content};
    } elsif ($tag =~ /^meta$/i && ($attr->{name} // '') =~ /^twitter:image:src$/i) {
        $Links::COMMON->{META}->{tw_image} = $attr->{content};
    }
}

# Called for text nodes: while we're inside <TITLE>...</TITLE>, save the text
sub text {
    my ($self, $text) = @_;
    if ($Links::COMMON->{META}->{title_flag}) {
        $Links::COMMON->{META}->{title} .= $text;
    }
}

# Called for closing tags: reset the flag when we see </TITLE>
sub end {
    my ($self, $tag, $origtext) = @_;
    #print qq|END TAG: '$tag'\n|;
    if ($tag =~ /^title$/i) {
        $Links::COMMON->{META}->{title_flag} = 0;
    }
}

1;    # a module file must end with a true value
It will extract:
- Title
- Meta description (not meta keywords, but it's simple enough to add)
- FB image
- FB description
- Twitter image
- Twitter description
- All the images found (it doesn't do anything too fancy with them, e.g. pages that have relative URLs, but I'm going to have a play with that as time permits)
Simply call with:
open (READPAGE, "<:encoding(UTF-8)", "/home/aycrca/public_html/cgi-bin/admin/tmp/$now.txt")
    or die "Can't open the saved page: $!";
my $p = MetaExtractor->new;
while (<READPAGE>) {
    $p->parse($_);
}
$p->eof;
close(READPAGE);
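The parser just fills in the $Links::COMMON package globals, so the results get read back out of those after $p->eof. A rough sketch of that part (assuming $Links::COMMON starts out as an empty hash ref, reset before each page):
$Links::COMMON = {};    # reset before each page, since the module writes into globals

# ...fetch the page and run the parse loop above...

my $meta = $Links::COMMON->{META} || {};
print "Title:       ", ($meta->{title} // ''), "\n";
print "Description: ", ($meta->{description} // $meta->{og_desc} // $meta->{tw_desc} // ''), "\n";
print "Image:       ", ($meta->{og_image} // $meta->{tw_image} // ''), "\n";
print "Images found:\n";
print "  $_\n" for @{ $Links::COMMON->{images} || [] };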