0

I'm trying to figure out why this code won't run on some sites. Here is a working version:

my $url = "http://www.bbc.co.uk/news/uk-36263685";

`curl -L '$url' > ./foo.txt`;

my $html;
open (READPAGE,"<:encoding(UTF-8)","./foo.txt");
    $html = join "\n", <READPAGE>;
close(READPAGE);

# works ok with the BBC page, and almost all others
my $head;
while( $html =~ m/<head.*?>(.*?)<\/head>/gis ) {
   print qq|FOO: got header...\n|;
}

..and then this broken version , just seems to lock up: (exactly the same code - just a different URL)

my $url = "http://www.sport.pl/euro2016/1,136510,20049098,euro-2016-polsat-odkryl-karty-24-mecze-w-kanalach-otwartych.html";

`curl -L '$url' > ./foo.txt`;

my $html;
open (READPAGE,"<:encoding(UTF-8)","./foo.txt");
    $html = join "\n", <READPAGE>;
close(READPAGE);

# Locks up with this regex. Just seems to be some pages it does it on
my $head;
while( $html =~ m/<head.*?>(.*?)<\/head>/gis ) {
   print qq|FOO: got header...\n|;
}

I can't work out whats going on with it. Any ideas?

Thanks!

UPDATE: For anyone interested, I ended up moving away from the Perl module I was using to extract the info, and went for a more robust HTML::Parser method. Here is the module, if anyone wants to use it as a base:

 package MetaExtractor;
 use base "HTML::Parser";
 use Data::Dumper;

 sub start {
     my ($self, $tag, $attr, $attrseq, $origtext) = @_;

     if ($tag eq "img") {
         #print Dumper($tag,$attr);

         if ($attr->{src} =~ /\.(jpe?g|png)/i) {
            $attr->{src} =~ s|^//|http://|i; # fix urls like //foo.com
            push @{$Links::COMMON->{images}}, $attr->{src};
         }
     }

     if ($tag =~ /^meta$/i &&  $attr->{'name'} =~ /^description$/i) {
         # set if we find <META NAME="DESCRIPTION"
         $Links::COMMON->{META}->{description} = $attr->{'content'};
     } elsif ($tag =~ /^title$/i && !$Links::COMMON->{META}->{title}) {
         $Links::COMMON->{META}->{title_flag} = 1;
     } elsif ($tag =~ /^meta$/i && $attr->{'property'} =~ /^og:description$/i) {
         $Links::COMMON->{META}->{og_desc} = $attr->{content}
     } elsif ($tag =~ /^meta$/i && $attr->{'property'} =~ /^og:image$/i) {
         $Links::COMMON->{META}->{og_image} = $attr->{content}
     } elsif ($tag =~ /^meta$/i && $attr->{'name'} =~ /^twitter:description$/i) {
         $Links::COMMON->{META}->{tw_desc} = $attr->{content}
     } elsif ($tag =~ /^meta$/i && $attr->{'name'} =~ /^twitter:image:src$/i) {
         $Links::COMMON->{META}->{tw_image} = $attr->{content}
     }
 }

 sub text {
     my ($self, $text) = @_;
     # If we're in <H1>...</H1> or  <TITLE>...</TITLE>, save text
     if ($Links::COMMON->{META}->{title_flag}) { $Links::COMMON->{META}->{title} .= $text; }
 }

 sub end {
     my ($self, $tag, $origtext) = @_;

     #print qq|END TAG: '$tag'\n|;

     # reset appropriate flag if we see </H1> or </TITLE>
     if ($tag =~ /^title$/i) { $Links::COMMON->{META}->{title_flag} = 0; }
 }

It will extract:

Title Meta description (not meta keywords, but its simple enough to use) FB Image FB Description Twitter Image Twitter Description All the images found (it doesn't do anything to fancy with them... i.e pages that have relative URLs ... but I'm gonna have a play with that as time permits)

Simply call with:

        my $html;
        open (READPAGE,"<:encoding(UTF-8)","/home/aycrca/public_html/cgi-bin/admin/tmp/$now.txt");

            my $p = new MetaExtractor;
            while (<READPAGE>) {
                $p->parse($_);
            }
            $p->eof;

        close(READPAGE);
Andrew Newby
  • 4,941
  • 6
  • 40
  • 81
  • Ugh. Don't use regex to parse a tagged language (like HTML). It's filthy. For that matter, fetching your data with `curl` rather than `LWP` is kinda redundant too. – Sobrique May 11 '16 at 14:35
  • There are two possibilities: A likely cause is the tag in that page is titanic. And also because you have two non-greedy searches you have a regex that might need to jump back and try again a lot. Another likely cause is this page contains
    tags, which match the first part of the regex, and can cause the same jump back and try again ad nausium behavior.
    – OmnipotentEntity May 11 '16 at 14:38
  • @Sobrique - this is from a Perl module (HTML::Miner). Do you have a better suggestion for extracting the meta tags, and images from a HTML source? – Andrew Newby May 11 '16 at 14:38
  • 2
    http://stackoverflow.com/a/1732454/5830574 (sorry, couldn't resist...) – PerlDuck May 11 '16 at 14:40
  • 1
    Andrew, look into HTML::Parser http://www.foo.be/docs/tpj/issues/vol5_1/tpj0501-0003.html – OmnipotentEntity May 11 '16 at 14:40
  • @OmnipotentEntity - thanks, will check that out :) – Andrew Newby May 11 '16 at 14:45
  • Yes. use a parser. That's what they're for. HTML is contextual. Regular expression are not. They can never work fully, and every time you try, you create some brittle code that may mysteriously break, because the assumptions you made are no longer valid. – Sobrique May 11 '16 at 15:39
  • I second the recommandation to just use a parser, and HTML::Parser (https://metacpan.org/pod/HTML::Parser) works for me. – Henrik supports the community May 11 '16 at 16:05
  • @Henrik - already done it using HTML::Parser :) (didn't take as long as I was expecting, and kinda wishing I'd just done it, before I spent the time trying to debug this module!) – Andrew Newby May 11 '16 at 16:10

2 Answers2

4

It isn't an inifinite loop, it is just slow. It is finding <header> tags too, and for each one it has to go through the rest of the file looking for an ending </head> tag (which isn't there). Change it to:

`m/<head\b.*?>(.*?)<\/head>/gis`

The problem seems exacerbated by treating the non-utf8 file as utf8.

ysth
  • 96,171
  • 6
  • 121
  • 214
  • thanks, that did the trick :) What is the \b doing? I thought that was for a boundary? And yes, its not an ideal bit of coding - but its what I have to work with on the perl module we are using :) Thanks! – Andrew Newby May 11 '16 at 14:42
  • \b means word boundary. Meaning that after the d can't be another letter. It prevents
    from matching.
    – OmnipotentEntity May 11 '16 at 14:44
  • aaaah very clever! Would never had thought of that :) Going to have a play with HTML::Parser now as well... just to see if I can get a cleaner solution going. I'll still upvote this though when it lets me :) – Andrew Newby May 11 '16 at 14:47
4

You have found an instance of catastrophic backtracking (q.v.)

Even for those sites for which your regex pattern works, the matching will be very lengthy and CPU-intensive. You should avoid the .*? where possible and use a negated character class instead

If you use this, all should be well

$html =~ m| <head\b[^<>]*> (.*) </head> |gisx

<head.*?> is supposed to match just one HTML tag, but there is nothing to prevent the regex engine from searching right to the end of the file. Changing this to <head[^<>]*> will only allow it to match non-angle-brackets after head, which will be only a few characters if any

The captured expression is less simple as you presumably want to match tags contained within the <head> element so the negated character class won't work. However, catastrophic backtracking is almost always the result of multiple wildcards acting simultaneously, so every possible match from one wildcard must be matched with every possible match from another, resulting in exponential complexity. With just one wildcard left the regex should work fine

Note also that I have used an alternative delimiter for the regex so that the slash doesn't need to be escaped, and I have added a word boundary \b after <head to prevent it from matching <header or similar

Borodin
  • 126,100
  • 9
  • 70
  • 144
  • Thanks for taking the time to explain it. I've already fixed it with the help of ysth - but I've upvoted you as well. I'm actually having a play with HTML::Parser now, as that seems a much quicker and cleaner method of doing this. I was just being lazy and using a pre-existing Perl module - but it seems like its a very poor option :( – Andrew Newby May 11 '16 at 15:03
  • 1
    @AndrewNewby: `HTML::Parser` is really a base class for building more convenient modules. I suggest that you look at [HTML::TreeBuilder](https://metacpan.org/pod/HTML::TreeBuilder) which is built on `HTML::Parser` but uses it to build a Perl data structure that you can navigate and modify at will – Borodin May 11 '16 at 15:21
  • thanks... but I've almost finished it with HTML::Parser. Just the time it takes to get used to its structure, etc – Andrew Newby May 11 '16 at 15:28