6

I want to parse a website into a Perl data structure. First I load the page with:

use LWP::Simple;
my $html = get("http://f.oo");

I know of two ways to deal with it: regular expressions, or modules.

I started by reading about HTML::Parser and found some examples, but I'm not that sure about my Perl knowledge.

My code example continues:

my @links;

my $p = HTML::Parser->new();
$p->handler(start => \&start_handler,"tagname,attr,self");
$p->parse($html);

foreach my $link(@links){
  print "Linktext: ",$link->[1],"\tURL: ",$link->[0],"\n";
}

sub start_handler{
  return if(shift ne 'a');
  my ($class) = shift->{href};
  my $self = shift;
  my $text;
  $self->handler(text => sub{$text = shift;},"dtext");
  $self->handler(end => sub{push(@links,[$class,$text]) if(shift eq 'a')},"tagname");
}

I don't understand why there are two shifts. The second should be the self pointer. But the first makes me think that the self reference has already been shifted, used as a hash, and the value for href stored in $class. Could someone explain this line (my ($class) = shift->{href};)?

Apart from this gap in my understanding, I do not want to parse all the URLs. I want to put all the code between <div class="foo"> and </div> into a string. There is a lot of code in between, especially other <div></div> tags, so either I or a module has to find the right closing tag. After that I plan to scan the string again to find specific elements, like <h1>, <h2>, <p class="foo2"></p>, etc.

I hope this information helps you give me some useful advice. Please keep in mind that first of all I want an approach that is easy to understand; it does not need to perform well at this stage.

Brad Gilbert
froehli
  • 5
    DON'T USE REGULAR EXPRESSIONS! HTML IS NOT REGULAR! – Paul Tomblin Dec 19 '11 at 23:05
  • 5
    How refreshing to see someone using an HTML parser to parse HTML instead of regexes :p +1 just for that – fge Dec 19 '11 at 23:06
  • 1
    FWIW: `my ($class) = shift->{href};` <-- means take the `href` hash member of the shifted argument. Could have been written `my $ref = shift; my $class = $ref->{"href"};` – fge Dec 19 '11 at 23:10
  • 1
    Is `HTML::Parser` a requirement? You could probably make this a lot simpler using something that implements the standard DOM methods (e.g. `HTML::TagParser`). – Wayne Dec 19 '11 at 23:13
  • But where does `href` come from? First I thought that `@_[0]` is shifted, and that it is a string which just holds the name of the subroutine. Then I guessed that it could be a scalar variable `$foo` which is a pointer. But then the next line would put `@_[1]` into `$self`, which does not make sense to me. – froehli Dec 19 '11 at 23:15
  • I'm also new to web development. I know that Firebug can display some DOM things, but that's it. – froehli Dec 19 '11 at 23:21
  • @Paul Tomblin, That's complete nonsense. Whether HTML is regular or not does not prevent the use of regular expressions. It might prevent just regular expressions from being used, but even that is doubtful since Perl regular expressions are not even close to regular. – ikegami Dec 19 '11 at 23:28
  • (The previous message should not be taken as an endorsement for using regular expressions.) – ikegami Dec 20 '11 at 23:46

4 Answers

5

Use HTML::TokeParser::Simple.

Untested code based on your description:

#!/usr/bin/env perl

use strict; use warnings;

use HTML::TokeParser::Simple;

my $p = HTML::TokeParser::Simple->new(url => 'http://example.com/example.html');

my $level;

while (my $tag = $p->get_tag('div')) {
    my $class = $tag->get_attr('class');
    next unless defined($class) and $class eq 'foo';

    $level += 1;

    while (my $token = $p->get_token) {
        $level += 1 if $token->is_start_tag('div');
        $level -= 1 if $token->is_end_tag('div');
        print $token->as_is;
        unless ($level) {
            last;
        }
    }
}
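
The depth counting in the loop above can be tried out without any parser module. The following is only a sketch of the counting idea: the token list is hand-made (it is not the output of HTML::TokeParser::Simple), and the names `@tokens` and `$captured` are made up for illustration.

```perl
use strict;
use warnings;

# Hand-made token stream standing in for what a tokenizer would emit.
# Each token is [ kind, tag name, raw text ]. The stream is assumed to
# begin at the opening <div class="foo"> tag.
my @tokens = (
    [ 'start', 'div', '<div class="foo">' ],
    [ 'text',  '',    'outer ' ],
    [ 'start', 'div', '<div>' ],
    [ 'text',  '',    'inner' ],
    [ 'end',   'div', '</div>' ],
    [ 'end',   'div', '</div>' ],
    [ 'text',  '',    ' after' ],
);

my $level    = 0;    # current <div> nesting depth
my $captured = '';   # raw text collected so far

for my $token (@tokens) {
    my ($kind, $tag, $raw) = @$token;

    $level++ if $kind eq 'start' and $tag eq 'div';
    $level-- if $kind eq 'end'   and $tag eq 'div';

    $captured .= $raw;

    # Depth back at zero: this was the </div> matching the opening tag.
    last if $level == 0;
}

print "$captured\n";
```

The nested <div> raises the counter to 2, so its closing tag only brings it back to 1; the capture stops at the second </div>, and the trailing text is never collected.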
Sinan Ünür
5

HTML::Parser is more of a tokenizer than a parser. It leaves a lot of hard work up to you. Have you considered using HTML::TreeBuilder (which uses HTML::Parser) or XML::LibXML (a great library which has support for HTML)?
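
As a rough sketch of the HTML::TreeBuilder route, assuming the module is installed from CPAN: the sample markup and the variable names below are made up for illustration, and `look_down`, `as_HTML` and `as_text` are the HTML::Element methods used to find the div and then the elements inside it.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up sample page; in the question this would be the $html from get().
my $html = <<'HTML';
<html><body>
<div class="foo"><h1>Title</h1><div><p class="foo2">Inner</p></div></div>
<div class="bar"><h1>Other</h1></div>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# In scalar context look_down returns the first matching element (or undef).
my $div = $tree->look_down(_tag => 'div', class => 'foo')
    or die "no <div class=\"foo\"> found\n";

# The whole element, nested <div>s included, as one string.
my $div_html = $div->as_HTML;

# Second pass, restricted to that element: headings and marked paragraphs.
my @heading_texts = map { $_->as_text } $div->look_down(_tag => qr/^h[12]$/);
my @para_texts    = map { $_->as_text } $div->look_down(_tag => 'p', class => 'foo2');

print "headings: @heading_texts\n";
print "paragraphs: @para_texts\n";

$tree->delete;    # free the tree when done
```

Because the tree builder has already matched opening and closing tags, there is no need to count nested <div>s by hand.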

ikegami
3

No need to get so complicated. You can retrieve and find elements in the DOM using CSS selectors with Mojo::UserAgent:

say Mojo::UserAgent->new->get('http://f.oo')->res->dom->find('div.foo');

or, loop through the elements found:

say $_ for Mojo::UserAgent->new->get('http://f.oo')->res->dom
    ->find('div.foo')->each;

or, loop using a callback:

Mojo::UserAgent->new->get('http://f.oo')->res->dom->find('div.foo')->each(sub {
  # the callback receives the element first, then its position
  my ($el, $count) = @_;
  say "$count: $el";
});
Tempire
  • It seems that my Mac does not have Mojo::UserAgent installed, which means that our webserver doesn't have it either. The same goes for TokeParser::Simple. Anyway, I found out that the site to parse is not proper XHTML, so I have to find my own way. – froehli Dec 30 '11 at 10:42
  • Mojo::UserAgent is not part of the core, but it's simple to install: "curl -L cpanmin.us | perl - Mojolicious". If you're limiting yourself to core, you're missing out on the primary benefit of Perl, which would be unfortunate. Also, if your documents are any form of HTML at all, Mojo::DOM should handle it; it's meant for real-world usage, not strict xml tags. – Tempire Dec 31 '11 at 04:41
1

With the "tagname,attr,self" argspec from the question, the handler is called with ($tagname, \%attr, $self). There are three shifts, one for each argument.

my ($class) = shift->{href};

is equivalent to:

my $class;
my %attr;
my $attr_ref;

$attr_ref = shift;
%attr = %$attr_ref;
$class = $attr{'href'};
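
To watch the three shifts in isolation, here is a minimal sketch that calls a handler-shaped sub by hand, without HTML::Parser. The sub name `demo_handler` and the sample arguments (including the string 'PARSER' standing in for the parser object) are made up for illustration; the argument order mirrors the "tagname,attr,self" argspec from the question.

```perl
use strict;
use warnings;

# Hypothetical stand-in for start_handler. It receives the same three
# arguments the parser would pass for "tagname,attr,self".
sub demo_handler {
    return if shift ne 'a';        # 1st shift: the tag name, e.g. 'a'
    my ($href) = shift->{href};    # 2nd shift: hash ref of attributes
    my $self   = shift;            # 3rd shift: the parser object
    return ($href, $self);
}

my ($href, $self) = demo_handler('a', { href => 'http://f.oo', class => 'x' }, 'PARSER');
print "href=$href self=$self\n";

# For any other tag the first shift fails the test and the sub returns early.
my @nothing = demo_handler('div', { class => 'foo' }, 'PARSER');
print "div returned ", scalar(@nothing), " values\n";
```

Each `shift` removes the first remaining element of `@_`, so by the time `shift->{href}` runs, the tag name is already gone and the attribute hash reference is at the front.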
Sinan Ünür
Amadan
  • I've got that. But what about the condition? Does it not call another shift? And why is it just an 'a' when it starts with `... – froehli Dec 19 '11 at 23:29
  • As I said, there's three shifts in there, not two: one in the `if`, one for the attributes (one of which gets assigned to `$class`), and one for what becomes `$self`. The test condition tests for the tag name - the parser itself will take care of the `<`. – Amadan Dec 20 '11 at 00:11
  • If the `if` counts, then I see five shifts, two of them in a condition. If there are only three, then the shifts in the conditions don't pull anything out of the array, do they? – froehli Dec 20 '11 at 11:20
  • Just three. Bottom two are in subfunctions, so they don't operate on the same `@_`. – Amadan Dec 21 '11 at 00:47
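
That each sub has its own `@_` can be checked with a small sketch; the sub name `outer` and the sample arguments are made up for illustration.

```perl
use strict;
use warnings;

sub outer {
    my $first = shift;                   # shifts from outer's @_

    # An anonymous sub gets a fresh @_ every time it is called,
    # so this shift never touches outer's arguments.
    my $inner = sub { return shift };
    my $from_inner = $inner->('inner-arg');

    # scalar @_ is the number of arguments outer still has left.
    return ($first, $from_inner, scalar @_);
}

my ($first, $from_inner, $left) = outer('one', 'two', 'three');
print "$first $from_inner $left\n";
```

Only one element was removed from the outer `@_`, so two arguments remain; the shift inside the anonymous sub consumed 'inner-arg' from its own argument list.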