How to parse between <div class="foo"> and </div> easily in Perl

Question

I want to parse a Website into a Perl data structure. First I load the page with

use LWP::Simple;
my $html = get("http://f.oo");

Now I know two ways to deal with it: first regular expressions, and second modules.

I started by reading about HTML::Parser and found some examples. But I'm not that sure about my Perl knowledge.

My code example continues:

my @links;

my $p = HTML::Parser->new();
$p->handler(start => \&start_handler,"tagname,attr,self");
$p->parse($html);

foreach my $link(@links){
  print "Linktext: ",$link->[1],"\tURL: ",$link->[0],"\n";
}

sub start_handler{
  return if(shift ne 'a');
  my ($class) = shift->{href};
  my $self = shift;
  my $text;
  $self->handler(text => sub{$text = shift;},"dtext");
  $self->handler(end => sub{push(@links,[$class,$text]) if(shift eq 'a')},"tagname");
}

I don't understand why there are two shifts. The second should be the self pointer. But the first makes me think that the self reference has already been shifted, used as a hash, and the value for href stored in $class. Could someone explain this line (my ($class) = shift->{href};)?

Apart from that, I do not want to parse all the URLs. I want to put all the code between <div class="foo"> and </div> into a string. There is a lot of code in between, especially other <div></div> tags, so either I or a module has to find the right closing tag. After that I plan to scan the string again to find specific elements, like <h1>, <h2>, <p class="foo2"></p>, etc.

I hope this information helps you give me some useful advice. Please keep in mind that first of all I want a way that is easy to understand; it does not need to perform well at first!

How refreshing to see someone using an HTML parser to parse HTML instead of regexes :p +1 just for that — fge, Dec 19 '11 at 23:06
FWIW: `my ($class) = shift->{href};` <-- means take the `href` hash member of the shifted argument. Could have been written `my $ref = shift; my $class = $ref->{"href"};` — fge, Dec 19 '11 at 23:10
Is `HTML::Parser` a requirement? You could probably make this a lot simpler using something that implements the standard DOM methods (e.g. `HTML::TagParser`). — Wayne, Dec 19 '11 at 23:13
But where does `href` come from? First I thought that `$_[0]` is shifted, and that it is a string that just holds the name of the subroutine. Then I guessed that it could be a scalar variable `$foo` which is a pointer. But then the next line would put `$_[1]` into `$self`, which does not make sense to me. – froehli, Dec 19 '11 at 23:15
I'm also new to web development. I know that Firebug can display some DOM things, but that's it. – froehli, Dec 19 '11 at 23:21
@Paul Tomblin, That's complete nonsense. Whether HTML is regular or not does not prevent the use of regular expressions. It might prevent just regular expressions from being used, but even that is doubtful since Perl regular expressions are not even close to regular. — ikegami, Dec 19 '11 at 23:28
(The previous message should not be taken as an endorsement for using regular expressions.) — ikegami, Dec 20 '11 at 23:46

Sinan Ünür · Answer 1 · 2011-12-19T23:37:46.863

Use HTML::TokeParser::Simple.

Untested code based on your description:

#!/usr/bin/env perl

use strict; use warnings;

use HTML::TokeParser::Simple;

my $p = HTML::TokeParser::Simple->new(url => 'http://example.com/example.html');

my $level = 0;  # nesting depth of <div> tags, counted from the matching div

while (my $tag = $p->get_tag('div')) {
    my $class = $tag->get_attr('class');
    next unless defined($class) and $class eq 'foo';

    $level += 1;

    while (my $token = $p->get_token) {
        $level += 1 if $token->is_start_tag('div');
        $level -= 1 if $token->is_end_tag('div');
        print $token->as_is;
        unless ($level) {
            last;
        }
    }
}

score 5 · Answer 2 · answered Dec 19 '11 at 23:36

HTML::Parser is more of a tokenizer than a parser. It leaves a lot of hard work up to you. Have you considered using HTML::TreeBuilder (which uses HTML::Parser) or XML::LibXML (a great library which has support for HTML)?
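A minimal sketch of the HTML::TreeBuilder route, assuming the module is installed from CPAN. The helper name `foo_div_parts` and the sample markup are made up for illustration; `look_down`, `tag`, `attr` and `as_text` are methods of HTML::Element. Because the tree already knows which `</div>` closes which `<div>`, nested divs need no manual counting:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: find the first <div class="foo"> and list the
# <h1>, <h2> and <p class="foo2"> elements inside it as [tag, text] pairs.
sub foo_div_parts {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my $div  = $tree->look_down(_tag => 'div', class => 'foo');
    my @parts;
    if ($div) {
        # A code ref criterion is called once per element below $div.
        my @hits = $div->look_down(sub {
            my $el = shift;
            return 1 if $el->tag =~ /^h[12]$/;
            return $el->tag eq 'p' && ($el->attr('class') // '') eq 'foo2';
        });
        @parts = map { [ $_->tag, $_->as_text ] } @hits;
    }
    $tree->delete;    # release the tree when done
    return @parts;
}

# Sample markup (an assumption, not the asker's real page).
my $sample = '<html><body><div class="foo"><h1>Title</h1>'
           . '<div><p class="foo2">Inner</p></div></div>'
           . '<div class="bar"><h1>Other</h1></div></body></html>';

print "$_->[0]: $_->[1]\n" for foo_div_parts($sample);
```

If the markup itself is needed as a string, `$div->as_HTML` serializes the element, although the output is re-generated from the tree rather than copied byte for byte from the input.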

answered Dec 19 '11 at 23:36

ikegami

score 3 · Answer 3 · answered Dec 25 '11 at 02:40

No need to get so complicated. You can retrieve and find elements in the DOM using CSS selectors with Mojo::UserAgent:

say Mojo::UserAgent->new->get('http://f.oo')->res->dom->find('div.foo');

or, loop through the elements found:

say $_ for Mojo::UserAgent->new->get('http://f.oo')->res->dom
    ->find('div.foo')->each;

or, loop using a callback:

Mojo::UserAgent->new->get('http://f.oo')->res->dom->find('div.foo')->each(sub {
  my ($count, $el) = @_;
  say "$count: $el";
});

answered Dec 25 '11 at 02:40

Tempire

It seems that my Mac does not have Mojo::UserAgent installed, which means that our webserver doesn't have it either. Same for TokeParser::Simple. But anyway, I found out that the site to parse is not proper XHTML, so I'll have to do it my own way. – froehli Dec 30 '11 at 10:42
Mojo::UserAgent is not part of the core, but it's simple to install: "curl -L cpanmin.us | perl - Mojolicious". If you're limiting yourself to core, you're missing out on the primary benefit of Perl, which would be unfortunate. Also, if your documents are any form of HTML at all, Mojo::DOM should handle it; it's meant for real-world usage, not strict xml tags. – Tempire Dec 31 '11 at 04:41
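For the two-step plan in the question (grab everything inside the div as a string, then look for specific elements in it), here is a rough sketch using Mojo::DOM directly on a string that has already been fetched. It assumes a reasonably recent Mojolicious is installed; the helper name `extract_foo` and the sample markup are hypothetical:

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical helper: returns the inner markup of the first <div class="foo">
# plus an array ref of [tag, text] pairs for the elements of interest.
# Returns an empty list if there is no such div.
sub extract_foo {
    my ($html) = @_;
    my $div = Mojo::DOM->new($html)->at('div.foo');
    return unless $div;
    my $inner = $div->content;    # everything between the opening and closing tag
    my @parts = $div->find('h1, h2, p.foo2')
                    ->map(sub { [ $_->tag, $_->all_text ] })
                    ->each;
    return ($inner, \@parts);
}

# Sample markup (an assumption, not the asker's real page).
my $sample = '<div class="foo"><h1>Title</h1>'
           . '<div><p class="foo2">Inner</p></div></div>'
           . '<div class="bar"><h1>Other</h1></div>';

my ($inner, $parts) = extract_foo($sample);
print "Inner markup: $inner\n";
print "$_->[0]: $_->[1]\n" for @$parts;
```

The CSS selector does both jobs here: `div.foo` picks the outer element, nested divs included, and `h1, h2, p.foo2` replaces the second scan over the extracted string.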

score 1 · Answer 4 · edited Dec 19 '11 at 23:32

According to the docs, the arguments a handler receives are set by its argspec. Here it is "tagname,attr,self", so the handler is called with ($tagname, \%attr, $self). There are three shifts, one for each argument.

my ($class) = shift->{href};

is equivalent to:

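my $class;
my %attr;
my $attr_ref;

$attr_ref = shift;          # the argument is a reference to the attribute hash
%attr     = %$attr_ref;     # dereference it into a plain hash
$class    = $attr{'href'};  # look up the value of the href attribute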

edited Dec 19 '11 at 23:32

Sinan Ünür

answered Dec 19 '11 at 23:11

Amadan

I've got that. But what about the condition? Does it not call another shift? And why is it just an 'a' when it starts with `... – froehli Dec 19 '11 at 23:29
As I said, there's three shifts in there, not two: one in the `if`, one for the attributes (one of which gets assigned to `$class`), and one for what becomes `$self`. The test condition tests for the tag name - the parser itself will take care of the `<`. – Amadan Dec 20 '11 at 00:11
If the `if` counts, then I see five shifts, two of them in a condition. If there are only three, then the shifts in the conditions don't pull anything out of the array, right? – froehli Dec 20 '11 at 11:20
Just three. Bottom two are in subfunctions, so they don't operate on the same `@_`. – Amadan Dec 21 '11 at 00:47
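The counting of shifts discussed in this thread can be tried out without HTML::Parser at all. The sketch below uses only core Perl; the handler is a simplified stand-in for the one in the question, and the string 'fake parser' is a placeholder for the real parser object:

```perl
use strict;
use warnings;

# Simplified stand-in for the handler in the question. With the argspec
# "tagname,attr,self", HTML::Parser would call it as
# start_handler($tagname, \%attr, $parser).
sub start_handler {
    return if (shift ne 'a');       # 1st shift: removes the tag name from @_
    my ($href) = shift->{href};     # 2nd shift: removes the attribute hash ref
    my $self   = shift;             # 3rd shift: removes the parser object

    # An anonymous sub gets its own @_ every time it is called, so this
    # shift never touches the arguments of start_handler.
    my $inner = sub { return shift };

    # scalar @_ shows how many arguments are left in the outer @_.
    return ($href, $self, $inner->('inner argument'), scalar @_);
}

my @result = start_handler('a', { href => 'http://f.oo' }, 'fake parser');
print join(' | ', @result), "\n";
# prints: http://f.oo | fake parser | inner argument | 0
```

Each bare `shift` removes the first remaining element of the current sub's `@_`, which is why the order of the three statements has to match the order of the argspec.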
