Parse html using Perl

Question

I have the following HTML:

<div>
   <strong>Date: </strong>
       19 July 2011
</div>

I have been using HTML::TreeBuilder to extract parts of the HTML that can be identified by tag or class. The HTML above is giving me difficulty, however, because I want to extract only the date.

For instance, I tried:

for ( $tree->look_down( '_tag' => 'div' ) ) {
    my $date = $_->look_down( '_tag' => 'strong' )->as_trimmed_text;
}

But that seems to conflict with an earlier use of <strong>. I want to extract just the '19 July 2011'. I have read the TreeBuilder documentation but cannot find a way of doing this.

How can I do this using TreeBuilder?

Dave Cross · Accepted Answer · 2011-07-21T16:04:57.560

3

The "dump" method is invaluable in finding your way around an HTML::TreeBuilder object.

The solution here is to get the parent element of the element you're interested in (which is, in this case, the <div>) and iterate across its content list. The text you're interested in will be plain text nodes, i.e. elements in the list that are not references to HTML::Element objects.

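#!/usr/bin/perl

use strict;
use warnings;

use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;

$tree->parse(<<END_OF_HTML);
<div>
   <strong>Date: </strong>
       19 July 2011
</div>
END_OF_HTML

# parse() does not finish the tree by itself; eof() must be called
# before the tree is used.
$tree->eof;

my $date;

for my $div ($tree->look_down( _tag => 'div')) {
  for ($div->content_list) {
    # Child elements are HTML::Element references; text nodes are plain strings.
    $date = $_ unless ref;
  }
}

print "$date\n";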
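
To see why the date shows up as a plain string rather than an element, the "dump" method mentioned above can be pointed at the same kind of tree. This is only a minimal sketch: the one-line HTML string is a condensed stand-in for the question's markup, and the outline is captured in a variable (`$outline`, a name chosen here) before printing. Calling `$tree->dump` with no argument prints straight to STDOUT instead.

```perl
#!/usr/bin/perl

use strict;
use warnings;

use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse('<div><strong>Date: </strong> 19 July 2011</div>');
$tree->eof;    # required after parse() before using the tree

# dump() accepts an optional filehandle; an in-memory handle is used
# here so that the outline ends up in $outline.
open my $out, '>', \my $outline or die "Cannot open in-memory handle: $!";
$tree->dump($out);
close $out;

# Elements appear as tags, text nodes as quoted strings.
print $outline;

$tree->delete;    # free the tree when finished
```

In the printed outline the date should appear as a quoted string sitting directly under the div, next to the strong element, which is what makes the `ref` check in the loop work.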

edited Jul 21 '11 at 16:04

answered Jul 21 '11 at 13:30

Looks good, but is there a way around having to hard-code the HTML? If I were reading from an HTML file, would I just open 'foo.csv'? – Ebikeneser Jul 21 '11 at 15:31
Sorry, that was just there for demonstration purposes. I assumed that you knew how to parse data with HTML::TreeBuilder. The HTML::TreeBuilder object has a parse_file method (as you'll see in the documentation). – Dave Cross Jul 21 '11 at 16:04
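
As a rough illustration of the parse_file route mentioned in the comment above, the sketch below writes the question's HTML to a temporary file and parses it from there. The temporary file exists only to make the example self-contained; with a real document you would pass its path to parse_file directly. The whitespace filtering and trimming are additions made here, not part of the original answer.

```perl
#!/usr/bin/perl

use strict;
use warnings;

use File::Temp qw(tempfile);
use HTML::TreeBuilder;

# Stand-in for a real HTML file on disk.
my ($fh, $filename) = tempfile(SUFFIX => '.html', UNLINK => 1);
print {$fh} "<div>\n   <strong>Date: </strong>\n       19 July 2011\n</div>\n";
close $fh or die "Cannot close $filename: $!";

my $tree = HTML::TreeBuilder->new;

# parse_file() reads the whole file and calls eof() itself.
defined $tree->parse_file($filename)
    or die "Cannot parse $filename: $!";

# Take the first non-blank text node found directly inside a div.
my ($date) = grep { !ref($_) && /\S/ }
             map  { $_->content_list }
             $tree->look_down(_tag => 'div');

$date =~ s/^\s+|\s+$//g;    # trim surrounding whitespace

print "$date\n";

$tree->delete;
```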

score 2 · Answer 2 · answered Jul 21 '11 at 12:57

It looks like HTML::Element::content_list() is the function you want. Descendant nodes will be objects while text will just be text, so you can filter with ref() to just get the text part(s).

for ($tree->find('div')) {
  my @content = grep { ! ref } $_->content_list;
  # @content now contains just the bare text portion of the tag
}
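
A fuller, runnable version of the same idea might look like the sketch below. The second div ("Place: London") is made-up sample data, added only to show the filter working across several elements. The whitespace check and trimming are assumptions about what is wanted, not something the snippet above does.

```perl
#!/usr/bin/perl

use strict;
use warnings;

use HTML::TreeBuilder;

# Hypothetical sample input; only the first div comes from the question.
my $html = <<'END_OF_HTML';
<div><strong>Date: </strong> 19 July 2011</div>
<div><strong>Place: </strong> London</div>
END_OF_HTML

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof;

my @values;
for my $div ($tree->find('div')) {
    # Keep plain-text children, skip whitespace-only ones, trim the rest.
    for my $text (grep { !ref($_) && /\S/ } $div->content_list) {
        $text =~ s/^\s+|\s+$//g;
        push @values, $text;
    }
}

print "$_\n" for @values;

$tree->delete;
```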

Alan Haggai Alavi · Answer 3 · 2011-07-21T12:52:44.477

1

You could work around it by removing the text within <strong> from <div>:

my $div  = $tree->look_down( '_tag' => 'div' );
my $date = $div->as_trimmed_text;    # label and date together
if ( my $strong = $div->look_down( '_tag' => 'strong' ) ) {
    my $strong_text = $strong->as_trimmed_text;
    # \Q...\E quotes any regex metacharacters in the label text
    $date =~ s/^\Q$strong_text\E\s*//;
}
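
One caveat when interpolating element text into a substitution like this: any regex metacharacters in the text are treated as pattern syntax unless they are quoted with \Q...\E. The sketch below uses a made-up label containing parentheses to show the difference. It needs no HTML module, and the label and text are hypothetical.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Hypothetical label and surrounding text, chosen to contain metacharacters.
my $label = 'Total (net):';
my $text  = 'Total (net): 42';

# Unquoted: the parentheses become a capture group, so nothing matches.
(my $unquoted = $text) =~ s/^$label\s*//;

# Quoted: the label is matched literally and removed.
(my $quoted = $text) =~ s/^\Q$label\E\s*//;

print "unquoted: '$unquoted'\n";    # unquoted: 'Total (net): 42'
print "quoted:   '$quoted'\n";      # quoted:   '42'
```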

edited Jul 21 '11 at 12:52

answered Jul 21 '11 at 10:18

It says that it can't call a method on an undefined value on the 'my $strong_text = $div->look_down( '_tag' => 'strong' )->as_trimmed_text;' line. Bearing in mind this is inside a 'for' loop - 'for ( $tree->look_down( '_tag' => 'div')) { ' - perhaps that is causing the error? – Ebikeneser Jul 21 '11 at 10:44
It should be fine to use `look_down` in a `for` loop. Can you please provide a sample of the HTML (with multiple `div` and `strong` elements) that you are trying to parse? – Alan Haggai Alavi Jul 21 '11 at 11:34
Premium !

They are frozen.
– Ebikeneser Jul 21 '11 at 11:56
I have updated my code with a check to see if a `<strong>` exists within a `<div>` or not. – Alan Haggai Alavi Jul 21 '11 at 12:53
There seems to be a syntax error reporting a missing '}', and it still doesn't seem to pick up what I want. – Ebikeneser Jul 21 '11 at 13:27
That is probably an error in *your* code. By the way, have a look at **[RickF](http://stackoverflow.com/users/447771/rickf)**'s [answer](http://stackoverflow.com/questions/6774223/parse-html-using-perl/6776391#6776391). – Alan Haggai Alavi Jul 21 '11 at 13:30