perl parse html tree builder or element or parser

Question

I'm trying to extract some information from HTML using Perl. I found out about TreeBuilder, Element, and Parser; which one should I use? How would I extract the name and the value of each row below? This table is embedded in a larger HTML structure, and the only way to really target the field I want is by the text in its first column, for example "Number of directories". Or should I just run a regex over the entire HTML?

<table cellspacing="0">
    <tbody><tr><td class="black">Number of directories</td><td class="black">:</td><td class="black">&nbsp;80</td></tr>
        <tr><td class="black">Number&nbsp;of&nbsp;monitored&nbsp;source&nbsp;files</td><td class="black">:</td><td class="black">&nbsp;425</td></tr>
        <tr><td class="black">Number of functions</td><td class="black">:</td><td class="black">&nbsp;6245</td></tr>
        <tr><td class="black">Number&nbsp;of&nbsp;source&nbsp;lines</td><td class="black">:</td><td class="black">&nbsp;3245</td></tr>
        <tr><td class="black">Number&nbsp;of&nbsp;measurement&nbsp;points</td><td class="black">:</td><td class="black">&nbsp;2457</td></tr>
        <tr><td class="red">TER</td><td class="red">:</td><td class="red">&nbsp;<strong>12%</strong>&nbsp;(decision)</td></tr>
    </tbody></table>

Whoever made that HTML does not understand the purpose of CSS: `class='red'` indeed. Too bad, because code with good structural CSS is easier to work with. — daotoad, Jun 14 '11 at 02:02

score 1 · Answer 1 · answered Jun 14 '11 at 02:41

If you need to extract data from an HTML table, then

use HTML::TableExtract;

would be a good choice.

answered Jun 14 '11 at 02:41 by tadmc

Unfortunately, HTML::TableExtract is oriented towards tables with headers across the top instead of down the left side (like this table is). – cjm Jun 17 '11 at 02:29
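For a table like this one, with labels down the left side, one option is to skip header matching entirely and walk every row. The following is a minimal sketch under that assumption, not tadmc's own code; the `%stats` hash and the shortened sample HTML are illustrative, and it assumes HTML::TableExtract's default entity decoding, which turns `&nbsp;` into the character `\xA0`.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Two sample rows copied from the question.
my $html = <<'END_HTML';
<table cellspacing="0">
<tr><td class="black">Number of directories</td><td class="black">:</td><td class="black">&nbsp;80</td></tr>
<tr><td class="black">Number&nbsp;of&nbsp;source&nbsp;lines</td><td class="black">:</td><td class="black">&nbsp;3245</td></tr>
</table>
END_HTML

# No headers, depth, or count constraints: every table is extracted.
my $te = HTML::TableExtract->new;
$te->parse($html);

my %stats;    # illustrative name: label => value
for my $table ( $te->tables ) {
    for my $row ( $table->rows ) {
        # Each row is an array ref of cell text: label, ':' separator, value.
        my ( $label, undef, $value ) = map { defined $_ ? $_ : '' } @$row;
        for ( $label, $value ) {
            tr/\xA0/ /;        # decoded &nbsp; becomes an ordinary space
            s/^\s+|\s+$//g;    # trim both ends
        }
        $stats{$label} = $value;
    }
}

print "Number of directories: $stats{'Number of directories'}\n";
```

If the page contains several tables, the `depth` and `count` constructor options can narrow the extraction to the one you want.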

score 0 · Answer 2 · edited May 23 '17 at 12:06

Of course, everyone is going to have their own favorite. I prefer HTML::TokeParser; I find it easy to understand and use (once you get over the hump of how the returned arrays work). I also have to point you to the classic SO post reminding you not to parse HTML with regular expressions.

answered Jun 14 '11 at 01:13 by Joel Berger; edited May 23 '17 at 12:06 by Community
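To make the HTML::TokeParser suggestion concrete, here is a rough sketch of how the rows from the question might be read with it. It is not Joel Berger's code; the `%stats` hash and the shortened sample HTML are illustrative. The "return arrays" mentioned above are the array refs that `get_tag` hands back, whose first element is the tag name (`td`, `/tr`, and so on).

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Sample rows copied from the question.
my $html = <<'END_HTML';
<table cellspacing="0">
<tr><td class="black">Number of directories</td><td class="black">:</td><td class="black">&nbsp;80</td></tr>
<tr><td class="red">TER</td><td class="red">:</td><td class="red">&nbsp;<strong>12%</strong>&nbsp;(decision)</td></tr>
</table>
END_HTML

my $parser = HTML::TokeParser->new( \$html )
    or die "Cannot create parser";

my %stats;    # illustrative name: label => value
while ( $parser->get_tag('tr') ) {
    my @cells;

    # Walk forward to each <td>, stopping when the row closes.
    while ( my $tag = $parser->get_tag( 'td', '/tr' ) ) {
        last if $tag->[0] eq '/tr';

        # Gather all text up to </td>, including text inside nested tags.
        my $text = $parser->get_text('/td');
        $text =~ tr/\xA0/ /;        # decoded &nbsp; becomes an ordinary space
        $text =~ s/^\s+|\s+$//g;    # trim both ends
        push @cells, $text;
    }

    # Cells are: label, ':' separator, value.
    my ( $label, undef, $value ) = @cells;
    $stats{$label} = $value if defined $label && defined $value;
}

print "$_ => $stats{$_}\n" for sort keys %stats;
```

Because the parser works on a token stream rather than a tree, this approach depends on the cells arriving in a predictable order.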

score 0 · Accepted Answer · answered Jun 14 '11 at 02:25

There are a few steps.

1. Use one of HTML::TreeBuilder's constructors to parse the HTML.
2. Convert the HTML::TreeBuilder object at the root into an HTML::Element by calling elementify().
3. Understand the structure of your HTML well enough that you can tell HTML::Element::look_down() how to find the bits you are interested in. You can specify criteria in almost any form imaginable.
4. Use HTML::Element::look_down(), content_list(), left(), right() and related methods to traverse the area of interest and extract data. DO NOT USE traverse(): it was a bad idea.
5. Pass the data you collected to whatever system asked for it in the first place.

Here's some code:

use strict;
use warnings;
use HTML::TreeBuilder;

my $blarg = <<'END_HTML';
<table cellspacing="0">
    <tbody><tr><td class="black">Number of directories</td><td class="black">:</td><td class="black">&nbsp;80</td></tr>
        <tr><td class="black">Number&nbsp;of&nbsp;monitored&nbsp;source&nbsp;files</td><td class="black">:</td><td class="black">&nbsp;425</td></tr>
        <tr><td class="black">Number of functions</td><td class="black">:</td><td class="black">&nbsp;6245</td></tr>
        <tr><td class="black">Number&nbsp;of&nbsp;source&nbsp;lines</td><td class="black">:</td><td class="black">&nbsp;3245</td></tr>
        <tr><td class="black">Number&nbsp;of&nbsp;measurement&nbsp;points</td><td class="black">:</td><td class="black">&nbsp;2457</td></tr>
        <tr><td class="red">TER</td><td class="red">:</td><td class="red">&nbsp;<strong>12%</strong>&nbsp;(decision)</td></tr>
    </tbody></table>
END_HTML

# Use any of the constructors to get your base object.  See the pod.
my $tree = HTML::TreeBuilder->new_from_content($blarg);

$tree->elementify;  # Make it just a plain HTML::Element object.

# Iterate over a list of rows.  look_down and related methods provide
# powerful ways to find matching elements.  Read the pod for more details.
my %crud_from_table;
for my $row ( $tree->look_down( _tag => 'tr' ) ) {
    # Each row holds three td cells: label, ':' separator, value.
    # grep { ref } skips any bare text nodes among the children.
    my ($key, undef, $value) = map { $_->as_text } grep { ref } $row->content_list;
    $crud_from_table{$key} = $value;
}

The most important part lies in understanding your HTML and being able to describe to look_down() how to find your desired information. Sometimes you can zoom right to it by matching an id. Other times you have to look for the third div of class 'foo' with a table in it. This is also the hardest part, and the one I can help you with the least. You are just going to have to experiment.
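As one illustration of describing a target to look_down(), the sketch below finds the cell whose text is a given label and then reads the last sibling cell to its right. It assumes the three-cell row layout from the question; `value_for_label` and `clean` are hypothetical helper names, not part of HTML::Element.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample rows copied from the question.
my $html = <<'END_HTML';
<table cellspacing="0">
<tr><td class="black">Number of directories</td><td class="black">:</td><td class="black">&nbsp;80</td></tr>
<tr><td class="black">Number&nbsp;of&nbsp;source&nbsp;lines</td><td class="black">:</td><td class="black">&nbsp;3245</td></tr>
</table>
END_HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Hypothetical helper: turn non-breaking spaces into spaces, then trim.
sub clean {
    my ($text) = @_;
    $text =~ tr/\xA0/ /;
    $text =~ s/^\s+|\s+$//g;
    return $text;
}

# Hypothetical helper: return the value text for the row with this label.
sub value_for_label {
    my ( $root, $wanted ) = @_;

    # Criteria are ANDed: the tag must be 'td' and the code ref must be true.
    my $label_cell = $root->look_down(
        _tag => 'td',
        sub { clean( $_[0]->as_text ) eq $wanted },
    ) or return;

    # In list context right() returns every following sibling;
    # the value is assumed to be the last cell in the row.
    my @siblings = grep { ref } $label_cell->right;
    return unless @siblings;

    return clean( $siblings[-1]->as_text );
}

my $dirs = value_for_label( $tree, 'Number of directories' );
print "Number of directories: $dirs\n" if defined $dirs;
```

Matching on the label text rather than on position keeps the lookup working if rows are added or reordered, at the cost of breaking if the label wording changes.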

Good luck.

thank you that helped so much, one problem my output is super nasty, why isn't as_text just giving me the string without html? $VAR1 = 'Numberáofásourceálines'; $VAR2 = 'á23182'; $VAR5 = 'Coverageáview'; $VAR6 = 'áAsáinstrumented'; $VAR9 = 'Thresholdápercent'; $VAR10 = 'á80á%'; $VAR11 = 'Number of directories'; — user391986, Jun 14 '11 at 17:22
@user391986, It's probably non-breaking spaces causing you the pain. Use `->as_trimmed_text` instead. — daotoad, Jun 14 '11 at 19:40
I ended up doing $testValue =~ s/\x{a0}//g; is that bad? It's the value that was shown when I did a dump. — user391986, Jun 14 '11 at 23:37
@user391986, that works, so it's not terrible. The HTML::Element docs are lame about this, but you can filter out the nbsp with `$e->as_trimmed_text( extra_chars => '\xA0' );` If the docs or my memory were better, I could have told you that sooner, or you could have figured it out. So it goes. I'm glad I was able to help. — daotoad, Jun 15 '11 at 04:10
@Arunav_Sanyal, elementify is a method on the HTML::TreeBuilder object. It turns the root of your HTML::Tree into an element, rather than a builder. — daotoad, May 13 '15 at 22:30
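
The non-breaking-space cleanup discussed in these comments can also be done by hand with core Perl only. This is a small sketch, assuming the text has already been pulled out with as_text(); `clean_cell_text` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: normalise text that came out of as_text().
sub clean_cell_text {
    my ($text) = @_;
    $text =~ s/\x{A0}/ /g;      # non-breaking space becomes an ordinary space
    $text =~ s/^\s+|\s+$//g;    # trim leading and trailing whitespace
    $text =~ s/\s+/ /g;         # collapse internal runs of whitespace
    return $text;
}

my $label = clean_cell_text("Number\x{A0}of\x{A0}source\x{A0}lines");
my $value = clean_cell_text("\x{A0}3245");

print "$label = $value\n";    # prints "Number of source lines = 3245"
```

Replacing `\x{A0}` with a space rather than deleting it matters for the labels: removing it outright would turn "Number of source lines" into "Numberofsourcelines".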
