I'm trying to extract some information from HTML using Perl. I found out about TreeBuilder, Element, and Parser; which one should I use? How would I extract the name and the value of a row in the table below? The table is embedded in a larger HTML structure, and the only way to target the field I want is by the text of its first column, e.g. "Number of directories". Or should I just run a regex over the entire HTML?

<table cellspacing="0">
    <tbody><tr><td class="black">Number of directories</td><td class="black">:</td><td class="black">&nbsp;80</td></tr>
        <tr><td class="black">Number&nbsp;of&nbsp;monitored&nbsp;source&nbsp;files</td><td class="black">:</td><td class="black">&nbsp;425</td></tr>
        <tr><td class="black">Number of functions</td><td class="black">:</td><td class="black">&nbsp;6245</td></tr>
        <tr><td class="black">Number&nbsp;of&nbsp;source&nbsp;lines</td><td class="black">:</td><td class="black">&nbsp;3245</td></tr>
        <tr><td class="black">Number&nbsp;of&nbsp;measurement&nbsp;points</td><td class="black">:</td><td class="black">&nbsp;2457</td></tr>
        <tr><td class="red">TER</td><td class="red">:</td><td class="red">&nbsp;<strong>12%</strong>&nbsp;(decision)</td></tr>
    </tbody></table>
user391986
  • Whoever made that HTML does not understand the purpose of CSS: `class='red'` indeed. Too bad, because code with good structural CSS is easier to work with. – daotoad Jun 14 '11 at 02:02

3 Answers

If you need to extract data from an HTML table, then

use HTML::TableExtract;

would be a good choice.
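
A minimal sketch of how that might look for this table, assuming the default constructor (no `headers` constraint, so every table in the document is extracted). The `%stats` hash, the `clean` helper, and the two-row sample are made up for illustration. Since the labels run down the left side rather than across the top, each row is treated as label, separator, value:

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical sample: two rows copied from the table in the question.
my $html = <<'END_HTML';
<table cellspacing="0">
<tr><td class="black">Number of directories</td><td class="black">:</td><td class="black">&nbsp;80</td></tr>
<tr><td class="black">Number&nbsp;of&nbsp;source&nbsp;lines</td><td class="black">:</td><td class="black">&nbsp;3245</td></tr>
</table>
END_HTML

# No headers/depth/count constraints: all tables are extracted.
my $te = HTML::TableExtract->new;
$te->parse($html);

my %stats;
for my $table ( $te->tables ) {
    for my $row ( $table->rows ) {
        # Each row is an array ref of cell text: label, ":" separator, value.
        my ( $label, undef, $value ) = map { clean($_) } @$row;
        $stats{$label} = $value;
    }
}

# Turn non-breaking spaces into plain spaces, then trim and collapse whitespace.
sub clean {
    my ($text) = @_;
    $text = '' unless defined $text;
    $text =~ s/\xA0/ /g;
    $text =~ s/^\s+|\s+$//g;
    $text =~ s/\s+/ /g;
    return $text;
}

print "$_ = $stats{$_}\n" for sort keys %stats;
```

After this runs, `$stats{'Number of directories'}` holds the value for the row you care about.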

tadmc
  • Unfortunately, HTML::TableExtract is oriented towards tables with headers across the top instead of down the left side (like this table is). – cjm Jun 17 '11 at 02:29
Of course everyone is going to have their own favorite. I prefer HTML::TokeParser; I find it easy to understand and use (once you get over the hump of how the returned arrays work). And of course I have to point you to the classic SO post reminding you to please not parse HTML with regular expressions.
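
As a rough sketch of the token-based approach (not the only way to drive HTML::TokeParser): walk the `td` start tags, collect the text of each cell, and store a key/value pair whenever a row closes. The `%stats` and `@cells` names and the two-row sample are assumptions for illustration:

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample: two rows copied from the table in the question.
my $html = <<'END_HTML';
<table cellspacing="0">
<tr><td class="black">Number of directories</td><td class="black">:</td><td class="black">&nbsp;80</td></tr>
<tr><td class="black">Number&nbsp;of&nbsp;source&nbsp;lines</td><td class="black">:</td><td class="black">&nbsp;3245</td></tr>
</table>
END_HTML

# A reference to a scalar tells the parser to read from that string.
my $parser = HTML::TokeParser->new( \$html )
    or die "Cannot create parser";

my ( %stats, @cells );

# get_tag returns an array ref whose first element is the tag name
# ('td' for a start tag, '/tr' for an end tag), or undef at end of input.
while ( my $token = $parser->get_tag( 'td', '/tr' ) ) {
    if ( $token->[0] eq 'td' ) {
        # All text up to the closing </td>, entities already decoded.
        my $text = $parser->get_text('/td');
        $text =~ s/\xA0/ /g;         # non-breaking spaces become spaces
        $text =~ s/^\s+|\s+$//g;     # trim both ends
        $text =~ s/\s+/ /g;          # collapse internal whitespace
        push @cells, $text;
    }
    else {
        # End of row: the first cell is the label, the last one the value.
        $stats{ $cells[0] } = $cells[-1] if @cells >= 2;
        @cells = ();
    }
}

print "$_ = $stats{$_}\n" for sort keys %stats;
```

The "return arrays" hump is the `$token->[0]` business: every token is an array ref, and its layout depends on the kind of token.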

Joel Berger
There are a few steps.

  1. Use one of HTML::TreeBuilder's constructors to parse the HTML.
  2. Convert the HTML::TreeBuilder object at the root into an HTML::Element by calling elementify.
  3. Understand the structure of your HTML well enough that you can tell HTML::Element::look_down() how to find the bits you are interested in. You can specify criteria in almost any form imaginable.
  4. Use HTML::Element::look_down(), content_list(), left(), right() and related methods to traverse the area of interest and extract data. DO NOT USE traverse()--it was a bad idea.
  5. Pass the data you collected to whatever system asked for it in the first place.

Here's some code:

use strict;
use warnings;
use HTML::TreeBuilder;

my $blarg = <<'END_HTML';
<table cellspacing="0">
    <tbody><tr><td class="black">Number of directories</td><td class="black">:</td><td class="black">&nbsp;80</td></tr>
        <tr><td class="black">Number&nbsp;of&nbsp;monitored&nbsp;source&nbsp;files</td><td class="black">:</td><td class="black">&nbsp;425</td></tr>
        <tr><td class="black">Number of functions</td><td class="black">:</td><td class="black">&nbsp;6245</td></tr>
        <tr><td class="black">Number&nbsp;of&nbsp;source&nbsp;lines</td><td class="black">:</td><td class="black">&nbsp;3245</td></tr>
        <tr><td class="black">Number&nbsp;of&nbsp;measurement&nbsp;points</td><td class="black">:</td><td class="black">&nbsp;2457</td></tr>
        <tr><td class="red">TER</td><td class="red">:</td><td class="red">&nbsp;<strong>12%</strong>&nbsp;(decision)</td></tr>
    </tbody></table>
END_HTML

# Use any of the constructors to get your base object.  See the pod.
my $tree = HTML::TreeBuilder->new_from_content($blarg);

$tree->elementify;  # Make it just a plain HTML::Element object.

# Iterate over the rows. look_down and related methods provide powerful ways
# to find matching elements. Read the pod for more details.
my %crud_from_table;
for my $row ( $tree->look_down( _tag => 'tr' ) ) {
    # Each row has three td cells: label, a ":" separator, and the value.
    # extra_chars => '\xA0' makes as_trimmed_text treat &nbsp; as whitespace too.
    my ($key, undef, $value) = map $_->as_trimmed_text( extra_chars => '\xA0' ), $row->content_list;
    $crud_from_table{$key} = $value;
}

The most important part lies in understanding your HTML well enough to describe to look_down() how to find the information you want. Sometimes you can zoom right to it by matching an id. Other times you have to look for the third div of class 'foo' with a table in it. This is also the hardest part, and the one I can help you with the least. You are just going to have to experiment.
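
For the case in the question, where the only handle is the label text, one possible sketch is to pass look_down() a code ref as an extra criterion and then step from the matching cell to the last cell in its row. The `$label_cell`, `@cells`, and `%trim` names and the two-row sample are made up for illustration:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample: two rows copied from the table in the question.
my $html = <<'END_HTML';
<table cellspacing="0">
<tr><td class="black">Number of directories</td><td class="black">:</td><td class="black">&nbsp;80</td></tr>
<tr><td class="black">Number&nbsp;of&nbsp;source&nbsp;lines</td><td class="black">:</td><td class="black">&nbsp;3245</td></tr>
</table>
END_HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Treat &nbsp; (U+00A0) as whitespace when trimming cell text.
my %trim = ( extra_chars => '\xA0' );

# In scalar context look_down returns the first element that satisfies
# every criterion; a code ref criterion receives the element as $_[0].
my $label_cell = $tree->look_down(
    _tag => 'td',
    sub { $_[0]->as_trimmed_text(%trim) eq 'Number of directories' },
) or die "Label cell not found";

# The value is the last td in the same row as the label.
my @cells = $label_cell->parent->content_list;
my $value = $cells[-1]->as_trimmed_text(%trim);

print "Number of directories: $value\n";

$tree->delete;    # free the tree when done
```

Matching on the trimmed text also works for labels that use `&nbsp;` between words, such as "Number of source lines".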

Good luck.

daotoad
  • thank you that helped so much, one problem my output is super nasty, why isn't as_text just giving me the string without html? $VAR1 = 'Numberáofásourceálines'; $VAR2 = 'á23182'; $VAR5 = 'Coverageáview'; $VAR6 = 'áAsáinstrumented'; $VAR9 = 'Thresholdápercent'; $VAR10 = 'á80á%'; $VAR11 = 'Number of directories'; – user391986 Jun 14 '11 at 17:22
  • @user391986, It's probably non-breaking spaces causing you the pain. Use `->as_trimmed_text` instead. – daotoad Jun 14 '11 at 19:40
  • I ended up doing $testValue =~ s/\x{a0}//g; is that bad? It's the value that was shown when I did a dump. – user391986 Jun 14 '11 at 23:37
  • @user391986, that works, so it's not terrible. The HTML::Element docs are lame about this, but you can filter out the nbsp with `$e->as_trimmed_text( extra_chars => '\xA0' );` If the docs or my memory were better, I could have told you that sooner, or you could have figured it out. So it goes. I'm glad I was able to help. – daotoad Jun 15 '11 at 04:10
  • elementify is not an object in HTML::TreeBuilder – Arunav Sanyal May 12 '15 at 19:19
  • @Arunav_Sanyal, elementify is a method on the HTML::TreeBuilder object. It turns the root of your HTML::Tree into an element, rather than a builder. – daotoad May 13 '15 at 22:30