4

I have a reporting tool that writes job scheduling statistics to an HTML file, and I want to consume this data using Perl. I don't know how to step through an HTML table, though.

I know how to do this with jQuery using

$('tr').each(function () {
  var text = $(this).find('td').text();
});

But I don't know how to apply the same logic in Perl. What should I do? Below is a sample of the HTML output. Each table row includes the same three stats: object name, status, and return code.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN">
<HTML>
<HEAD>
<meta name="GENERATOR" content="UC4 Reporting Tool V8.00A">
<Title></Title>
<style type="text/css">
th,td {
font-family: arial;
font-size: 0.8em;
}

th {
background: rgb(77,148,255);
color: white;
}

td {
border: 1px solid rgb(208,213,217);
}  

table {
border: 1px solid grey; 
background: white;
}

body {
background: rgb(208,213,217);
}
</style>
</HEAD>
<BODY>
<table>
<tr>
  <th>Object name</th>
  <th>Status</th>
  <th>Return code</th>
</tr>
<tr>
  <td>JOBS.UNIX.S_SITEVIEW.WF_M_SITEVIEW_CHK_FACILITIES_REGISTRY</td>
  <td>ENDED_OK - ended normally</td>
  <td>0</td>
</tr>
<tr>
  <td>JOBS.UNIX.ADMIN.INFA_CHK_REP_SERVICE</td>
  <td>ENDED_OK - ended normally</td>
  <td>0</td>
</tr>
<tr>
  <td>JOBS.UNIX.S_SITEVIEW.WF_M_SITEVIEW_CHK_FACILITIES_REGISTRY</td>
  <td>ENDED_OK - ended normally</td>
  <td>0</td>
</tr>
Mark Cheek

5 Answers

11

The HTML::Query module is a wrapper around the HTML parser that provides a querying interface that is familiar to jQuery users. So you could write something like

use HTML::Query qw(Query);
my $docName = "test.html";
my $doc = Query(file => $docName);

# outer loop: every table row; inner loop: every cell in that row
for my $tr ($doc->query("tr")) {
  for my $td (Query($tr)->query("td")) {
    # $td is now an HTML::Element object for the td element
    print $td->as_text, "\n";
  }
}

Read the HTML::Query documentation for a better idea of how to use it; the above is hardly the prettiest example.

araqnid
  • Oh hey, shiny thing! I didn't know about [HTML::Query](https://metacpan.org/module/HTML::Query) before. The questioner might have an easier time using the `text` or `file` parameter rather than the `tree` parameter, though. `tree` expects an [HTML::Element](https://metacpan.org/module/HTML::Element) object. – Brian Wisti Sep 30 '11 at 16:27
  • @BrianWisti nice and it installs cleanly, this should be the accepted answer. araqnid can you add the missing part on top of your source? So the example will be complete: use HTML::Query; use HTML::TreeBuilder; my $docName = "test.html"; my $doc = HTML::TreeBuilder->new; $doc->parse_file($docName); – stivlo Sep 30 '11 at 17:09
  • This might be a better solution than mine. Especially, if you can't get HTML::TableContentParser to install. It is pretty out of date. – aus Sep 30 '11 at 18:08
9

You could use a regular expression, but Perl already has modules built for this specific task. Check out HTML::TableContentParser.

You would probably do something like this:

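use HTML::TableContentParser;

# $HTML holds the HTML source as a string
my $tcp    = HTML::TableContentParser->new;
my $tables = $tcp->parse($HTML);

foreach my $table (@$tables) {
  foreach my $row (@{ $table->{rows} }) {
    # the module stores each row's <td> cells under the "cells" key
    foreach my $cell (@{ $row->{cells} }) {
      # each <td>
      my $data = $cell->{data};
    }
  }
}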
aus
  • 4
    using regex for html is the root of all evil. http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html – CountMurphy Sep 30 '11 at 15:53
  • 2
    You *could* use a regexp, but yes... it's a terrible, very bad idea. – aus Sep 30 '11 at 15:57
  • 2
    @CountMurphy I love [that answer](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). It's sublimated art. – stivlo Sep 30 '11 at 16:02
  • @stivlo that just made my day – CountMurphy Sep 30 '11 at 16:10
  • 1
    The only problem with the [HTML::TableContentParser](https://metacpan.org/module/HTML::TableContentParser) solution is that the module hasn't been updated since 2002 and won't install on my Perl (5.14.1 on OS X) without forcing, thanks to the way tests are laid out. Unless it's available as a package for whatever flavor of Perl the questioner is using on whatever platform, they aren't going to get far with it. – Brian Wisti Sep 30 '11 at 16:17
  • @BrianWisti, you're right, I also have failed tests with Perl v5.10.1 on Ubuntu Natty Narwhal. – stivlo Sep 30 '11 at 16:52
  • 1
    So, fix the module. It's Open Source. You get some benefit, and you can pay it back to the community. :) – brian d foy Sep 30 '11 at 19:38
3

Here I use HTML::Parser. It is a little verbose, but guaranteed to work. Because I read input with the diamond operator, you can use the script as a filter. If you call this Perl source extractTd, here are a couple of ways to call it.

$ extractTd test.html

or

$ extractTd < test.html

Both work. Output goes to standard output, and you can redirect it to a file.

#!/usr/bin/perl -w

use strict;

package ExtractTd;
use 5.010;
use base "HTML::Parser";

my $td_flag = 0;

sub start {
    my ($self, $tag, $attr, $attrseq, $origtext) = @_; 
    if ($tag =~ /^td$/i) {
        $td_flag = 1;
    }   
}

sub end {
    my ($self, $tag, $origtext) = @_; 
    if ($tag =~ /^td$/i) {
        $td_flag = 0;
    }   
}

sub text {
    my ($self, $text) = @_; 
    if ($td_flag) {
        say $text;
    }   
}

my $extractTd = ExtractTd->new;
while (<>) {
    $extractTd->parse($_);
}
$extractTd->eof;
stivlo
2

Have you tried looking on CPAN for HTML libraries? This one seems to do what you want: http://search.cpan.org/~msisk/HTML-TableExtract-2.11/lib/HTML/TableExtract.pm

Also here is a whole page of different HTML related libraries to use http://search.cpan.org/search?m=all&q=html+&s=1&n=100
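
As a rough sketch of how HTML::TableExtract could be applied to the sample report in the question: the `headers` option selects the table by its column titles, and each row then comes back as an array reference in header order. The `job_stats` helper and the hash keys `name`, `status`, and `rc` are names made up for this illustration, and the inline HTML is a shortened copy of the sample rather than real tool output.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Return one hash ref per data row of the table whose headers match.
# (job_stats and the hash keys are illustrative names, not part of the module.)
sub job_stats {
    my ($html) = @_;

    my $te = HTML::TableExtract->new(
        headers => ['Object name', 'Status', 'Return code'],
    );
    $te->parse($html);

    my @stats;
    for my $table ($te->tables) {
        # With the headers option, the header row itself is not returned.
        for my $row ($table->rows) {
            my ($name, $status, $rc) = @$row;
            push @stats, { name => $name, status => $status, rc => $rc };
        }
    }
    return @stats;
}

# Shortened copy of the sample report from the question.
my $html = <<'HTML';
<table>
<tr><th>Object name</th><th>Status</th><th>Return code</th></tr>
<tr><td>JOBS.UNIX.S_SITEVIEW.WF_M_SITEVIEW_CHK_FACILITIES_REGISTRY</td><td>ENDED_OK - ended normally</td><td>0</td></tr>
<tr><td>JOBS.UNIX.ADMIN.INFA_CHK_REP_SERVICE</td><td>ENDED_OK - ended normally</td><td>0</td></tr>
</table>
HTML

for my $job (job_stats($html)) {
    print "$job->{name}: $job->{status} (rc=$job->{rc})\n";
}
```

Selecting by header text means the code keeps working if the tool ever adds another table to the page, at the cost of breaking if the column titles change.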

CountMurphy
2

Use the CPAN module HTML::TreeBuilder.

I use it extensively to parse a lot of HTML documents.

The concept is that you start from an HTML::Element (the root node, for example). From it, you can look for other nodes:

  • Get a list of children nodes with ->content_list()
  • Get the parent node with ->parent()

Disclaimer: The following code has not been tested, but it's the idea.

use HTML::TreeBuilder;

# $content holds the HTML source as a string
my $root = HTML::TreeBuilder->new;
$root->utf8_mode(1);
$root->parse($content);
$root->eof();
# This gets you an HTML::Element, of the root document
$root->elementify();

# look_down returns every descendant element whose tag is "td"
my @td = $root->look_down("_tag", "td");
foreach my $td_elem (@td)
{
    printf "-> %s\n", $td_elem->as_trimmed_text();
}

If your table is more complex than that, you could first find the TABLE element, then iterate over its TR children, and for each TR, iterate over its TD elements.

http://metacpan.org/pod/HTML::TreeBuilder
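
The row-by-row iteration described above could look roughly like the sketch below. It parses a string with `new_from_content`; to read the report from disk instead, `HTML::TreeBuilder->new_from_file("test.html")` should work the same way (the file name is only an example). The `table_rows` helper is a made-up name, and the inline HTML is a shortened copy of the sample from the question.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Return one array ref of cell texts per data row.
# (table_rows is an illustrative helper name, not part of the module.)
sub table_rows {
    my ($html) = @_;
    my $root = HTML::TreeBuilder->new_from_content($html);

    my @rows;
    for my $tr ($root->look_down(_tag => 'tr')) {
        # The header row holds only <th> cells, so it yields no <td> and is skipped.
        my @cells = map { $_->as_trimmed_text } $tr->look_down(_tag => 'td');
        push @rows, \@cells if @cells;
    }

    $root->delete;    # free the parse tree once we are done with it
    return @rows;
}

# Shortened copy of the sample report from the question.
my $html = <<'HTML';
<table>
<tr><th>Object name</th><th>Status</th><th>Return code</th></tr>
<tr><td>JOBS.UNIX.ADMIN.INFA_CHK_REP_SERVICE</td><td>ENDED_OK - ended normally</td><td>0</td></tr>
</table>
HTML

for my $row (table_rows($html)) {
    my ($name, $status, $rc) = @$row;
    print "$name | $status | $rc\n";
}
```

Note that this simple version treats every `tr` in the document alike; with nested or multiple tables you would first narrow the search to the TABLE element you want.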

szabgab
folays
  • I assume that '$content' is the HTML file. Forgive me, my Perl knowledge is small, but what would be my declaration statement for '$content'? (i.e. 'my $content = '? – Mark Cheek Oct 06 '11 at 17:19