How to extract specific information from an HTML webpage using Perl

Question

If the information "XYZ 81.6 (-0.1)" needs to be extracted from an HTML webpage, how can it be done with Perl? Many thanks.

<table border="0" width="100%">
          <caption valign="top">
            <p class="InfoContent"><b><br></b>
          </caption>
          <tr>
            <td colspan="3"><p class="InfoContent"><b>ABC</b></td>
          </tr>
          <tr>
            <td valign="top" height="61" width="31%">
              <p class="InfoContent"><b><font color="#0000FF">XYZ 81.6 (-0.1)&nbsp;<br>22/06/2011</font></b></p>
            </td>
          </tr></table>

score 4 · Answer 1 · edited Jun 23 '11 at 13:34

I would use HTML::TreeBuilder::XPath for this (and yes, it is a shameless plug!):

#!/usr/bin/perl

use strict;
use warnings;

use HTML::TreeBuilder::XPath;

my $t= HTML::TreeBuilder::XPath->new_from_file( shift @ARGV);

my $text= $t->findvalue( '//p[@class="InfoContent"]/b/font[@color="#0000FF"]');

$text=~ s{\).*}{)};

print "found '$text'\n";

It is quite fragile though: as far as I can tell the only way to narrow down the XPath expression to just what you want is to use the font tag. That is likely to change in the future, so if (when!) the code breaks, that's where you'll have to look first.
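If the individual fields are needed rather than the whole string, the substitution step can be swapped for a capturing match. This is only a sketch using core Perl: the sample string is an assumption about what findvalue returns for the question's HTML (the &nbsp; decoded to U+00A0 and the date following directly), and the variable names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed sample: the text of the <font> element in the question's HTML,
# with &nbsp; decoded to U+00A0 and <br> contributing no text.
my $text = "XYZ 81.6 (-0.1)\x{A0}22/06/2011";

# Capture the label, the value and the bracketed change as separate fields.
# The match fails loudly instead of silently printing something unexpected.
my ($label, $value, $change) =
    $text =~ m{^\s*(\w+)\s+(-?\d+(?:\.\d+)?)\s+\(([-+]?\d+(?:\.\d+)?)\)}
    or die "unexpected text format: '$text'\n";

print "label=$label value=$value change=$change\n";
```

Dying on a failed match means a change in the page layout shows up as an error rather than as wrong data.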

edited Jun 23 '11 at 13:34 by Konerak
answered Jun 23 '11 at 13:22 by mirod

This is the only answer that actually offers a concrete solution :) – Konerak Jun 23 '11 at 13:33
Yep, sorry about that, maybe I should have just linked to the usual http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – mirod Jun 23 '11 at 13:42

score 0 · Answer 2 · answered Jun 23 '11 at 16:54

You can use something like this:

bash-3.2$ perl -MLWP::Simple -le ' $current_value = get("http://stackoverflow.com/questions/6454398/how-to-extract-specific-information-from-html-webpage-using-perl"); if ($current_value=~/(XYZ\s\d+\.\d+\s\(.*?\))/s) { print "Matched pattern is:\t $1";} '
Matched pattern is:      XYZ 81.6 (-0.1)

dharmatech · Answer 3 · 2011-06-23T17:39:27.717

Mirod's answer is awesome. This being Perl, I'll throw another approach out there.

Let's assume you have the HTML file in input.html. Here's a Perl program which uses the HTML::TreeBuilder module to extract the text:

#!/usr/bin/perl

use 5.10.0 ;
use strict ;
use warnings ;

use HTML::TreeBuilder ;

my $tree = HTML::TreeBuilder -> new () ;

$tree -> parse_file ( 'input.html' ) ;

my $text = ($tree -> address ( '0.1.0.2.0.0.0.1' ) -> content_list ()) [0] ;

say $text ;

Running it:

/tmp/tmp $ ./_extract-a.pl 
XYZ 81.6 (-0.1)�

So how did I come up with that '0.1.0.2.0.0.0.1' magic number? Each node in the tree that results from parsing the HTML file has an "address". The font element that holds the text you are interested in has the address '0.1.0.2.0.0.0.1', and the text itself is the first item in that element's content list.

So, how do you display the node addresses? Here's a little program I call treebuilder-dump; when you pass it an HTML file, it displays it with the nodes labeled:

#!/usr/bin/perl

use 5.10.0 ;
use strict ;
use warnings ;

use HTML::TreeBuilder ;

my $tree = HTML::TreeBuilder->new ;

# '! @ARGV == 1' parses as '(! @ARGV) == 1', so compare the count directly
if ( @ARGV != 1 ) { die "Expected exactly one HTML file argument\n" ; }

if ( ! -f $ARGV[0] ) { die "File does not exist: $ARGV[0]" ; }

$tree->parse_file ( $ARGV[0] ) ;

$tree->dump () ;

$tree->delete () ;

So for example, here's the output when run on your HTML snippet:

<html> @0 (IMPLICIT)
  <head> @0.0 (IMPLICIT)
  <body> @0.1 (IMPLICIT)
    <table border="0" width="100%"> @0.1.0
      <caption valign="top"> @0.1.0.0
        <p class="InfoContent"> @0.1.0.0.0
          <b> @0.1.0.0.0.0
            <br /> @0.1.0.0.0.0.0
      <tr> @0.1.0.1
        <td colspan="3"> @0.1.0.1.0
          <p class="InfoContent"> @0.1.0.1.0.0
            <b> @0.1.0.1.0.0.0
              "ABC"
      <tr> @0.1.0.2
        <td height="61" valign="top" width="31%"> @0.1.0.2.0
          <p class="InfoContent"> @0.1.0.2.0.0
            <b> @0.1.0.2.0.0.0
              " "
              <font color="#0000FF"> @0.1.0.2.0.0.0.1
                "XYZ 81.6 (-0.1)�"
                <br /> @0.1.0.2.0.0.0.1.1
                "22/06/2011"
              " "

You can see that the text you're interested in is located within the font color node which has address 0.1.0.2.0.0.0.1.
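To make the address idea concrete, here is a small sketch of how such a dotted address can be resolved by hand. It uses a toy tree of plain hashes and strings rather than HTML::Element objects, and node_at is a hypothetical helper written for this illustration, not part of HTML::TreeBuilder. The leading 0 names the root, and every later number is an index into the content list of the node reached so far; text nodes occupy positions too, which is why the font element sits at index 1 under the b element.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy model (not HTML::Element itself): an element is a hash with a tag and
# a content list; a text node is a plain string. This mirrors the <b> node
# from the dump above, used here as the root of a small tree.
my $root = {
    tag     => 'b',
    content => [
        ' ',
        {
            tag     => 'font',
            content => [ 'XYZ 81.6 (-0.1)', { tag => 'br', content => [] }, '22/06/2011' ],
        },
        ' ',
    ],
};

# Resolve a dotted address: the leading 0 is the root, each later number is
# an index into the content list of the node reached so far.
sub node_at {
    my ($node, $address) = @_;
    my ($first, @path) = split /\./, $address;
    die "address must start at the root (0)\n"
        unless defined $first && $first eq '0';
    for my $index (@path) {
        die "cannot descend into a text node\n" unless ref $node;
        $node = $node->{content}[$index];
        die "no node at index $index\n" unless defined $node;
    }
    return $node;
}

my $font = node_at($root, '0.1');
print "$font->{tag}: $font->{content}[0]\n";
```

Because every index depends on the exact shape of the tree, an address like this breaks as soon as an element is added or removed above the target, so it is best treated as a quick one-off tool.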
