Mirod's answer is awesome. This being Perl, I'll throw another approach out there.
Let's assume you have the HTML file in input.html. Here's a Perl program which uses the HTML::TreeBuilder module to extract the text:
#!/usr/bin/perl
use 5.10.0 ;
use strict ;
use warnings ;
use HTML::TreeBuilder ;
my $tree = HTML::TreeBuilder -> new () ;
$tree -> parse_file ( 'input.html' ) ;
my $text = ($tree -> address ( '0.1.0.2.0.0.0.1' ) -> content_list ()) [0] ;
say $text ;
Running it:
/tmp/tmp $ ./_extract-a.pl
XYZ 81.6 (-0.1)�
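So how did I come up with that '0.1.0.2.0.0.0.1' magic number? Each node in the tree that results from parsing the HTML file has an "address": a leading 0 for the root, followed by the child index taken at each level on the way down. The text that you are interested in is the first child of the node with the address '0.1.0.2.0.0.0.1'.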
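If it helps to see what an address really is, here is a minimal sketch that resolves one by hand and checks that it lands on the same node. It is only an illustration: walk_address is a hypothetical helper of mine, and the inline HTML is made up rather than taken from your file.

```perl
#!/usr/bin/perl
use 5.10.0 ;
use strict ;
use warnings ;
use HTML::TreeBuilder ;
use Scalar::Util qw( refaddr ) ;

# Hypothetical helper (not part of HTML::TreeBuilder): resolve an
# absolute address by following the child indexes one at a time.
sub walk_address {
    my ( $root, $address ) = @_ ;
    # The leading 0 stands for the root itself, so drop it.
    my ( undef, @indexes ) = split /\./, $address ;
    my $node = $root ;
    for my $i ( @indexes ) {
        $node = ( $node -> content_list () ) [$i] ;
        # Give up if we ran off the tree or landed on a plain text node.
        return unless ref $node ;
    }
    return $node ;
}

# Made-up sample markup, parsed from a string instead of a file.
my $tree = HTML::TreeBuilder -> new () ;
$tree -> parse ( '<p><b>one</b><i>two</i></p>' ) ;
$tree -> eof () ;

# Ask an element for its own address, then resolve that address by hand.
my $italic  = $tree -> look_down ( _tag => 'i' ) ;
my $address = $italic -> address () ;
my $found   = walk_address ( $tree, $address ) ;
my $same_node = defined $found && refaddr ( $found ) == refaddr ( $italic ) ;

say "address of <i>: $address" ;
say $same_node ? 'walking the indexes reaches the same node' : 'mismatch' ;
$tree -> delete () ;
```

In real code you would just call $tree -> address ( ... ) as in the program above; the helper is only there to show that each number is an index into the parent's content list.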
So, how do you display the node addresses? Here's a little program I call treebuilder-dump; when you pass it an HTML file, it displays it with the nodes labeled:
#!/usr/bin/perl
use 5.10.0 ;
use strict ;
use warnings ;
use HTML::TreeBuilder ;
my $tree = HTML::TreeBuilder->new ;
if ( @ARGV != 1 ) { die "Usage: $0 file.html\n" ; }
if ( ! -f $ARGV[0] ) { die "File does not exist: $ARGV[0]" ; }
$tree->parse_file ( $ARGV[0] ) ;
$tree->dump () ;
$tree->delete () ;
So for example, here's the output when run on your HTML snippet:
<html> @0 (IMPLICIT)
<head> @0.0 (IMPLICIT)
<body> @0.1 (IMPLICIT)
<table border="0" width="100%"> @0.1.0
<caption valign="top"> @0.1.0.0
<p class="InfoContent"> @0.1.0.0.0
<b> @0.1.0.0.0.0
<br /> @0.1.0.0.0.0.0
<tr> @0.1.0.1
<td colspan="3"> @0.1.0.1.0
<p class="InfoContent"> @0.1.0.1.0.0
<b> @0.1.0.1.0.0.0
"ABC"
<tr> @0.1.0.2
<td height="61" valign="top" width="31%"> @0.1.0.2.0
<p class="InfoContent"> @0.1.0.2.0.0
<b> @0.1.0.2.0.0.0
" "
<font color="#0000FF"> @0.1.0.2.0.0.0.1
"XYZ 81.6 (-0.1)�"
<br /> @0.1.0.2.0.0.0.1.1
"22/06/2011"
" "
You can see that the text you're interested in is located within the font color node, which has address 0.1.0.2.0.0.0.1.
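If the dump is long, reading it by eye gets tedious. As a rough sketch, and assuming you know a fragment of the text you are after, you could let look_down find the enclosing element and print its address. The sample markup and the $needle value below are made up for illustration; swap in parse_file and your own search string.

```perl
#!/usr/bin/perl
use 5.10.0 ;
use strict ;
use warnings ;
use HTML::TreeBuilder ;

# Made-up sample markup and search text (assumptions, not your real file).
my $html   = '<table><tr><td><b>ABC</b></td></tr>'
           . '<tr><td><font color="#0000FF">XYZ 81.6</font></td></tr></table>' ;
my $needle = 'XYZ' ;

my $tree = HTML::TreeBuilder -> new () ;
$tree -> parse ( $html ) ;
$tree -> eof () ;

# Keep every element that has a direct text child containing $needle.
my @nodes = $tree -> look_down ( sub {
    grep { ! ref ( $_ ) && index ( $_, $needle ) >= 0 } $_[0] -> content_list () ;
} ) ;

# Record "address <tag>" for each hit before the tree is deleted.
my @hits = map { $_ -> address () . ' <' . $_ -> tag () . '>' } @nodes ;
say for @hits ;
$tree -> delete () ;
```

Each printed address can then be passed to $tree -> address ( ... ) exactly as in the first program.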