
I need to read many HTML files that share a similar structure, using Perl.

The structure consists of STRRRR...E

  • S = the HTML header just before the table begins
  • T = a unique table-start structure in the HTML file (I can identify it)
  • R = a group of HTML elements (these are tr's; I can identify them too)
  • E = everything remaining, which signifies the end of the R's

I want to extract all the R's into an array using a single "m" match operator (see perlop).

I'm looking for something like this:

@all_Rs = $htmlfile=~m{ST(R)*E}gs;

But it has never worked out.

Until now I've been doing it in a roundabout way: deleting unwanted text, using a for loop, and so on. I want to extract all rows from this page: http://www.trainenquiry.com/StaticContent/Railway_Amnities/Enquiry%20-%20North/STATIONS.aspx and there are many such pages.
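One likely reason the match attempted above returns less than expected is that a quantified capture group such as (R)* keeps only its last repetition, so at most one R comes back. The sketch below is a minimal illustration using core Perl only. The sample HTML and the $T, $R and $E patterns are hypothetical stand-ins, not the real page's markup, and it assumes the rows are directly adjacent. It captures the whole run of R's first and then splits that run with a second /g match.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the S, T, R and E parts; a real page will differ.
my $html = '<html><body><table id="t">'
         . '<tr><td>a</td></tr><tr><td>b</td></tr><tr><td>c</td></tr>'
         . '</table></body></html>';

my $T = qr{<table id="t">};
my $R = qr{<tr>.*?</tr>}s;
my $E = qr{</table>};

# A quantified capture group keeps only its final repetition.
my @last_only = $html =~ m{$T($R)*$E};

# Two steps: capture the whole run of R's, then pull each R out with /g.
my @all_rs;
if ($html =~ m{$T((?:$R)*)$E}) {
    my $rows = $1;                 # copy $1 before running another match
    @all_rs = $rows =~ m{($R)}g;
}

print scalar(@last_only), " row from the one-step match\n";
print scalar(@all_rs), " rows from the two-step match\n";
```

This only shows the capture behaviour; on badly malformed markup a pattern like this can still miss or merge rows.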

nhahtdh
AgA
  • This is the 3rd or 4th time today that someone wants to do something with regexps, but insists on doing it with one single, glorious regex. Is that a sport, or what? I, for my part, am giving up. Only this much: it is as reasonable as insisting on having complex functionality in one big expression (rather than functions, modules, etc.). – Ingo Mar 25 '11 at 14:38
  • Don't parse HTML with regexps in the first place: http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not – reinierpost Mar 25 '11 at 14:57
  • Do you have regular expressions for S, T, R and E? If yes, and they work on their own, you can combine them as you outlined. – bvr Mar 25 '11 at 15:01
  • As I've commented below, I can't use a DOM parser because of the many errors this HTML page contains. – AgA Mar 25 '11 at 17:10
  • @Ingo I like your phrase "Is that a sport?" Maybe it is; actually, golf. – sawa Mar 25 '11 at 17:16
  • @user656848, that should be a good clue as to why regular expressions are also not going to get you what you want. Bad pages tend to either get worse or get fixed over time; either way, your regular expressions will break. – Ven'Tatsu Mar 25 '11 at 17:52

3 Answers

5

Regex is the wrong tool. Use an HTML parser.

use HTML::TreeBuilder::XPath;
my $tree= HTML::TreeBuilder::XPath->new_from_content(<<'END_OF_HTML');
<html>
    <table>
        <tr>1
        <tr>2
        <tr>3
        <tr>4
        <tr>5
    </table>
</html>
END_OF_HTML

print $_->as_text for $tree->findnodes('//tr');

HTML::TreeBuilder::XPath inherits from HTML::TreeBuilder.

daxim
  • No, no. I'm working on HTML files that are not syntactically correct; many opening and closing tags are missing. That's why I'm not using the DOM to traverse the tree. I want to extract all the rows of http://www.trainenquiry.com/StaticContent/Railway_Amnities/Enquiry%20-%20North/STATIONS.aspx . Please note that this page has plenty of HTML errors and is the worst page I've seen in my life. – AgA Mar 25 '11 at 17:07
2

daxim is right about using a real parser. My personal choice is XML::LibXML.

use XML::LibXML;
my $parser = XML::LibXML->new();
$parser->recover(1);                 # don't fail on parsing errors
my $doc = do { 
    local $SIG{__WARN__} = sub {};   # silence warning about parsing errors
    $parser->parse_html_file('http://www.trainenquiry.com/StaticContent/Railway_Amnities/Enquiry%20-%20North/STATIONS.aspx');
};

print $_->toString() for $doc->findnodes('//tr[td[1][@class="td_background"]]');

This gets me each station row from that page.

With a bit more work, we can build a nice data structure that holds the text of each cell.

use Data::Dumper;
my @data = map {
    my $row = $_;
    [ map {
        $_->findvalue('normalize-space(text())');
    } $row->findnodes('td') ]
} $doc->findnodes('//tr[td[1][@class="td_background"]]');
print Dumper \@data;
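As a usage sketch: assuming @data ends up as an array of array references, one per row, as built above, each row can be flattened into a tab-separated line. The sample rows here are made-up placeholders in that shape, not values taken from the page.

```perl
use strict;
use warnings;

# Made-up placeholder rows in the same array-of-arrays shape as @data.
my @data = (
    [ 'Station A', 'AAA', 'yes' ],
    [ 'Station B', 'BBB', 'no'  ],
);

# Build one tab-separated line per row.
my @lines = map { join "\t", @$_ } @data;
print "$_\n" for @lines;
```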
Ven'Tatsu
  • Where can I download the XML module along with its documentation? – AgA Mar 25 '11 at 18:42
  • The link in my answer will lead to a CPAN page with documentation and files you can download for manual installation. If you are on Linux or a similar system, your distribution might have a package that can be installed; otherwise, if you have root access, the `cpan` command can be used, e.g. `cpan XML::LibXML`. If you're on Windows using ActivePerl, it comes with a tool, `ppm`, that should be able to install XML::LibXML. – Ven'Tatsu Mar 25 '11 at 21:32
2

If you want to process an HTML table, consider using a module that knows how to process HTML tables!

#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;
use HTML::TableExtract;


my $html = get 'http://www.trainenquiry.com/StaticContent/Railway_Amnities/Enquiry%20-%20North/STATIONS.aspx';
$html =~ s/&nbsp;/ /g;

my $te = HTML::TableExtract->new( depth => 1, count => 2 );
$te->parse($html);
foreach my $ts ($te->table_states) {
   foreach my $row ($ts->rows) {
      next if $row->[0] =~ /^\s*(Next|Station)/;
      next if $row->[4] =~ /^\s*(ARR\/DEP|RESERVATION)/;
      foreach my $cell (@$row) {
          $cell =~ s/^\s+//;
          $cell =~ s/\s+$//;
          print "$cell\n";
      }
      print "\n";
   }
}
tadmc