
I have an HTML file containing a 2-column table which I want to parse in order to extract pairs of strings representing the columns. The page layout of the HTML (white space, new lines) is arbitrary, hence I can't parse the file line by line.

I recall that you can parse such a thing by slurping the whole file into a string and operating on the entire string, which I'm finding a bit more challenging. I'm trying things like the following:

#!/usr/bin/perl

open(FILE, "Glossary") || die "Couldn't open file\n";
@lines = <FILE>;
close(FILE);

$data = join(' ', @lines);

while ($data =~ /<tr>.*(<td>.*<\/td>).*(<td>.*<\/td>).*<\/tr>/g) {
    print $1, ":", $2, "\n";
}

which gives a null output. Here's a section of the input file:

<table class="wikitable">
    <tr>
        <td><b>Term</b>
        </td>
        <td><b>Meaning</b>
        </td></tr>
    <tr>
        <td><span id="0-Day">0-Day</span>
        </td>
        <td>
        <p>See <a href="#Zero_Day">Zero Day</a>.
        </p>
        </td>

Can someone help me out?

Ivar
pleriche
  • Use `HTML::TableExtract` – Borodin Nov 12 '17 at 21:20
  • To correct my earlier comment (removed): while I recommend [HTML::TreeBuilder](http://search.cpan.org/~kentnl/HTML-Tree-5.07/lib/HTML/TreeBuilder.pm) for general parsing of HTML (and there are others), here you indeed want `HTML::TableExtract`. And you _do not_ want to use regex – zdim Nov 12 '17 at 21:46
  • [You can't parse HTML with a regex](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – Dave Cross Nov 13 '17 at 08:56

2 Answers


There is a HTML::TableExtract module in CPAN, which simplifies the problem you are trying to solve:

use strict;
use warnings;
use HTML::TableExtract qw(tree);

# headers expects an array reference, not a flat list
my $te = HTML::TableExtract->new( headers => [qw(Term Meaning)] );
my $html_file = "Glossary";
$te->parse_file($html_file);
my $table = $te->first_table_found;
# ... 
Miguel Prz
  • Thank you, and I'm sure TableExtract is the better way of doing it, but the object of my question was to improve my understanding of how to use regular expressions, since they're so central to Perl. Adding gs to the regex as someone suggested (since deleted) was the leg-up I needed. – pleriche Nov 14 '17 at 10:11
  • I see your point, and it's really important to build a solid knowledge of regexes. But, as other people have said, it's not a good idea to apply regexes to parsing HTML documents – Miguel Prz Nov 14 '17 at 11:36

You already have answers explaining why you shouldn't parse HTML with regexes. And you really shouldn't. But you've asked for an explanation of why your code doesn't work. So here goes...

You have two problems in your code. One stops it working and the other stops it working as you expect.

Firstly, you are using . in your regex to match any character. But . doesn't match any character. It matches any character except a newline. And you have newlines in your string. You fix that by adding the /s option to your match operator (so it has /gs instead of /g).
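A small illustration of that difference, using a made-up two-line string rather than the file from the question:

```perl
use strict;
use warnings;

# Hypothetical sample: a cell whose content spans two lines.
my $cell = "<td>Term\n</td>";

# Without /s, "." stops at the newline, so the pattern never reaches </td>.
my $without_s = $cell =~ /<td>.*<\/td>/  ? "match" : "no match";

# With /s, "." matches the newline as well.
my $with_s    = $cell =~ /<td>.*<\/td>/s ? "match" : "no match";

print "without /s: $without_s\n";   # without /s: no match
print "with /s:    $with_s\n";      # with /s:    match
```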

With that fix in place, you get a result from your code. Using your test data, I see:

<td><b>Term</b>
         </td>:<td><b>Meaning</b>
         </td>

Which is correct. But looking at your test data, I wondered why I wasn't getting two results - because of the /g. I soon realised it was because your test data is missing the closing </tr> on the second row. When I added that, I got this result:

<td><span id="0-Day">0-Day</span>
         </td>:<td>
         <p>See <a href="#Zero_Day">Zero Day</a>.
         </p>
         </td>

Ok. It's now finding the second result. But what has happened to the first one? That's the second error in your code.

You have .* a few times in your regex. That means "zero or more of any character". But it's the "or more" that is a problem here. By default, Perl regex quantifiers (* or +) are greedy. That means they will use up as much of the string as possible while still allowing the rest of the pattern to match. And the first .* in your regex is eating up a lot of your string. Everything up to the first <td> of the second row, in fact.
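The same effect can be seen on a much smaller string. This is only a sketch with a made-up one-line sample, not the question's data:

```perl
use strict;
use warnings;

# Hypothetical sample: two rows, each holding a single cell.
my $rows = "<tr><td>a</td></tr><tr><td>b</td></tr>";

# Greedy: the first .* grabs as much as it can while still letting the
# rest of the pattern match, so the capture ends up being the LAST cell.
my ($greedy) = $rows =~ /<tr>.*(<td>.*<\/td>)/;

print "$greedy\n";   # <td>b</td>
```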

The solution to that is to make the .* non-greedy. And you do that by adding ? to the end. So you can replace all of the .* with .*?. Having done that, I get this output:

<td><b>Term</b>
         </td>:<td><b>Meaning</b>
         </td>
<td><span id="0-Day">0-Day</span>
         </td>:<td>
         <p>See <a href="#Zero_Day">Zero Day</a>.
         </p>
         </td>

Which seems correct to me.

So, to summarise:

  1. By default, . doesn't match newlines. To do that, you need /s.
  2. Beware of greedy quantifiers. Add ? (as in .*?) to make them non-greedy.
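Putting both fixes together, a minimal sketch might look like the following. It assumes the file has already been slurped into `$data`; here a heredoc stands in for it, holding the question's sample with the missing `</tr>` and `</table>` added. The `@pairs` array is just an illustrative name. This is meant to show the two regex points only; it is not a substitute for a real HTML parser.

```perl
use strict;
use warnings;

# Stand-in for the slurped file contents (assumption: closing tags added).
my $data = <<'HTML';
<table class="wikitable">
    <tr>
        <td><b>Term</b>
        </td>
        <td><b>Meaning</b>
        </td></tr>
    <tr>
        <td><span id="0-Day">0-Day</span>
        </td>
        <td>
        <p>See <a href="#Zero_Day">Zero Day</a>.
        </p>
        </td></tr>
</table>
HTML

my @pairs;
# /s lets "." cross newlines; .*? keeps each match inside a single row.
while ($data =~ /<tr>.*?(<td>.*?<\/td>).*?(<td>.*?<\/td>).*?<\/tr>/gs) {
    push @pairs, [$1, $2];
}

print scalar(@pairs), " rows found\n";   # 2 rows found
```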
Dave Cross