fine Stack Overflow folks. I'm trying to get a Perl array of the files to which an HTML file links. I'm still pretty new to Perl and I'm largely unfamiliar with HTML, so please bear with me. Some of the files are marked with an asterisk (*), outside of the link text, indicating that the file is regularly updated. I only want to extract links to files which are regularly updated. The HTML file looks like this:

<tr>
    <td height="34" nowrap width="170">
    <a href="/Files/link1.pdf">Link 1</a>*</td>
</tr>

<!--
<tr>
    <td height="34" nowrap width="170">
    <a href="/Files/link2.pdf">Link 2</a>*</td>
</tr>
-->

<tr>
    <td height="34" nowrap width="170">
    <a href="/Files/link3.pdf">Link 3</a>
    *</td>
</tr>

<tr>
    <td height="34" nowrap width="170">
    <a href="/Files/link4.pdf">Link 4</a></td>
</tr>

So what I want in my array is the URL for links 1 and 3, which are marked as updating with an asterisk, but not 2 because it's in a comment and not 4 because it has no asterisk. I tried the following based on the accepted answer to this question:

use strict;
use warnings;
use WWW::Mechanize;

my $page = "file://server/web/site.htm";

my $mech = WWW::Mechanize->new();
$mech->get($page);

my @links = $mech->links();
my @urls;

for my $lnk (@links) {
    push(@urls, $lnk->url);
}

I still get link #2 even though it's in a comment. Also, I'm not sure where to begin with only pushing the asterisked links, especially since the asterisk for link #3 is on a new line. I originally tried this using regular expressions and without using WWW::Mechanize, but I was unable to get the asterisk on the next line.

use strict;
use warnings;

my $html = do {
    local $/ = undef;
    open(my $fh, "<", "file") || die $!;
    <$fh>;
};

$html =~ s/(<!--)+.*(-->)+//;

my @urls = ($html =~ /\bhref[ ]?=[ ]?"([^"]+)".*\*/gc);

This will get links 1 and 2, but not 3. This gets links within comments because apparently my find and replace regex isn't working as I expect it to.

So how do I get only the starred links and skip the commented ones? I'm open to any ideas at all; perhaps my approach was wrong from the get-go. Any help, insight, or direction would be fantastic. Thank you all so much!

Alex A.
  • [Regex is probably not the way to go](http://stackoverflow.com/a/1732454/176646)...use a [real HTML parser](http://search.cpan.org/dist/HTML-Parser/Parser.pm). – ThisSuitIsBlackNot Dec 20 '13 at 20:16
  • The pony, it comes... – everton Dec 20 '13 at 20:21
  • CPAN has a fairly complete [HTML parser](http://search.cpan.org/dist/HTML-Parser/Parser.pm). Use it. – Jim Garrison Dec 20 '13 at 20:21
  • @ThisSuitIsBlackNot: The answer on the page about using regex to parse HTML given by Kaitlin Duck Sherwood describes my situation more accurately. I am not parsing arbitrary HTML; I have a limited, known set of HTML that is formatted as in my example. The accepted answer to this question works perfectly in my situation. – Alex A. Dec 20 '13 at 20:42
  • Fair enough. However, that is probably not the case for many of the users who will land on this page in the future. They deserve a warning about the possible pitfalls of the regex approach (although the specific answer I linked to is merely a humorous expression of that). – ThisSuitIsBlackNot Dec 20 '13 at 20:45
  • @ThisSuitIsBlackNot: Agreed. +1 for pointing that out. Can you provide an answer that uses an alternative method suitable for more widely applicable cases? I have had no luck using HTML::Parser so far. – Alex A. Dec 20 '13 at 21:37
  • I will work on an answer, although I probably won't be able to post until next week. – ThisSuitIsBlackNot Dec 20 '13 at 22:11
  • @ThisSuitIsBlackNot: That sounds great, but you don't have to do the work for me. :) If you can provide any sort of outline of how HTML::Parser would be used then I can develop it from there and post a complete answer once I'm finished. – Alex A. Dec 20 '13 at 22:22

2 Answers

In my example, an asterisk denotes a file that is regularly updated and the asterisks live within the td tags. I have determined how to extract these files using HTML::TokeParser.

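use strict;
use warnings;
use HTML::TokeParser;

# new() returns undef if the file cannot be opened
my $html = HTML::TokeParser->new("file.html") || die "Can't open file.html: $!";

my @urls;

while (my $td = $html->get_tag("td")) {
    # Fetch the link before the cell text: get_trimmed_text("/td") reads
    # past the <a> tag, so a later get_tag("a") would return the link in
    # the next cell instead. Stopping at "/td" skips cells with no link.
    my $tag = $html->get_tag("a", "/td");
    next unless $tag && $tag->[0] eq "a";
    my $url = $tag->[1]{href};

    # Remaining text of the cell, e.g. "Link 1*" or "Link 3 *"
    my $txt = $html->get_trimmed_text("/td");
    push(@urls, $url) if defined $url && $txt =~ /\*/;
}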

Thank you to @sabujhassan for your working solution and thank you @ThisSuitIsBlackNot for encouraging me to pursue a more generally applicable solution.

Alex A.

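Based on your example, this should work. Your substitution fails because, without the /s modifier, . does not match newlines, so a comment that spans several lines is never removed. The non-greedy .*? stops each match at the first -->, /g removes every comment rather than just the first, and \s* allows the asterisk to sit on the line after the closing </a>.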

$html =~ s/<!--.*?-->//sg;
my @urls = ($html =~ /\bhref\s*=\s*"([^"]*)"[^>]*>[^<]*<\/a>\s*\*/sg);
## my @urls = ($html =~ /<a\s+[^>]*href\s*=\s*"([^"]*)"[^>]*>[^<]*<\/a>\s*\*/sg);
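
To check this against the sample from the question without touching a file, here is a minimal self-contained sketch. The helper name starred_links and the $sample variable are made up for illustration, and it assumes the real HTML is formatted like the example (double-quoted href values, no nested tags inside the link text).

```perl
use strict;
use warnings;

# Hypothetical helper (name invented for this sketch): returns the href of
# every link that is directly followed by an asterisk, ignoring comments.
sub starred_links {
    my ($html) = @_;

    # Remove each HTML comment separately, including multi-line ones.
    $html =~ s/<!--.*?-->//sg;

    # Capture the href, then require </a>, optional whitespace, and "*".
    return $html =~ /\bhref\s*=\s*"([^"]*)"[^>]*>[^<]*<\/a>\s*\*/g;
}

# Sample markup copied from the question.
my $sample = <<'HTML';
<tr>
    <td height="34" nowrap width="170">
    <a href="/Files/link1.pdf">Link 1</a>*</td>
</tr>

<!--
<tr>
    <td height="34" nowrap width="170">
    <a href="/Files/link2.pdf">Link 2</a>*</td>
</tr>
-->

<tr>
    <td height="34" nowrap width="170">
    <a href="/Files/link3.pdf">Link 3</a>
    *</td>
</tr>

<tr>
    <td height="34" nowrap width="170">
    <a href="/Files/link4.pdf">Link 4</a></td>
</tr>
HTML

my @urls = starred_links($sample);
print "$_\n" for @urls;   # prints /Files/link1.pdf then /Files/link3.pdf
```

Link 2 disappears with its comment and link 4 is skipped because nothing but the closing tag follows it. If the markup ever deviates from this shape (single quotes, images inside links, an asterisk separated by other tags), a parser-based approach is the safer choice.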
Sabuj Hassan