Hello, fine Stack Overflow folks. I'm trying to get a Perl array of the files to which an HTML file links. I'm still pretty new to Perl and largely unfamiliar with HTML, so please bear with me. Some of the links are marked with an asterisk (*), outside of the link text, indicating that the file is regularly updated. I only want to extract links to files which are regularly updated. The HTML file looks like this:
<tr>
<td height="34" nowrap width="170">
<a href="/Files/link1.pdf">Link 1</a>*</td>
</tr>
<!--
<tr>
<td height="34" nowrap width="170">
<a href="/Files/link2.pdf">Link 2</a>*</td>
</tr>
-->
<tr>
<td height="34" nowrap width="170">
<a href="/Files/link3.pdf">Link 3</a>
*</td>
</tr>
<tr>
<td height="34" nowrap width="170">
<a href="/Files/link4.pdf">Link 4</a></td>
</tr>
So what I want in my array is the URL for links 1 and 3, which are marked as updating with an asterisk, but not 2 because it's in a comment and not 4 because it has no asterisk. I tried the following based on the accepted answer to this question:
use strict;
use warnings;
use WWW::Mechanize;
my $page = "file://server/web/site.htm";
my $mech = WWW::Mechanize->new();
$mech->get($page);
my @links = $mech->links();
my @urls;
for my $lnk (@links) {
    push(@urls, $lnk->url);
}
I still get link #2 even though it's in a comment. Also, I'm not sure where to begin with pushing only the asterisked links, especially since the asterisk for link #3 is on a new line. I originally tried this using regular expressions and without using WWW::Mechanize, but I was unable to get the asterisk on the next line.
use strict;
use warnings;
my $html = do {
    local $/ = undef;
    open(my $fh, "<", "file") || die $!;
    <$fh>;
};
$html =~ s/(<!--)+.*(-->)+//;
my @urls = ($html =~ /\bhref[ ]?=[ ]?"([^"]+)".*\*/gc);
This gets links 1 and 2, but not 3. Link 2 is still picked up even though it sits inside a comment, apparently because my find-and-replace regex isn't working as I expect it to.
So how do I get only the starred links and skip the commented ones? I'm open to any ideas at all; perhaps my approach from the get-go was incorrect. Any help, insight, or direction would be fantastic. Thank you all so much!
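For reference, here is a minimal sketch of how the regex attempt might be repaired. It is not a definitive solution. It rests on two facts about Perl regexes: without the /s modifier, . does not match a newline, so neither the multi-line comment nor the asterisk on the line after link 3 can be reached; and without /g, a substitution runs only once. The sample markup is inlined from the question so the sketch is self-contained, and the variable name $visible is only illustrative.

```perl
use strict;
use warnings;

# Sample markup copied from the question, inlined instead of read from a file.
my $html = <<'HTML';
<tr>
<td height="34" nowrap width="170">
<a href="/Files/link1.pdf">Link 1</a>*</td>
</tr>
<!--
<tr>
<td height="34" nowrap width="170">
<a href="/Files/link2.pdf">Link 2</a>*</td>
</tr>
-->
<tr>
<td height="34" nowrap width="170">
<a href="/Files/link3.pdf">Link 3</a>
*</td>
</tr>
<tr>
<td height="34" nowrap width="170">
<a href="/Files/link4.pdf">Link 4</a></td>
</tr>
HTML

# Strip comments: /s lets . cross newlines, .*? stops at the first -->,
# and /g repeats the substitution for every comment in the document.
(my $visible = $html) =~ s/<!--.*?-->//gs;

# Capture an href only when its closing </a> is followed by optional
# whitespace (which may include a newline) and then an asterisk.
my @urls = $visible =~ m{<a\s[^>]*\bhref\s*=\s*"([^"]+)"[^>]*>[^<]*</a>\s*\*}g;

print "$_\n" for @urls;
```

This assumes the link text contains no nested tags and that attribute values are double-quoted, as in the sample. Regexes are fragile on real-world HTML, so a proper HTML parser is probably the more robust route if the markup varies.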