Collecting links from an HTML file and building a Perl hash

Question

I'm attempting to grab all the links from HTML files stored locally and build a hash. I'm using File::Find to get the HTML files, but have left that out of the code.

The first hash key will be the title, the second key the mirror, and the third key the part; the value is the URL, like this:

$hash{$title}{$mirror}{$part}=$url;

I can get the links that have single parts and single mirrors, but I'm not getting the multiple parts, and currently I'm stuck in a loop. I'm getting the mirror by pattern matching the URL, but how do I get the part if it exists, and otherwise set $part = "part_1"? I then need to move on to the next URL.

#!/usr/bin/perl

my $Html = qq(
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
    <html>
      <head>
      <meta http-equiv="content-type" content="text/html; charset=windows-1250">
      <meta name="generator" content="PSPad editor, www.pspad.com">
      <title>First hash key</title>
      </head>
      <body>
      <div>
       <br><b>Multi Links</b><br><br><!--colorstart:#FF0000-->

       <span style="color:#FF0000"><!--/colorstart--><b>Mirror 1</b><!--colorend--></span><!--/colorend-->
       <br><a href="http://mirror1.com/rvvaq1hi" target="_blank"><b>Part 1</b></a>
       <br><a href="http://mirror1.com/w33h9ym2" target="_blank"><b>Part 2</b></a>
       <br><a href="http://mirror1.com/fdnppn15" target="_blank"><b>Part 3</b></a></div>

        </div>
      <div>
      <br><b>Single link multiple mirrors</b><br>
      <br><a href="http://mirror1.com/t2wx9603" target="_blank"><!--colorstart:#FF0000--><span style="color:#FF0000"><!--/colorstart--><b>Mirror 1</b><!--colorend--></span><!--/colorend--></a></div>
        <br><a href="http://mirror2.com/t2wx9603" target="_blank"><!--colorstart:#FF0000--><span style="color:#FF0000"><!--/colorstart--><b>Mirror 2</b><!--colorend--></span><!--/colorend--></a></div>    

        </div>

      </body>
    </html>
);
my @html = split(/\n/, $Html);   # split needs a pattern, not a bare \n
    my $TheMain;
    my $Title;
    my @Names = qw(Mirror1 Mirror2 Mirror3);   # quoted list instead of barewords
    my %hash;

      foreach my $line (@html)
        {
        print "Da Line [$line]\n";
        if ($line =~ m{<title>(.*?)</title>} )
          {
           $Title = $1;
           print "$Title\n";
          }
         $line =~ s/\"/'/g;   # Double quotes to single
         $line=~ s{\n}{}g;  #remove \n
         $line=~ s{\s+}{ }g;#remove excessive spaces

          $TheMain = $TheMain . $line;
        }
        print "$TheMain\n";
     unless ($TheMain eq "") # unless empty enter the loop
       {
        while ($TheMain =~ m{a href=(.*?)/a}) 
        {
            my $A = $1;
            print "the A  $A\n";  ## stuck in a loop
            my ($url,$part);
            $A =~ s/<.*?color.*?>//ig;
            while ($A =~ m{\'(http.*?)\'.*?<b>(.*?)</b> }gi)
              {
               $url = $1;
               $part = $2;
               if ($part =~m/part/i)
                {
                  $part =~ s/ /_/;
                }
               else
                {
                  $part = "part_1";
                } 
              }

           foreach my $mirror (@Names)   # filters out unwanted links
            {
              if ($url =~/$mirror/i)
                {
                  $hash{$Title}{$mirror}{$part}=$url;
                }
            }
          }
        }

for my $Title (sort keys %hash) 
  {
    for my $Host  (sort keys %{$hash{$Title}})
      {

          for my $part (sort keys %{$hash{$Title}{$Host}})
            {

               my $url = $hash{$Title}{$Host}{$part};
               print "$Title,$url\n";
             } 
      }
     }

It is much better to parse and extract data from HTML using a dedicated HTML parser such as http://search.cpan.org/dist/HTML-Parser/Parser.pm - Bitwise, Sep 26 '12 at 13:41
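
A minimal sketch of what that comment suggests, assuming the CPAN module HTML::Parser is installed. The names collect_links and @mirrors are hypothetical, and the part keys are normalised to lowercase part_N, which differs slightly from the Part_1 spelling the posted script would produce. Comments such as the colorstart markers are never passed to the text handler, so they need no special stripping.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical list of host fragments to keep; other links are ignored.
my @mirrors = qw(mirror1 mirror2 mirror3);

# Returns a reference to $hash{$title}{$mirror}{$part} = $url.
sub collect_links {
    my ($html) = @_;
    my %hash;
    my $title = '';
    my ($url, $text, $in_title);

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag, $attr) = @_;
            if ($tag eq 'title') {
                $in_title = 1;
                $title    = '';
            }
            elsif ($tag eq 'a' && defined $attr->{href}) {
                $url  = $attr->{href};   # remember the link until </a>
                $text = '';
            }
        }, 'tagname, attr' ],
        text_h => [ sub {
            my ($t) = @_;
            $title .= $t if $in_title;
            $text  .= $t if defined $url;   # text inside the current <a>
        }, 'dtext' ],
        end_h => [ sub {
            my ($tag) = @_;
            if ($tag eq 'title') {
                $in_title = 0;
            }
            elsif ($tag eq 'a' && defined $url) {
                # "Part N" in the link text names the part; anything else is part_1.
                my $part = $text =~ /part\s*(\d+)/i ? "part_$1" : 'part_1';
                for my $mirror (@mirrors) {
                    $hash{$title}{$mirror}{$part} = $url if $url =~ /\Q$mirror\E/i;
                }
                undef $url;
            }
        }, 'tagname' ],
    );
    $parser->parse($html);
    $parser->eof;
    return \%hash;
}
```

Calling collect_links($Html) on the page from the question would return a hash reference that the existing nested for loops can walk after changing %hash to %$hash_ref. Two links that share the same title, mirror and part overwrite each other, so a page mixing multi-part and single-link sections for one mirror would need an extra key.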

score 0 · Accepted Answer · edited May 23 '17 at 11:43

See this comprehensive answer to the general question of "How do I parse HTML with regular expressions?"

RegEx match open tags except XHTML self-contained tags
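
Separately from the choice of parser, the hang reported in the question looks like a consequence of the outer while ($TheMain =~ m{...}) having no /g modifier: in scalar context a match without /g always starts from the beginning of the string, so the same first link matches on every pass. A small core-only sketch of the difference, where $main is a hypothetical stand-in for the flattened page and the pattern is simplified for illustration:

```perl
use strict;
use warnings;

# Hypothetical stand-in for the flattened page held in $TheMain.
my $main = q{<a href='http://mirror1.com/a'><b>Part 1</b></a>}
         . q{<a href='http://mirror1.com/b'><b>Part 2</b></a>};

my @found;
# With /g each iteration resumes where the previous match ended;
# without it this loop would never terminate.
while ($main =~ m{<a href='(http[^']*)'[^>]*>\s*<b>([^<]*)</b>}gi) {
    my ($url, $label) = ($1, $2);   # copy before running another match
    my $part = $label =~ /part\s*(\d+)/i ? "part_$1" : 'part_1';
    push @found, [ $part, $url ];
}

print "$_->[0] => $_->[1]\n" for @found;
```

This only shows why the loop advances; it stays as brittle against real-world markup as any regex approach.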

answered Sep 26 '12 at 14:07

Sue Mynott
