I'm trying to extract all the links from locally stored HTML files and build a hash from them. I'm using File::Find to locate the HTML files, but I've left that out of the code below.

  1. The first hash key is the title
  2. The second key is the mirror
  3. The third key is the part, and the value is the URL

For example:

$hash{$title}{$mirror}{$part}=$url;

I can get the links that have a single part and a single mirror, but I'm not getting the links with multiple parts, and the code is currently stuck in a loop. I get the mirror by pattern matching the URL, but how do I get the part if it exists, and otherwise set $part = "part_1"? I then need to move on to the next URL.

#!/usr/bin/perl

my $Html = qq(
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
    <html>
      <head>
      <meta http-equiv="content-type" content="text/html; charset=windows-1250">
      <meta name="generator" content="PSPad editor, www.pspad.com">
      <title>First hash key</title>
      </head>
      <body>
      <div>
       <br><b>Multi Links</b><br><br><!--colorstart:#FF0000-->

       <span style="color:#FF0000"><!--/colorstart--><b>Mirror 1</b><!--colorend--></span><!--/colorend-->
       <br><a href="http://mirror1.com/rvvaq1hi" target="_blank"><b>Part 1</b></a>
       <br><a href="http://mirror1.com/w33h9ym2" target="_blank"><b>Part 2</b></a>
       <br><a href="http://mirror1.com/fdnppn15" target="_blank"><b>Part 3</b></a></div>

        </div>
      <div>
      <br><b>Single link multiple mirrors</b><br>
      <br><a href="http://mirror1.com/t2wx9603" target="_blank"><!--colorstart:#FF0000--><span style="color:#FF0000"><!--/colorstart--><b>Mirror 1</b><!--colorend--></span><!--/colorend--></a></div>
        <br><a href="http://mirror2.com/t2wx9603" target="_blank"><!--colorstart:#FF0000--><span style="color:#FF0000"><!--/colorstart--><b>Mirror 2</b><!--colorend--></span><!--/colorend--></a></div>    

        </div>

      </body>
    </html>
);
my @html = split(/\n/, $Html);
    my $TheMain;
    my $Title;
    my @Names = qw(Mirror1 Mirror2 Mirror3);
    my %hash;

      foreach my $line (@html)
        {
        print "Da Line [$line]\n";
        if ($line =~ m{<title>(.*?)</title>} )
          {
           $Title = $1;
           print "$Title\n";
          }
         $line =~ s/\"/'/g;   # Double quotes to single
         $line=~ s{\n}{}g;  #remove \n
         $line=~ s{\s+}{ }g;#remove excessive spaces

          $TheMain = $TheMain . $line;
        }
        print "$TheMain\n";
     unless ($TheMain eq "") # unless empty enter the loop
       {
        while ($TheMain =~ m{a href=(.*?)/a}) 
        {
            my $A = $1;
            print "the A  $A\n";  ## stuck in a loop
            my ($url,$part);
            $A =~ s/<.*?color.*?>//ig;
            while ($A =~ m{\'(http.*?)\'.*?<b>(.*?)</b> }gi)
              {
               $url = $1;
               $part = $2;
               if ($part =~m/part/i)
                {
                  $part =~ s/ /_/;
                }
               else
                {
                  $part = "part_1";
                } 
              }

           foreach my $mirror (@Names)   # filters out unwanted links
            {
              if ($url =~/$mirror/i)
                {
                  $hash{$Title}{$mirror}{$part}=$url;
                }
            }
          }
        }

for my $Title (sort keys %hash) 
  {
    for my $Host  (sort keys %{$hash{$Title}})
      {

          for my $part (sort keys %{$hash{$Title}{$Host}})
            {

               my $url = $hash{$Title}{$Host}{$part};
               print "$Title,$url\n";
             } 
      }
     }    
Holly

1 Answer

See this comprehensive answer to the general question "How do I parse HTML with regular expressions?":

RegEx match open tags except XHTML self-contained tags
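
If a regex is used anyway for a quick script on HTML of a known, simple shape, the hang in the question is consistent with the outer `while ($TheMain =~ m{...})` loop having no `/g` modifier: without it, every iteration matches the first link again and the loop never advances. The following is only a minimal sketch under stated assumptions, not a robust parser: the trimmed sample HTML, the lowercase mirror names, and the lowercased `part_N` key format are all illustrative choices, and a real HTML parser (for example HTML::TreeBuilder from CPAN) would cope better with markup that deviates from this shape.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical, trimmed-down sample in the shape of the question's HTML.
my $html = <<'HTML';
<title>First hash key</title>
<a href="http://mirror1.com/rvvaq1hi" target="_blank"><b>Part 1</b></a>
<a href="http://mirror1.com/w33h9ym2" target="_blank"><b>Part 2</b></a>
<a href="http://mirror2.com/t2wx9603" target="_blank"><span><b>Mirror 2</b></span></a>
HTML

# Assumed list of wanted hosts; anything else is filtered out.
my @mirrors = qw(mirror1 mirror2 mirror3);
my %hash;

my ($title) = $html =~ m{<title>(.*?)</title>}is;

# /g makes each iteration resume where the previous match ended,
# so the loop visits every link once and then terminates.
while ($html =~ m{<a\s+href=(["'])(http.*?)\1[^>]*>(.*?)</a>}gis) {
    my ($url, $text) = ($2, $3);
    $text =~ s/<[^>]*>//g;    # strip nested tags, leaving the link text

    # Use the part number if the link text names one, else default to part_1.
    my $part = $text =~ /(part)\s*(\d+)/i ? lc("$1_$2") : 'part_1';

    for my $mirror (@mirrors) {
        $hash{$title}{$mirror}{$part} = $url if $url =~ /\Q$mirror\E/i;
    }
}

for my $mirror (sort keys %{ $hash{$title} }) {
    for my $part (sort keys %{ $hash{$title}{$mirror} }) {
        print "$title,$mirror,$part,$hash{$title}{$mirror}{$part}\n";
    }
}
```

Note that with this key layout, two links on the same page that share a mirror and a part overwrite each other, so a page mixing several download groups may need an extra key level.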

Sue Mynott