My problem: I'm trying to read an HTML file (index.html), find all the links in it, and put them in a second file named salida.html. I read this answer and this answer and tried to follow them, but it didn't work for me. This is my Perl code:

use strict;
use warnings;
use 5.010;
use Tie::File;

my $entrada='index.html';
my $salida='salida.html';
open(A,"<$entrada");
my @links;  
foreach my $linea (<A>){
    print "Renglon => $linea\n" if $linea =~ m/a href/;
    #print $B $linea if $linea =~ m/a href/;
    push @links, $linea if $linea =~ m/a href/;
}

tie my @resultado, 'Tie::File', 'salida.html' or die "Nelson";
for (@resultado) {
    if ($_ =~ m/<main class="contenido">/){
        foreach my $found (@links){
            $_ .= '<br/>'.$found;
        }
        last;
    }
}
close(A);

My Perl code runs without errors. In the for loop I'm trying to write the links stored in my @links array into a specific part of my salida.html file:

<!DOCTYPE html>
<html lang="es-mx">

<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <meta http-equiv="X-UA-Compatible" content="ie=edge">
    <title>Resultados de la busqueda</title>
    <link rel="stylesheet" href="style-salida.css">
</head>

<body>
    <div class="contenedor">
        <header class="header">
            <h2>Resultados de la busqueda</h2>
        </header>
        <main class="contenido">

        </main>
        <footer class="footer">
            <h4>
                Gerardo Saucedo Arevalo - 15092087 - Topicos selectos de tecnologias web - Búsqueda de enlaces dentro de
                una página web
            </h4>
        </footer>
    </div>
</body>

</html>

But my code always adds the lines at the end of the file. I ran this code once and it worked perfectly, but then I added some lines, and when I tried to run it again it didn't work. I restored my file to the state in which it worked, but it still doesn't work. What am I doing wrong?

Stefan Becker
  • Your example seems incomplete: `index.html` is missing (I guess the included HTML is `salida.html`). Furthermore, always use an HTML parser, e.g. [HTML::TreeBuilder](https://metacpan.org/pod/HTML::TreeBuilder), to parse the HTML and then operate on the DOM instead. – Stefan Becker Mar 18 '19 at 07:07
  • FYI: [never use a regex to parse HTML/XML/...](https://stackoverflow.com/questions/1732348#1732454) – Stefan Becker Mar 18 '19 at 07:50

1 Answer

Always process HTML or XML with an appropriate parser and then implement your processing on the DOM. My solution uses HTML::TreeBuilder. As your question doesn't include the contents of index.html, I have appended my own sample input to the solution:

#!/usr/bin/perl
use warnings;
use strict;

use HTML::TreeBuilder;

# Extract links from <DATA>
my $root1 = HTML::TreeBuilder->new->parse_file(\*DATA)
    or die "HTML: $!\n";

my @links = $root1->look_down(_tag => 'a');

# Process salida.html from STDIN
my $root2 = HTML::TreeBuilder->new;
$root2->ignore_unknown(0);
$root2->parse_file(\*STDIN)
    or die "HTML: $!\n";

# insert links in correct section
if (my @nodes = $root2->look_down(class => 'contenido')) {
    $nodes[0]->push_content(@links);
}

print $root2->as_HTML(undef, '  '), "\n";

# IMPORTANT: must delete manually
$root2->delete;
$root1->delete;

exit 0;

__DATA__
<!DOCTYPE html>
<html>
  <head>
    <title>test</title>
  </head>
  <body>
    <div>
      <a href="link1.html">Link 1</a>
      <a href="link2.html">Link 2</a>
    </div>
  </body>
</html>

Test run:

$ perl dummy.pl <dummy.html
<!DOCTYPE html>
<html lang="es-mx">
...
 <main class="contenido"> <a href="link1.html">Link 1</a><a href="link2.html">Link 2</a></main> 
...
</html>
Stefan Becker
  • That works fine for me. Now I'm trying to save the result in the salida.html file. I opened the file with `open (RES,">$salida") || die ($!);` and overwrite it with `print RES $res;`, where `$res= $root2->as_HTML(undef, ' '), "\n";`. Am I doing it right? – KillerBee GSA Mar 18 '19 at 18:51
  • Then you can't read the file from STDIN. Instead, read from the file directly with `->parse_file("salida.html")` (the same method also accepts file names). Please do not use the insecure 2-parameter open and old-style filehandles; use `open(my $ofh, '>', $salida); print $ofh $root2->as_HTML(); close($ofh);` instead. – Stefan Becker Mar 18 '19 at 20:54
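
Putting the answer and the last comment together, the whole round trip (read index.html, update salida.html in place) could look roughly like the sketch below. It is only a sketch under stated assumptions: `insert_links` is a hypothetical helper name that does not appear in the answer, and both files are assumed to be UTF-8 encoded, which is why handles opened with an explicit encoding layer are passed to `parse_file` instead of bare file names. The HTML::TreeBuilder calls are the same ones the answer uses.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder;

# Hypothetical helper (not part of the answer): moves every <a> element
# found in $source into the first element of $target whose class
# attribute equals $class, then rewrites $target in place.
# Assumption: both files are UTF-8 encoded.
sub insert_links {
    my ($source, $target, $class) = @_;

    # Parse the file that contains the links
    open(my $src_fh, '<:encoding(UTF-8)', $source)
        or die "Cannot read $source: $!\n";
    my $src = HTML::TreeBuilder->new;
    $src->parse_file($src_fh) or die "Cannot parse $source\n";
    close($src_fh);
    my @links = $src->look_down(_tag => 'a');
    my $count = scalar @links;

    # Parse the file that will receive the links
    open(my $dst_fh, '<:encoding(UTF-8)', $target)
        or die "Cannot read $target: $!\n";
    my $dst = HTML::TreeBuilder->new;
    $dst->ignore_unknown(0);    # keep elements such as <main>
    $dst->parse_file($dst_fh) or die "Cannot parse $target\n";
    close($dst_fh);

    my ($node) = $dst->look_down(class => $class)
        or die "No element with class '$class' in $target\n";
    $node->push_content(@links);

    # Serialize before reopening the file for writing, so a failure
    # above never leaves a truncated file behind
    my $html = $dst->as_HTML(undef, '  ') . "\n";
    open(my $ofh, '>:encoding(UTF-8)', $target)
        or die "Cannot write $target: $!\n";
    print $ofh $html;
    close($ofh) or die "Cannot close $target: $!\n";

    $dst->delete;
    $src->delete;
    return $count;
}

# Usage: perl enlaces.pl index.html salida.html
insert_links($ARGV[0], $ARGV[1], 'contenido') if @ARGV == 2;
```

The script name `enlaces.pl` is also just a placeholder. Note that `push_content` moves the `<a>` elements out of the first tree rather than copying them, which is fine here because the first tree is discarded afterwards.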