I am trying to extract the content of a text file between two tags and store it in another file.

I managed to read the input file into a multi-line string variable and then used a regexp successfully to capture what I want in a variable.

But I fail to write that variable to a file. I assume this is because the string contains multiple \n characters.

I would appreciate any help. (This is my first Perl script…)

For the test I use an index.html file, but it can be any text file.

Edit: solved, see the correction in the comments.

Here is my documented code:

# Extract string between two tags

use strict;
use warnings;

my $inputfile = "";
my $outputfile = "";

# Parse macro arguments
if(@ARGV < 2)
{
    print "Usage: perl Macro_name.pl <inputfile.HTML> <outfile.HTML>\n";
    exit(1);
}

$inputfile = $ARGV[0];
$outputfile = $ARGV[1];


my $body="";

# Convert input file to multiple line string #
$body = File_to_Var_Multi_Line($inputfile);

# First tag & Second tag match
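if ( $body =~ /(.*)<body(.*?)>(.*)<\/body>/s )
{
    # Error: "my" declares a new $body that exists only inside this block,
    # so the outer $body written to the file below is left unchanged.
    # Correction: declare another variable before the if and assign $3 to it.
    my $body = $3;

    # Print to check that the extraction worked
    print $body, "\n";
}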

# Write the matched multi-line string to the output file #
open(my $fh_body, '>:encoding(UTF-8)', $outputfile) 
or die "Could not open file '$outputfile' $!";

print $fh_body "$body\n";

close $fh_body;

# sub #
sub File_to_Var_Multi_Line
{
    if(@_ < 1)
    {
        print "Usage: line=File_to_Var_Multi_Line<file>\n";
        exit(1);
    }

    my $inputfile_2 = "";
    $inputfile_2 = $_[0];

    open(my $fl_in, '<:encoding(UTF-8)', $inputfile_2)
    or die "Could not open file '$inputfile_2' $!";

    my $line = "";

    # Append every row of the file to one multi-line string
    while (my $row_2 = <$fl_in>)
    {
        $line .= $row_2;
    }
    close $fl_in;
    return $line;
}

And the input test file :

<html>
<body>
<a href="page1.html">page 1</a><br>
<a href="page2.html">page 2</a><br>
<a href="page3.html">page 3</a><br>
<a href="page4.html">page 4</a><br>
<a href="page5.html">page 5</a><br>
</body>
</html>
Bin31
  • You don't really explain what goes wrong when writing to your output file, other than saying "I fail writing my variable to a file". Also, regex isn't all that great at parsing HTML; you might want to use an HTML parser. – Chris Doyle Jan 29 '15 at 15:29
  • I fixed it, and your comment helped me by asking what was wrong. I had declared a new local $body inside the "if" although one already existed outside, so my final print wrote the outer variable, not the one assigned inside the "if". – Bin31 Jan 29 '15 at 18:00
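
The correction described in the comment above can be sketched as follows. This is only a minimal illustration, not the asker's final script: the variable name $extracted and the inline sample text are assumptions, and the regexp is reduced to a single capture group.

```perl
use strict;
use warnings;

# Stand-in for the slurped file content (sample data for illustration only).
my $body = <<'END_HTML';
<html>
<body>
<a href="page1.html">page 1</a><br>
</body>
</html>
END_HTML

# Declared outside the if block, so it is still visible afterwards.
my $extracted = "";

if ( $body =~ /<body.*?>(.*)<\/body>/s ) {
    # Plain assignment without "my": no new, block-local variable is created.
    $extracted = $1;
}

# $extracted now holds the text between the tags and can be written to a file.
print $extracted;
```

Because $extracted is declared before the if, the later print to the output filehandle sees the captured text rather than the untouched file content.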

1 Answer

Notwithstanding "RegEx match open tags except XHTML self-contained tags":

You may find the range operator useful for iterating through a file.

For example:

while ( <$fl_in> ) {
    if ( m,<BODY>,i .. m,</BODY>,i ) { 
        print;
    }
}

The condition is true while you're within the 'body' tags. (It is line-oriented, though, so any other text on the same lines as the tags will be 'caught' as well.)
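
To try the idea without a file on disk, here is a small self-contained sketch. It assumes each tag sits on its own line, reads from an in-memory filehandle, and additionally skips the two tag lines themselves; the sample text and the @kept array are made up for illustration.

```perl
use strict;
use warnings;

# Sample input (illustrative only).
my $sample = <<'END_HTML';
<html>
<body>
<a href="page1.html">page 1</a><br>
<a href="page2.html">page 2</a><br>
</body>
</html>
END_HTML

# Open a read handle on the string instead of a file on disk.
open( my $fl_in, '<', \$sample ) or die "Could not open in-memory handle: $!";

my @kept;
while ( my $line = <$fl_in> ) {
    # True from the line matching <body> through the line matching </body>.
    if ( $line =~ m,<body>,i .. $line =~ m,</body>,i ) {
        # Skip the delimiter lines so only the content in between is kept.
        next if $line =~ m,</?body>,i;
        push @kept, $line;
    }
}
close $fl_in;

print @kept;
```

With a real file, the same loop works on the handle returned by open for $inputfile, and the kept lines can be printed to the output handle instead of being collected.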

Sobrique
  • I used an HTML file for practice, but the files I really need to parse are not HTML. I need an independent method. I will have a look at the range operator, which could be better than a regexp. Thanks. – Bin31 Jan 29 '15 at 16:03
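
For files that are not HTML, one possible format-independent approach is to pass the two markers in as plain strings. The sketch below assumes the real markers are fixed literal strings; the helper name extract_between and the [BEGIN]/[END] markers are hypothetical and only serve the example.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text between the first $start marker and
# the next $end marker, or undef when either marker is missing.
sub extract_between {
    my ( $text, $start, $end ) = @_;

    # \Q...\E quotes regex metacharacters, so markers such as "[BEGIN]"
    # are matched literally; /s lets "." match newlines as well.
    return $text =~ /\Q$start\E(.*?)\Q$end\E/s ? $1 : undef;
}

# Sample text with made-up markers.
my $text  = "header\n[BEGIN]\nline 1\nline 2\n[END]\nfooter\n";
my $inner = extract_between( $text, '[BEGIN]', '[END]' );

print defined $inner ? $inner : "no match\n";
```

The non-greedy (.*?) stops at the first end marker after the start marker, which may or may not be what is wanted when a file contains several marked sections.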