
I'm trying to parse an HTML file, and I have a regular expression that captures patterns inside all p tags. For some reason it only prints the first match it finds.

my @newH2Array = ("Part I", "Part II", "Part III");
my $linenumber = 0;
while (my $line = <$parser>){
    chomp $line;
    $linenumber++;
    if($line =~ /^<p>/){
        if($line =~ /(Part [IVX]+)/gi) {
            if (grep{ lc $_ eq lc $1 } @newH2Array){
                print "found a hit <" . $1 . "> that matches array element on line " . $linenumber . "\n";
            }
        }
    }
}

When I run it against the test line below, it only prints Part I and not the other two. When I switched the inner if statement to a while loop, it didn't work either. Can anyone tell me what I'm doing wrong here?

<p>Part I should be found. Part II should be found also. Part III should be found.</p>

The result should be:

found a hit <Part I> that matches array element on line 1
found a hit <Part II> that matches array element on line 1
found a hit <Part III> that matches array element on line 1
Eric
  • Is there a reason you are not using an HTML parser? [For your own good, you shouldn't parse XML with regex](https://stackoverflow.com/q/1732348/1331451). – simbabque Sep 29 '17 at 13:45
  • Because I'm doing a lot of edits, and I'm actually replacing the < with Unicode characters so I can display the full code with highlighting once the script finishes. An HTML parser would not be able to read it once I'm done modifying it. – Eric Sep 29 '17 at 13:46
  • It shouldn't have to read it once you're done with it though, only at the start when it's more-or-less valid HTML. That doesn't mean you would have to produce valid HTML as an output. – Aaron Sep 29 '17 at 13:50
  • Your code is missing `$linenumber`. How does that start? `0` or `1`? `@newH2Array` is also missing. Please [edit] and provide a [mcve]. Also, you do not need to escape angle brackets `<>` in patterns, they do not have special meaning. – simbabque Sep 29 '17 at 13:51
  • @simbabque I edited the post with more information. – Eric Sep 29 '17 at 14:03
  • You do realize that _Part_ doesn't match _Chapter_? :D – simbabque Sep 29 '17 at 14:03
  • @simbabque Oops, here's the full regex I'm using in my code: `([a-z]+)\s(part \d+[a-z]?\.?[\d+]?|part [IVX]+|annex [a-z]+|appendix [A-Z\d][\-\.]?[\d+]?)([\w\s\-]+)`. Part is covered in this case. I was typing up a copy of the code. – Eric Sep 29 '17 at 14:05

2 Answers


An if statement is a binary choice: the condition either matches or it doesn't, and the body runs at most once. To visit every match you need a looping construct, like while.

I've also used say() instead of print(), used Perl's built-in $. instead of $linenumber, and interpolated variables in strings.

Oh, and switched to <DATA> to make it easy to test.

#!/usr/bin/perl

use strict;
use warnings;
use feature 'say';

my @newH2Array = ("Part I", "Part II", "Part III");
while (my $line = <DATA>){
    chomp $line;
    if ($line =~ /^<p>/){
        while ($line =~ /(Part [IVX]+)/gi) {
            if (grep{ lc $_ eq lc $1 } @newH2Array){
                say "found a hit <$1> that matches array element on line $.";
            }
        }
    }
}

__DATA__
<p>Part I should be found. Part II should be found also. Part III should be found.</p>
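
To make the difference between if and while concrete, here is a minimal sketch (not part of the code above; the names @if_hits and @while_hits are made up for illustration). In scalar context a /g match finds one match per evaluation and remembers where it stopped in pos($line), so an if sees only the first hit, while a while loop keeps resuming until the match fails.

```perl
use strict;
use warnings;

my $line = '<p>Part I should be found. Part II should be found also. Part III should be found.</p>';

# An `if` evaluates the match exactly once, so only the first hit is seen.
my @if_hits;
if ($line =~ /(Part [IVX]+)/gi) {
    push @if_hits, $1;
}

# The /g match above left pos($line) just after "Part I".
# Rewind it so the loop below starts from the beginning again.
pos($line) = 0;

# A `while` re-evaluates the match; each scalar-context /g match
# resumes from pos($line), where the previous one stopped.
my @while_hits;
while ($line =~ /(Part [IVX]+)/gi) {
    push @while_hits, $1;
}

print "if:    @if_hits\n";
print "while: @while_hits\n";
```

Without the pos($line) = 0 line, the while loop would pick up after the first match and report only Part II and Part III.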
Dave Cross
  • So funny how our answers are again almost identical. So much for TIMTOWTDI. :) – simbabque Sep 29 '17 at 14:11
  • @simbabque: "Great minds think alike" or perhaps "Fools seldom differ" :-) – Dave Cross Sep 29 '17 at 14:12
  • Okay, so I copied and pasted your code into a new .pl script to test, and it works perfectly. When I replaced the regex with mine, it started failing, so I think my issue is with the regex. Also, thanks for letting me know about Perl's built-in $. variable. I started learning Perl and regex this week, so there's still a lot to learn. – Eric Sep 29 '17 at 15:07
  • Yeah, I got it working. I definitely do need the while loop; I took it out because I assumed the g flag might do it, but I guess not. And the regex I had was incorrect; after I fixed it I was able to get all the patterns. Thanks guys, really appreciate the help. – Eric Sep 29 '17 at 15:17

You are using a /g match, but because of the if you are only taking the first of its matches. You need to iterate over all the matches. One way to do that is with a while loop.

my @newH2Array = ("Part I", "Part II", "Part III", "Part X");

while (my $line = <DATA>){
    chomp $line;

    if($line =~ /^<p>/){
        while ($line =~ /(Part [IVX]+)/gi) {
            if (grep{ lc $_ eq lc $1 } @newH2Array){
                print "found a hit <$1> that matches array element on line $.\n";
            }
        }
    }
}

__DATA__
<p>Part I should be found. Part II should be found also. Part III should be found.</p>
<p>Part X should be found. Particles are fun.</p>

Note that I removed $linenumber. You can just use $., which is always the current line number of the last filehandle read.

Here's the output.

found a hit <Part I> that matches array element on line 1
found a hit <Part II> that matches array element on line 1
found a hit <Part III> that matches array element on line 1
found a hit <Part X> that matches array element on line 2
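
The while loop is only one way to iterate. As an alternative sketch (an illustration, not the code above), a /g match in list context returns every capture at once, which can then be filtered. The @lines array here is a hypothetical stand-in for reading from the filehandle, and $line_number is counted by hand because $. is only set when reading from a filehandle.

```perl
use strict;
use warnings;

my @newH2Array = ("Part I", "Part II", "Part III", "Part X");

# Hypothetical stand-in for the lines read from the file.
my @lines = (
    '<p>Part I should be found. Part II should be found also. Part III should be found.</p>',
    '<p>Part X should be found. Particles are fun.</p>',
);

my @report;
my $line_number = 0;
for my $line (@lines) {
    $line_number++;
    next unless $line =~ /^<p>/;

    # List context: the /g match returns all captures in one go.
    my @candidates = $line =~ /(Part [IVX]+)/gi;

    for my $candidate (@candidates) {
        # Case-insensitive membership test, same idea as the grep above.
        if (grep { lc($_) eq lc($candidate) } @newH2Array) {
            push @report, "found a hit <$candidate> that matches array element on line $line_number";
        }
    }
}

print "$_\n" for @report;
```

This should print the same four lines as the output above. The list-context form is convenient when you want all the matches collected first; the while loop is the better fit when you need pos() or want to stop early.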
simbabque