
I am trying to extract the values between two delimiters in Perl using a regex. I open a file and read it line by line, calling chomp on each line. Here is an example of the file:

"This is <tag> an </tag> example
of the <tag> file </tag> that I
am <tag> trying </tag> to <tag> parse </tag>"

I am able to get the first couple of words: "an", "file", but on the third line I can only get "trying" and not "parse". This is the code I am trying to use:

while (chomp($line = <$filename>)){
   ($tag) = $line =~ m/<tag>(.*?)<\/tag>/;
   push(@tagarray, $tag);
}

I suspect this has something to do with chomp but don't see how to parse the file differently.

Natalie
  • I normally use [HTML::TreeBuilder](http://search.cpan.org/~kentnl/HTML-Tree-5.07/lib/HTML/TreeBuilder.pm) (for HTML) – zdim Nov 07 '17 at 17:11
  • If you're processing HTML or XML then you should use a library specifically for that purpose, rather than trying to create your own using regex patterns. – Borodin Nov 07 '17 at 18:32

2 Answers


Add the /g modifier and assign the match in list context so that it returns every capture on the line, not just the first:

my @tags = $line =~ m/<tag>(.*?)<\/tag>/g;
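
To show how this fits into the original loop, here is a minimal sketch. It assumes the sample text from the question and reads it through an in-memory filehandle (the `$text` and `$fh` names are illustrative, not from the original code). It also moves chomp out of the loop condition: chomp returns the number of characters it removed, so using it as the condition would end the loop early on a final line that has no trailing newline. The added `\s*` trims the spaces around each captured word, which is an optional extra.

```perl
use strict;
use warnings;

# Sample input from the question, read through an in-memory filehandle
my $text = <<'EOF';
This is <tag> an </tag> example
of the <tag> file </tag> that I
am <tag> trying </tag> to <tag> parse </tag>
EOF

open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

my @tagarray;
while (my $line = <$fh>) {
    chomp $line;
    # In list context, /g returns every capture on the line
    push @tagarray, $line =~ m{<tag>\s*(.*?)\s*</tag>}g;
}
close $fh;

print join(",", @tagarray), "\n";   # prints "an,file,trying,parse"
```

With a real file, the `open` call would take the file name in place of `\$text`.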

You may be better off using an HTML parser to perform this operation. Parsing HTML with regular expressions is fraught with peril. For example, take a look at HTML::TagParser:

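use strict;
use warnings;
use HTML::TagParser;

# new() also accepts a source string instead of a file name or URL
my $html = HTML::TagParser->new(<<'EOF');
This is <tag> an </tag> example
of the <tag> file </tag> that I
am <tag> trying </tag> to <tag> parse </tag>
EOF

# Collect every <tag> element, then extract the text inside each one
my @tags = $html->getElementsByTagName('tag');
my @tagarray = map { $_->innerText() } @tags;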
mwp

> I suspect this has something to do with chomp

No. Without the /g modifier the match returns only the first capture on each line, and you are assigning that single value to a scalar.

Make the regex global (/g) and store the results in an array.

#!/usr/bin/env perl

use strict;
use warnings;
use v5.10;

my $line = "am <tag> trying </tag> to <tag> parse </tag>";
# List context plus /g returns every captured group
my @tags = $line =~ m/<tag>(.*?)<\/tag>/g;
say join ",", @tags;
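
If a tag's content could ever span a line break, matching line by line would miss it. One possible alternative, sketched here under that assumption, is to read the whole file into a single string and apply the same global match with the /s modifier so that `.` also matches newlines. The sample input below is a modified version of the text from the question with one multi-line tag added for illustration; `$text`, `$fh`, and `$content` are hypothetical names.

```perl
use strict;
use warnings;

# Hypothetical input: the second tag spans two lines
my $text = <<'EOF';
This is <tag> an </tag> example
of the <tag> multi
line </tag> that I
am <tag> trying </tag> to <tag> parse </tag>
EOF

open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
# Localizing $/ to undef makes the read return the entire file at once
my $content = do { local $/; <$fh> };
close $fh;

# /g collects every capture; /s lets . match across newlines
my @tags = $content =~ m{<tag>\s*(.*?)\s*</tag>}gs;

# Collapse internal whitespace (including newlines) to single spaces
s/\s+/ /g for @tags;

print join(",", @tags), "\n";   # prints "an,multi line,trying,parse"
```

This reads the whole file into memory, so it suits small to moderate inputs; for anything that is real HTML or XML, a dedicated parser remains the safer choice.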
Quentin