Parsing HTML comments using the Perl split function

Question

I have a split call that splits the strings in a .txt document on whitespace and special characters, and the words are converted to lowercase so that I can count the total number of words in the document. I'm now trying to extend the regular expression so that entire HTML comments, including all words within them, are treated as delimiters, but I can't quite get the updated regex to work correctly.

my @words = split /(?:([_\W\s\d]|(<(\w+)>.*<\/\>)))+/, $text;
 #count strings
  %count = ();
  foreach $word (@words) {
    @count{map lc, @keys} =
    map lc, delete @count{@keys = keys %count};
    $count{$word}++;
  }
   foreach $key (keys %count) {
    print $key, $count{$key};
   }

At present the first character class

 [_\W\s\d]+

works fine, but I can't get the second

 |(<(\w+).*\/\>)+

to function correctly. When the two are used together, the second doesn't work and whitespace is counted as a word. Ideally, the output should split words on whitespace and special characters and also split on HTML comments (effectively ignoring any words between comment tags).

I'm not sure whether I'm able to use two character classes in a split function or not. I'm still getting to grips with regex!

Parsing HTML with a regex is doomed to failure. Please don't do that. Use an HTML parser instead. — Dave Cross, Mar 03 '19 at 20:07
Is it possible, though, if use of a regex is absolutely necessary? — Raznok, Mar 03 '19 at 21:19
Using a regex to "parse" XML, HTML, etc. is [futile](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). See [HTML::TreeBuilder](https://metacpan.org/pod/HTML::TreeBuilder#store_comments) for how to parse an HTML document while retaining the comments, plus [HTML::Element](https://metacpan.org/pod/HTML::Element#Comment-pseudo-elements) for how to access them. — Stefan Becker, Mar 03 '19 at 21:24
I believe that some of the non-standard extensions to Perl regexes mean that it *is* possible. But it would be a horrible, huge, unmaintainable regex and it would take days to develop and test. It is never "absolutely necessary". There are always alternatives. — Dave Cross, Mar 03 '19 at 21:24

score 0 · Answer 1 · answered Mar 03 '19 at 21:29

Since you said you are parsing a .txt document (with embedded HTML comments), you could try Regexp::Grammars. Here is a starting point:

use strict;
use warnings;
use Regexp::Grammars;

my $parser = qr{   
          <nocontext:>
          <words>
          <token: words> (?:(?:<[word]><[separator]>?)|(?:<[separator]><[word]>?))+
          <token: word> <.wordchar>+
          <token: separator> <.comment> | (?:(?:(?!<.comment>)(?!<.wordchar>)).)+
          <token: wordchar> [a-zA-Z]
          <token: comment> \< <.wordchar>+ \> [^<]* \</\>
}sx;

my $fn = 'file.txt';
open ( my $fh, '<', $fn ) or die "Could not open file '$fn': $!";
my $text = do { local $/; <$fh> };
close $fh;

if ($text =~ $parser) {
    for my $word (@{ $/{words}{word} } ) {
        print "'", $word, "'\n";
    }
}
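
If a full grammar is more than you need, a lighter two-step approach may be enough: remove the comments first, then split what is left. The sketch below uses only core Perl and assumes, as the regex in the question implies, that a comment looks like `<tag> ... </>`. The sample text and the `<note>` tag are made up for illustration. Two details are worth noting. First, when the split pattern contains capturing parentheses, Perl also returns the captured separator text as extra fields, which is a likely reason whitespace showed up as a "word". Second, a greedy `.*` runs from the first opening tag to the last `</>` in the text, so the sketch uses the non-greedy `.*?` instead.

```perl
use strict;
use warnings;

# Hypothetical sample input; "<note> ... </>" stands in for the
# comment syntax implied by the regex in the question.
my $text = "Hello world, <note>ignore these words</> hello again 42 times_over";

# Step 1: replace each comment with a single space so the words on
# either side stay separate. The non-greedy .*? stops at the first
# "</>", and /s lets a comment span several lines.
(my $stripped = $text) =~ s{<\w+>.*?</>}{ }gs;

# Step 2: split on runs of non-word characters, digits and underscores.
# The pattern has no capturing parentheses, so split returns only the
# fields between separators, never the separators themselves.
# grep drops the empty leading field that appears when the text
# starts with a separator.
my @words = grep { length } split /[\W\d_]+/, $stripped;

# Step 3: count the words case-insensitively.
my %count;
$count{ lc $_ }++ for @words;

print "$_ $count{$_}\n" for sort keys %count;
```

For the sample text this prints `again 1`, `hello 2`, `over 1`, `times 1` and `world 1`, one per line. The words inside the comment are not counted. This still treats the markup with a regex, so it only holds up while the comments are simple and never nested.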
