Regex to parse html for sentences?

Question

I know that HTML::Parser exists, and from reading around I've realized that parsing HTML with regexes is usually a suboptimal way of doing things. For a Perl class, though, I'm trying to use regular expressions (hopefully just a single match) to identify and store the sentences from a saved HTML document. Eventually I want to calculate the number of sentences, the words per sentence, and hopefully the average word length on the page.

For now, I've just tried to isolate anything that follows a ">" and precedes a ". ", to see what, if anything, it captures. But I can't get the code to run, even when I change the regular expression, so I'm not sure whether the issue is in the regex, somewhere else, or both. Any help would be appreciated!

#!/usr/bin/perl
#new
use CGI qw(:standard);
print header;

open FILE, "< sample.html ";
$html = join('', <FILE>);
close FILE;

print "<pre>";

###Main Program###
&sentences;

###sentence identifier sub###

sub sentences {
@sentences;
while ($html =~ />[^<]\. /gis) {
    push @sentences, $1;
}
#for debugging, comment out when running    
    print join("\n",@sentences);
}

print "</pre>";

Could you mention what error you run into? What do you mean by you can't get the code to run? — gideon, May 22 '13 at 06:11
I wish I could. The only error I receive is a server error (500), which the server gives me for everything from missing statements to incorrect syntax to missing brackets. — koku, May 22 '13 at 06:20

score 3 · Answer 1 · edited May 23 '17 at 10:25

Your regex should be />[^<]*?\./gis

The *? means "match zero or more times, non-greedily". As it stood, your regex would match only a single non-< character followed by a period and a space. This way it will match all non-< characters up to the first period.
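To see the difference the quantifier makes, here is a small sketch run against a made-up fragment. The sample string and variable names are assumptions for illustration, the period is escaped so that it matches a literal dot, and capturing parentheses are added only so the matched text can be printed.

```perl
use strict;
use warnings;

# Hypothetical sample markup, not taken from the question's sample.html.
my $html = '<p>Hi there. Bye now.</p><b>x. </b>';

# Original pattern: exactly ONE non-'<' character, then a period and a space.
my @single = $html =~ />([^<])\. /g;

# With *?: as few non-'<' characters as needed to reach the first period.
my @lazy = $html =~ />([^<]*?)\./g;

print "single: ", join(' | ', @single), "\n";
print "lazy:   ", join(' | ', @lazy),   "\n";
```

With this input the single-character pattern only finds the "x" inside the b element, while the lazy version also picks up "Hi there".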

There may be other problems.

Now read this

answered May 22 '13 at 06:18 · Eli Algranti

score 2 · Answer 2 · answered May 22 '13 at 06:22

A first improvement would be to write $html =~ />([^<.]+)\. /gs: you need to capture the match with the parens, and to allow more than one letter per sentence ;--)

This does not get all the sentences though, just the first one in each element.
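A quick way to check that limitation is to run the pattern over a paragraph holding two sentences. The markup below is a hypothetical sample, not the asker's file.

```perl
use strict;
use warnings;

# Hypothetical sample: the first paragraph holds two sentences.
my $html = '<p>First one. Second one. </p><p>Third. </p>';

# In list context, m//g returns every capture from every match.
my @found = $html =~ />([^<.]+)\. /gs;

print "$_\n" for @found;
```

"Second one" is skipped because it is not preceded by a ">", which is why only the first sentence of each element is captured.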

A better way would be to capture all the text, then extract the sentences from each fragment:

while( $html=~ m{>([^<]*)<}g) { push @text_content, $1; }
foreach (@text_content) { while( m{([^.]*)\.}gs) { push @sentences, $1; } }

(untested because it's early in the morning and coffee is calling)
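For anyone who wants to try the two-step idea, here is a runnable sketch under a few assumptions: the sample markup is made up, whitespace-only fragments are skipped, and leading spaces are trimmed from each sentence. Neither of those last two steps is part of the snippet above.

```perl
use strict;
use warnings;

# Hypothetical sample markup for illustration only.
my $html = '<div><p>First one. Second one.</p><p>Third.</p></div>';

# Step 1: collect every run of text sitting between a '>' and the next '<'.
my @text_content;
while ( $html =~ m{>([^<]*)<}g ) {
    my $text = $1;    # copy $1 before another match overwrites it
    push @text_content, $text if $text =~ /\S/;
}

# Step 2: split each fragment into sentences that end with a period.
my @sentences;
for my $fragment (@text_content) {
    while ( $fragment =~ m{([^.]+)\.}g ) {
        my $sentence = $1;
        $sentence =~ s/^\s+//;    # drop the space left after the previous period
        push @sentences, $sentence;
    }
}

print "$_\n" for @sentences;
```

This prints all three sentences, including "Second one", which a single pass anchored on ">" misses.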

All the usual caveats about parsing HTML with regexps apply, most notably the presence of '>' in the text.

thank you for the explanation, i understand how the regexes are formed much more clearly now! — koku, May 22 '13 at 06:40

score 0 · Answer 3 · answered May 22 '13 at 11:46

I think this does more or less what you need. Keep in mind that this script only looks at text inside p tags. The file name is passed in as a command line argument (shift).

#!/usr/bin/perl

 use strict;
 use warnings;
 use HTML::Grabber;

 my $file_location = shift;
 print "\n\nfile: $file_location";
 my $totalWordCount = 0;
 my $sentenceCount  = 0;
 my $char_count     = 0;

 # Read the whole file into one string.
 open ( my $file, '<', $file_location ) or die "cannot open < file: $!";
 my $contents = do { local $/; <$file> };
 close( $file );
 my $dom = HTML::Grabber->new( html => $contents );

 $dom->find('p')->each( sub{
     my $p_tag = $_->text;

     # Each run of non-whitespace counts as one word.
     ++$totalWordCount while $p_tag =~ /\S+/g;

     # Each run of ., ! or ? ends one sentence.
     ++$sentenceCount while $p_tag =~ /[.!?]+/g;

     # Count non-whitespace characters once per paragraph, on a copy,
     # so the string is not modified while it is being matched.
     ( my $no_space = $p_tag ) =~ s/\s//g;
     $char_count += length $no_space;
 });

 print "\n Total Words: $totalWordCount\n";
 print " Total Sentences: $sentenceCount\n";

 # Guard against division by zero when no sentences or words were found.
 my $rounded = $sentenceCount
     ? sprintf( "%.2f", $totalWordCount / $sentenceCount ) : 0;
 print " Average words per sentence: $rounded.\n\n";

 print " Total Characters: $char_count.\n";
 my $rounded2 = $totalWordCount
     ? sprintf( "%.2f", $char_count / $totalWordCount ) : 0;
 print " Average characters per word: $rounded2.\n\n";
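The counting idioms in the script can be tried on a plain string, without HTML::Grabber installed. The text below is a hypothetical stand-in for what a paragraph's text might look like, and the variable names are made up for this sketch.

```perl
use strict;
use warnings;

# Hypothetical paragraph text for illustration.
my $text = "Perl is fun. Is it? Yes!";

my ( $words, $sentences ) = ( 0, 0 );
++$words     while $text =~ /\S+/g;       # each run of non-whitespace is a word
++$sentences while $text =~ /[.!?]+/g;    # each run of terminators ends a sentence

# Count non-whitespace characters without modifying $text.
my $chars = () = $text =~ /\S/g;

printf "words=%d sentences=%d chars=%d\n", $words, $sentences, $chars;
printf "words per sentence: %.2f\n", $words / $sentences;
printf "chars per word: %.2f\n",     $chars / $words;
```

Note that punctuation counts toward the character total here, as it does when whitespace is simply stripped before taking the length.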
